A review of the application of deep learning in medical image classification and segmentation

Correspondence to: Jingyang Gao. College of Information Engineering and Technology, Beijing University of Chemical Technology, No. 15 East North Third Ring Road, Beijing, China. Email: gaojy@mail.buct.edu.cn; Di Zhao. Institute of Computing Technology, Chinese Academy of Sciences, No. 6 South Road, Zhongguancun Academy of Sciences, Beijing, China. Email: zhaodi@escience.cn.

Received 2019 Sep 16; Accepted 2020 Feb 6. Copyright 2020 Annals of Translational Medicine. All rights reserved.

Open Access Statement: This is an Open Access article distributed in accordance with the Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International License (CC BY-NC-ND 4.0), which permits the non-commercial replication and distribution of the article with the strict proviso that no changes or edits are made and the original work is properly cited (including links to both the formal publication through the relevant DOI and the license). See: https://creativecommons.org/licenses/by-nc-nd/4.0.

DOI: 10.21037/atm.2020.02.44

Abstract

Big medical data mainly include electronic health record data, medical image data, gene information data, etc. Among them, medical image data account for the vast majority of medical data at this stage. How to apply big medical data to clinical practice is an issue of great concern to medical and computer researchers, and intelligent imaging and deep learning provide a good answer. This review introduces the application of intelligent imaging and deep learning to big data analysis and the early diagnosis of diseases, combining the latest research progress in big data analysis of medical images with the work of our team in this field, especially the classification and segmentation of medical images.

Keywords: Big medical data, deep learning, classification, segmentation, object detection

Introduction

Since 2006, deep learning has emerged as a branch of machine learning and come into the public eye. It is a method of data processing that uses multiple layers of complex structures, or multiple processing layers composed of multiple nonlinear transformations (1). In recent years, deep learning has made breakthroughs in computer vision, speech recognition, natural language processing, audio recognition and bioinformatics (2). Deep learning has been praised as one of the top ten technological breakthroughs of 2013 due to its considerable application prospects in data analysis. Deep learning methods simulate the human neural network: by combining multiple nonlinear processing layers, the raw data are abstracted layer by layer, and abstract features at different levels are obtained from the data and used for target detection, classification or segmentation. The advantage of deep learning is that it replaces manually engineered features with efficient algorithms for unsupervised or semi-supervised feature learning and hierarchical feature extraction (3).

Medical care concerns the health of people. The amount of medical data available today is huge, and it is crucial to make good use of it to benefit the medical industry. Despite its volume, medical data still pose many problems: the data are heterogeneous, including images, text, video, magnetic resonance data, etc.; because different equipment is used, data quality varies greatly; the data fluctuate, changing over time and with specific events; and because of individual differences, the laws governing a disease have no universal applicability (4). These problems leave many factors that cannot be handled directly. Medical imaging is a very important part of medical data.

This paper first introduces the application of deep learning algorithms to medical image analysis, expounds the techniques of deep learning classification and segmentation, and introduces the classic and currently mainstream network models. We then detail the application of deep learning to the classification and segmentation of medical images, including fundus, CT/MRI tomography, ultrasound and digital pathology images based on different imaging techniques. Finally, we discuss possible problems and predict the development prospects of deep learning in medical imaging analysis.

Deep learning architectures

Deep learning algorithms

Deep learning has developed into a hot research field with dozens of algorithms, each with its own advantages and disadvantages. These algorithms cover almost all aspects of image processing and focus mainly on classification and segmentation. Figure 1 gives an overview of some typical network structures in these areas.

Figure 1 The typical network structures of deep learning.

Classification

Image classification was the earliest area in which deep learning rose to prominence, and it remains a thriving subject. Among the models used, the convolutional neural network (CNN) is the most widely adopted structure. Since Krizhevsky et al. proposed the CNN-based AlexNet in 2012 (5), which won that year's ImageNet image classification challenge, deep learning has exploded in popularity. In 2013, Lin et al. proposed the network in network (NIN) structure, which uses global average pooling to reduce the risk of overfitting (6). In 2014, GoogLeNet and VGGNet both improved the accuracy on the ImageNet dataset (7,8). GoogLeNet was further developed into v2, v3 and v4 versions to improve performance (9-11). To address CNN's requirement for a fixed input size, He et al. proposed the spatial pyramid pooling (SPP) model to enhance robustness to the input data (12). As deep learning models grew deeper, He et al. proposed the residual network ResNet to address the model degradation that may occur, further advancing deep learning technology (13).

Take AlexNet as an example. In 2012, AlexNet adopted an 8-layer network structure consisting of five convolutional layers and three fully connected layers. Max pooling is performed after the first, second and fifth convolutional layers to reduce the amount of data. AlexNet accepts 227×227-pixel input data. After five rounds of convolution and pooling operations, the resulting 6×6×256 feature matrix is sent to the fully connected layers. The sixth layer, the first fully connected layer, has 4,096 units, and a 4,096-dimensional linear feature vector is obtained with a dropout operation. After the last two layers, we get 1,000 float-type output values, which form the final prediction result. AlexNet's error rate on ImageNet was 15.3%, much lower than the 26.2% of the second-place entry. In addition, it adopted ReLU rather than sigmoid as its activation function and demonstrated that the ReLU function is more effective.
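
To make the layer arithmetic above concrete, here is a minimal PyTorch sketch of the described architecture. It is a simplified reading of Krizhevsky et al., not the exact original implementation (the original also used local response normalization and a two-GPU split, omitted here):

import torch
import torch.nn as nn

class AlexNet(nn.Module):
    # Five convolutional layers + three fully connected layers,
    # assuming a 227x227 RGB input as described in the text.
    def __init__(self, num_classes=1000):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 96, kernel_size=11, stride=4), nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),           # 55x55 -> 27x27
            nn.Conv2d(96, 256, kernel_size=5, padding=2), nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),           # 27x27 -> 13x13
            nn.Conv2d(256, 384, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(384, 384, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),           # 13x13 -> 6x6
        )
        self.classifier = nn.Sequential(
            nn.Dropout(), nn.Linear(256 * 6 * 6, 4096), nn.ReLU(inplace=True),
            nn.Dropout(), nn.Linear(4096, 4096), nn.ReLU(inplace=True),
            nn.Linear(4096, num_classes),                    # 1,000-way output
        )

    def forward(self, x):
        x = self.features(x)        # -> (N, 256, 6, 6)
        x = torch.flatten(x, 1)     # -> (N, 9216)
        return self.classifier(x)

logits = AlexNet()(torch.randn(1, 3, 227, 227))
print(logits.shape)                 # torch.Size([1, 1000])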

VGG16 was first proposed by the Visual Geometry Group (VGG) at the University of Oxford. Compared with AlexNet, it uses several consecutive 3×3 kernels instead of the larger convolution kernels in AlexNet, such as 11×11 and 5×5. For a given receptive field, stacking several small convolution kernels works better than using a single larger kernel, because the additional nonlinear layers increase the network depth and allow more complex patterns to be learned, while the computational cost is smaller.
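
The saving can be checked directly. The following sketch (with an arbitrary channel count of 64) compares the parameter counts of two stacked 3×3 convolutions, whose combined receptive field is 5×5, against a single 5×5 convolution:

import torch.nn as nn

C = 64  # arbitrary channel count for the comparison
stacked = nn.Sequential(                 # two 3x3 convs: 5x5 receptive field
    nn.Conv2d(C, C, 3, padding=1), nn.ReLU(inplace=True),
    nn.Conv2d(C, C, 3, padding=1), nn.ReLU(inplace=True),
)
single = nn.Conv2d(C, C, 5, padding=2)   # one 5x5 conv, same receptive field

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(stacked), count(single))     # 73856 vs. 102464 parameters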

GoogLeNet, which launched in the same year as VGGNet, also achieved good results. Compared to VGGNet, GoogLeNet designed a module called inception. It is a dense structure containing a small number of convolution kernels of each size, and it uses 1×1 convolutional layers to reduce the amount of computation.
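
A minimal sketch of an inception-style module is shown below. The branch channel sizes are illustrative (chosen to match the first inception module of GoogLeNet), and the 1×1 convolutions reduce channels before the expensive 3×3 and 5×5 branches:

import torch
import torch.nn as nn

class Inception(nn.Module):
    # Four parallel branches whose outputs are concatenated channel-wise.
    def __init__(self, in_ch):
        super().__init__()
        self.b1 = nn.Conv2d(in_ch, 64, 1)                   # 1x1 branch
        self.b2 = nn.Sequential(nn.Conv2d(in_ch, 96, 1),    # reduce, then 3x3
                                nn.Conv2d(96, 128, 3, padding=1))
        self.b3 = nn.Sequential(nn.Conv2d(in_ch, 16, 1),    # reduce, then 5x5
                                nn.Conv2d(16, 32, 5, padding=2))
        self.b4 = nn.Sequential(nn.MaxPool2d(3, stride=1, padding=1),
                                nn.Conv2d(in_ch, 32, 1))    # pool projection

    def forward(self, x):
        return torch.cat([self.b1(x), self.b2(x), self.b3(x), self.b4(x)], dim=1)

out = Inception(192)(torch.randn(1, 192, 28, 28))
print(out.shape)   # torch.Size([1, 256, 28, 28]): 64 + 128 + 32 + 32 channels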

Segmentation

Semantic segmentation is an important research field of deep learning. With the rapid development of deep learning technology, excellent semantic segmentation networks have emerged in large numbers and have continuously set the state of the art in various segmentation competitions. After CNN's success in the classification field, people began to try CNNs for image segmentation. Although a CNN can accept images of any size as input, it loses some details while pooling to extract features, and it loses the spatial information of the input image in the fully connected layers at the end of the network. It is therefore difficult for a plain CNN to pinpoint the category to which particular pixels belong. With the development of deep learning technology, segmentation networks based on convolutional structures have been derived.

The fully convolutional network (FCN) (14) proposed by Long et al. is the originator of semantic segmentation networks. It replaces the fully connected layers of the classification network VGG16 with convolutional layers, retains the spatial information of the feature map, and achieves pixel-level classification. Finally, the FCN uses deconvolution and feature-map fusion to restore the image size and provides the segmentation result for each pixel via softmax. Since the densely connected fully connected layers are replaced by convolutional layers, which are locally connected and share weights, the FCN greatly reduces the number of parameters that need to be trained. The performance of the FCN on the Pascal VOC 2012 dataset (15) improved by 20% over the previous best method, reaching a mean IoU of 62.2%.
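
The core idea can be sketched in a few lines. This is a toy illustration, not the actual VGG16-based FCN: a stand-in encoder downsamples the image, a 1×1 convolution replaces the fully connected classifier so the spatial layout is preserved, and a learned transposed convolution (deconvolution) restores the input resolution:

import torch
import torch.nn as nn

num_classes = 21            # Pascal VOC: 20 classes + background
backbone = nn.Sequential(   # toy encoder that downsamples by 8x
    nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(inplace=True), nn.MaxPool2d(2),
    nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(inplace=True), nn.MaxPool2d(2),
    nn.Conv2d(128, 256, 3, padding=1), nn.ReLU(inplace=True), nn.MaxPool2d(2),
)
score = nn.Conv2d(256, num_classes, kernel_size=1)   # "convolutionized" classifier
upsample = nn.ConvTranspose2d(num_classes, num_classes, kernel_size=16,
                              stride=8, padding=4)   # deconvolve back to input size

x = torch.randn(1, 3, 224, 224)
logits = upsample(score(backbone(x)))                # -> (1, 21, 224, 224)
probs = logits.softmax(dim=1)                        # per-pixel class probabilities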

U-Net (16) was proposed by Ronneberger et al. based on the FCN and has been widely used in medical imaging. Building on the FCN's idea of using deconvolution to restore image size and features, U-Net established the encoder-decoder structure in the field of semantic segmentation. The encoder gradually reduces the spatial dimension through successive pooling layers to extract feature information, and the decoder gradually restores the target details and spatial dimensions from that feature information. The encoder's step of gradually reducing the image size is called downsampling, and the decoder's step of gradually restoring the image details and size is called upsampling. Unlike the FCN, which fuses features by direct addition during upsampling, U-Net's upsampling process uses a concatenate operation to splice each decoder feature map with the encoder feature map of the same scale, and the concatenated feature map is then deconvolved. In contrast to conventional convolution, pooling and other operations, this strategy of directly utilizing shallow features is called a skip connection. U-Net adopts this splicing skip-connection strategy to make full use of the encoder's downsampled features during upsampling; applying it to the shallow feature information at all scales achieves a more refined restoration.
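
One decoder step of this scheme might be sketched as follows (channel counts are illustrative): the decoder feature map is upsampled by deconvolution, concatenated with the encoder feature map of the same scale, and then convolved:

import torch
import torch.nn as nn

class UpBlock(nn.Module):
    # One U-Net decoder step: upsample, splice in the skip connection, convolve.
    def __init__(self, in_ch, skip_ch, out_ch):
        super().__init__()
        self.up = nn.ConvTranspose2d(in_ch, out_ch, kernel_size=2, stride=2)
        self.conv = nn.Sequential(
            nn.Conv2d(out_ch + skip_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        )

    def forward(self, x, skip):
        x = self.up(x)                      # restore spatial size (x2)
        x = torch.cat([x, skip], dim=1)     # splice in shallow encoder features
        return self.conv(x)

decoder_in = torch.randn(1, 256, 32, 32)    # deep decoder features
encoder_skip = torch.randn(1, 128, 64, 64)  # matching encoder features
out = UpBlock(256, 128, 128)(decoder_in, encoder_skip)
print(out.shape)                            # torch.Size([1, 128, 64, 64])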

SegNet (17) is a deep semantic segmentation network designed at the University of Cambridge for autonomous driving and intelligent robotics, and it is also based on the encoder-decoder structure. SegNet's encoder and decoder each have 13 convolutional layers; the encoder's convolutional layers correspond to the first 13 convolutional layers of VGG16. The upsampling part of the decoder uses unpooling: SegNet records the positions of the elements selected by max pooling when the encoder downsamples, and restores the image according to this position information when the decoder upsamples. With this strategy, SegNet requires no learning during upsampling, and its training is more accurate and faster than that of the FCN.
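
PyTorch exposes this mechanism directly, so the unpooling strategy can be sketched in a few lines: the encoder's max pooling returns the indices of the maxima, and the decoder's unpooling scatters values back to those recorded positions without any learned weights:

import torch
import torch.nn as nn

pool = nn.MaxPool2d(kernel_size=2, stride=2, return_indices=True)
unpool = nn.MaxUnpool2d(kernel_size=2, stride=2)

x = torch.randn(1, 64, 128, 128)
pooled, indices = pool(x)           # downsample, remembering max locations
restored = unpool(pooled, indices)  # sparse map with values at recorded positions
print(restored.shape)               # torch.Size([1, 64, 128, 128])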

To fuse multi-scale context information at the same level, PSPNet (18) proposes a pyramid pooling structure, which realizes image segmentation that can understand the target's surroundings and addresses a problem the FCN cannot handle effectively: the relationship between global information and the scene. Its pyramid pooling structure aggregates context information from regions of different sizes, thereby improving the ability to obtain global information.
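
A sketch of such a pyramid pooling module follows (bin sizes 1, 2, 3 and 6, as in the PSPNet paper; the channel arithmetic is illustrative): the feature map is pooled to several grid sizes, each compressed by a 1×1 convolution, upsampled back, and concatenated with the original features:

import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidPooling(nn.Module):
    # Pools the feature map to several grid sizes, compresses each with a
    # 1x1 convolution, upsamples back, and concatenates with the input.
    def __init__(self, in_ch, bins=(1, 2, 3, 6)):
        super().__init__()
        self.stages = nn.ModuleList([
            nn.Sequential(nn.AdaptiveAvgPool2d(b),
                          nn.Conv2d(in_ch, in_ch // len(bins), 1))
            for b in bins
        ])

    def forward(self, x):
        h, w = x.shape[2:]
        feats = [x] + [
            F.interpolate(stage(x), size=(h, w), mode='bilinear',
                          align_corners=False)
            for stage in self.stages
        ]
        return torch.cat(feats, dim=1)   # doubles the channel count here

out = PyramidPooling(512)(torch.randn(1, 512, 60, 60))
print(out.shape)                         # torch.Size([1, 1024, 60, 60])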

Deep learning development framework

While deep learning technology has been developing in theory, software development frameworks based on deep learning theory have also been booming.

Convolutional architecture for fast feature embedding (Caffe)

Caffe was born at Berkeley, California, and is now hosted by the Berkeley Vision and Learning Center (BVLC). Caffe features high performance, seamless switching between CPU and GPU modes, and cross-platform support for Windows, Linux and Mac. Caffe has three basic atomic structures, Blobs, Layers and Nets, and its programming framework is implemented around these three atoms. It highly abstracts the structure of deep neural networks in terms of the "Layer", significantly optimizes execution efficiency through elaborate design, and retains flexibility while maintaining an efficient implementation.

Tensorflow

TensorFlow is an open-source software library that uses data flow graphs for numerical computation. Google open-sourced the TensorFlow computing framework on November 9, 2015, and officially released TensorFlow 1.0 in 2017, marking its readiness for production environments. The TensorFlow framework supports the various deep learning algorithms, such as CNNs, RNNs and LSTMs, but its application is not limited to deep learning; it also supports general machine learning. TensorFlow's components are excellent: it provides powerful visualization through TensorBoard, which can generate very informative visual representations of real-world network topologies and performance. It also supports heterogeneous distributed computing, can run on multiple GPUs at the same time, and can automatically run models on different platforms. Because TensorFlow's core is developed in C++, it achieves high performance.

PyTorch

PyTorch is the Python version of Torch, a neural network framework open-sourced by Facebook that specifically targets GPU-accelerated deep neural network programming. Unlike TensorFlow's static computation graph, PyTorch's computation graph is dynamic and can be changed in real time according to computational needs. In January 2017, the Facebook Artificial Intelligence Research (FAIR) team released PyTorch on GitHub, where it quickly topped the trending list. PyTorch attracted widespread attention as soon as it launched and quickly became popular in research.
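
A tiny example makes the dynamic-graph distinction concrete: in the sketch below, the number of loop iterations depends on the data itself, and autograd simply records whatever operations actually ran:

import torch

x = torch.randn(3, requires_grad=True)
y = x * 2
while y.norm() < 100:    # data-dependent control flow, decided at run time
    y = y * 2            # the graph grows with each executed iteration
y.sum().backward()       # autograd replays exactly the operations that ran
print(x.grad)            # gradient reflects the run-time number of doublings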

High-performance computing based on GPU

The key factors in image processing for the medical imaging field are imaging speed, image size and resolution. Owing to hardware limitations, medical images have traditionally been processed sequentially, and this lack of computing resources wastes a great deal of doctors' and patients' valuable time. In recent years, the GPU has made great progress and moved toward general-purpose computing. Its data processing capacity far exceeds that of the CPU, making high-performance computing possible on ordinary computers.

GPU stands for graphics processing unit, a microprocessor that performs image computation on PCs, workstations, game consoles and some mobile devices. In August 1999, NVIDIA released the GeForce 256 graphics chip, codenamed NV10. Its architecture is very different from that of the CPU, and at its birth it was mainly oriented toward rendering graphic images. Like the CPU, the GPU is a processor on the graphics card, designed to perform the complex mathematical and geometric calculations required for graphics rendering. With a GPU, the CPU no longer needs to perform graphics processing and can carry out other system tasks, which greatly improves the overall performance of the computer.

Deep learning for medical imaging analysis

With the development of deep learning, computer vision has used deep learning extensively to deal with various image problems. Medical images, as a special class of visual images, have attracted the attention of many researchers. In recent years, deep learning methods have been adopted for processing and recognizing various types of medical images, including fundus images, endoscopic images, CT/MRI images, ultrasound images and pathological images. At present, deep learning technology is mainly used for classification and segmentation in medical images. Figure 2 shows the main medical application scenarios of deep learning.

Figure 2 Deep learning applications in medical image analysis. (A) Fundus detection; (B,C) hippocampus segmentation; (D) left ventricular segmentation; (E) pulmonary nodule classification; (F,G,H,I) gastric cancer pathology segmentation. The staining method is H&E, and the magnification is ×40.

The classification of medical image

Diabetic retinopathy detection

In the field of deep learning, image classification and its applications have made great progress in recent years. On the one hand, academia has made great efforts to design a variety of efficient CNN models, which have achieved high accuracy and even exceeded human recognition ability. On the other hand, the application of CNN models to medical image analysis has become one of the most attractive directions in deep learning. In particular, the retinal fundus image obtained from the fundus camera has become one of the key research objects for deep learning in the field of image classification.

The main approach to studying related fundus diseases with deep learning techniques is to classify and detect fundus images, for example diabetic retinopathy detection and glaucoma detection. Table 1 lists the deep learning methods applied to fundus image analysis in the past 3 years. These methods mainly use large-scale datasets to train deep CNN models and perform disease classification and detection on fundus images. The deep CNNs have been updated iteratively with the development of deep learning techniques, from the earliest shallow CNN models to deep CNN models and combined models, together with new methods and techniques such as transfer learning and data augmentation.

Table 1

The application of deep networks in fundus detection

Literature | Target task | Network structure | Method introduction | Result
Liskowski et al. 2016 (19) | Vessel segmentation | Deep CNN + complex data preparation | Proposes a supervised segmentation technique that uses a deep neural network, applying the relatively recent machine learning formalism of structured prediction to produce segmentation results | ROC >0.99; classification accuracy >0.97
Fu et al. 2016 (20) | Vessel segmentation | Fully convolutional networks + conditional random fields | Formulates vessel segmentation as a boundary detection problem, using an FCN and CRF to generate a vessel probability map and give a binary classification result | State-of-the-art vessel segmentation performance on the DRIVE and STARE datasets
Dasgupta et al. 2017 (21) | Vessel segmentation | FCN | Formulates the segmentation task as a multi-label inference task, exploiting the implicit advantages of combining CNNs and structured prediction | On the DRIVE dataset: accuracy 95.33%; AUC 0.974
Zhu et al. 2017 (22) | Vessel segmentation | Extreme learning machine (ELM) | Extracts feature vectors from pixels, constructs a matrix from the feature vectors and manual labels, and feeds the matrix to the ELM to obtain the binary retinal vascular segmentation | Average accuracy 0.9607; sensitivity 0.7140; specificity 0.9868
Hu et al. 2018 (23) | Vessel segmentation | Multiscale CNN + CRF + improved cross-entropy loss | Combines a CNN with fully connected CRFs, developing a multiscale CNN with an improved cross-entropy loss function | Competitive sensitivity while ensuring accuracy
Fu et al. 2018 (24) | Optic disc and cup segmentation for glaucoma detection | M-Net | Constructs an image pyramid to achieve multiple receptive field sizes; the U-shaped CNN learns rich hierarchical representations; the side-output layer provides early classification results; a multi-label loss function generates the final segmentation result | State-of-the-art OD and OC segmentation results on the ORIGA dataset

CNN, convolutional neural network; FCN, fully convolutional network.

Our main work on detecting fundus diseases involves transfer learning. It is very difficult to obtain large-scale annotated medical datasets, and transfer learning is an effective method for solving the small-data problem. To find the potential factors linking accuracy and the type of primary model, we performed transfer learning using pre-trained models such as CaffeNet, GoogLeNet and VGG19. The experimental results show that transfer learning based on pre-trained CNN models can be introduced to solve problems in medical image analysis and produces some effective results. Figure 3 shows our fundus classification model.
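
As an illustration of this setup (a generic sketch, not our exact pipeline; the class count of 5 is an assumption, and the weights argument follows recent torchvision versions), a pre-trained VGG19 can be adapted by freezing its convolutional features and retraining a new classification head:

import torch.nn as nn
from torchvision import models

num_classes = 5   # assumed number of fundus disease classes
model = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1)
for p in model.features.parameters():
    p.requires_grad = False                     # keep ImageNet features fixed
model.classifier[6] = nn.Linear(4096, num_classes)   # replace the 1,000-way output layer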