U-Net based Network Applied to Skin Lesion Segmentation: An Ablation Study

Skin cancer is one of the types of cancer that requires an early diagnosis. The segmentation task plays a vital role in computer-aided diagnosis. Segmenting dermoscopic images is challenging for existing methods due to varying image conditions: there is significant variation in color, texture, shape, size, and location of lesions, and the images may also contain lighting variation and various artifacts, such as hair, rulers, air/oil bubbles, and color samples. The Convolutional Neural Network (CNN) model U-Net is widely used to segment dermoscopic images. This work proposes a model based on the U-Net architecture to segment dermoscopic images and presents an ablation study to justify the modifications made to the architecture, such as the number of training epochs, image size, optimization functions, dropout, and the number of convolutional blocks. Experiments carried out on the ISIC 2017 and ISIC 2018 datasets show that, with appropriate parameter adjustments, it is possible to arrive at a simple model capable of presenting results competitive with other state-of-the-art works.


Introduction
Skin cancer is among the 19 most common cancers worldwide, with melanoma and non-melanoma being the main types [1]. Non-melanoma is the most common type of skin cancer, but it is not very aggressive; basal cell carcinoma and squamous cell carcinoma are its most frequent forms. Melanoma is a more aggressive type of skin cancer and, for that reason, has a high mortality rate. The fact that melanoma can spread to other parts of the body (metastasis) further increases the need for an early diagnosis [2]. Its identification at an early stage allows for greater effectiveness in the treatment.
Some clinical methods are used internationally to assist in the diagnosis of melanomas, such as the ABCDE rule [3,4] and the 7-point list [5]. The ABCDE rule is based on the following criteria: (1) the "A" criterion is related to the asymmetry of the lesion. The lesion is divided into two parts along an axis of symmetry; if these parts are not symmetrical, the lesion may be considered suspicious; (2) the "B" criterion concerns the irregularity of the border of the lesion. A benign lesion normally has a regular border, so a lesion with an irregular or ill-defined border may be considered suspicious; (3) for the "C" criterion, the number of apparent colors in the lesion is significant. A wide range of colors in the lesion, such as red, white, black, and shades of blue-gray, can make the lesion suspicious; (4) in the "D" criterion, the diameter of the lesion is observed. Lesions with a diameter greater than or equal to 6 millimeters are considered suspicious; and (5) the "E" criterion concerns the evolution of the lesion, i.e., a lesion should be investigated if any modification to its structure is noted [6]. For a diagnosis using a Computer-Aided Diagnostic (CAD) system [11], the "B" criterion of the ABCDE rule, which checks for irregularities at the edge of the lesion, can be analyzed from the result of the image segmentation process. The segmentation process separates the image into parts to identify regions of interest and, from these regions, obtain relevant information. The automatic diagnosis methods proposed in recent years that have shown the best results for skin lesion segmentation problems use deep learning techniques [8]. However, a disadvantage of such techniques is related to training: they require many images, and acquiring dermoscopic images and having them labeled by a specialist is not a simple task. Furthermore, Graphics Processing Units (GPUs) are needed to decrease the processing time.
For this reason, the International Skin Imaging Collaboration (ISIC) [12] provides a database containing dermoscopic images for the skin lesion segmentation task.
In a previous version of this paper [13], we proposed a method based on a U-Net with modifications to its original architecture, such as increasing the number of network layers and adjustments to the optimization and activation functions. This version extends the discussion with an ablation study of the number of training epochs, image size, optimization functions, dropout, and the number of convolutional blocks. Regarding image size, we explore the advantages and disadvantages of working with high-resolution images. We also evaluate the dropout values and their placement in the different convolutional blocks. Section 2 presents a literature review of methods used to perform image segmentation; both traditional methods and machine learning methods are discussed. Section 3 details the U-Net model, covering the proposed adjustments, and discusses the steps used to perform the dermoscopic image segmentation task. Section 4 discusses the results of the proposed model; an ablation study is carried out to verify the model's effectiveness concerning its parameters. Section 5 presents a brief conclusion of the studies carried out.

Related Works
Image segmentation is an essential step in skin lesion image analysis, as it separates the diseased area from the healthy region. For a diagnosis using a CAD system, the segmentation task is fundamental. There are several computational methods in the literature to perform the segmentation task. These methods can be classified as conventional methods and methods based on deep learning [14]. A conventional method like K-means was applied by Alvarez et al. [15], Agarwal et al. [16], and Garg et al. [17]. In Agarwal et al. [16], the authors proposed an approach divided into three steps: (1) preprocessing; (2) segmentation, in which the K-means method is applied using two clusters, one representing the lesion and the other the background (normal skin); the resulting grouping is submitted to an intensity-based threshold to generate a binary image; and (3) post-processing, using the K-means output. Bibi et al. [18] also used conventional methods to perform the segmentation. A contrast enhancement is first applied to the image, followed by a transformation of the color channels to Hue, Saturation, Value (HSV). The method then selects the best channel to generate a lesion map, and finally the map is converted to binary.
Methods based on deep learning have demonstrated greater precision in the segmentation task in the literature. Chen et al. [19] proposed a multitasking Convolutional Neural Network (CNN) architecture, including the FeatureNet, SegNet, and ClsNet architectures, to perform feature extraction, segmentation, and classification, respectively. A feature passing module was proposed to transmit information between the segmentation and classification modules, improving the performance of both tasks. An approach using an encoder-decoder technique for pixel-wise semantic segmentation based on SegNet is proposed in Youssef et al. [20]. As in the SegNet network [21], each decoder layer is connected to a corresponding encoder layer. The network is trained from scratch, with no fine-tuning or transfer learning. The dropout technique is applied to each convolutional layer of the encoder to mitigate overfitting. Finally, a softmax layer is used for the output, containing three classes: lesion, skin, and artifacts.
Some authors [22,23,24] performed the lesion segmentation task in dermoscopic images and evaluated the U-Net network on the ISIC 2018 database. Abder-Rahman et al. [22] compared a supervised approach with an unsupervised one and analyzed the U-Net network with a "SELU" activation function, which, according to them, enabled an improvement in the results. Other authors [24] proposed a two-stage approach: in the first stage, the segmentation of the skin lesion is performed using the U-Net; in the second, the edge is extracted from the segmented image using the FuzzEdge method. Codella et al. [25] proposed a U-Net-like architecture to segment the skin lesion. In their proposal, the images are fed to the network with six color channels: Red, Green, Blue (RGB) and Hue, Saturation, Value (HSV). More recently, Arora et al. [26] proposed an 11-layer CNN with two segmentation models. In Arora et al. [27] and Tong et al. [28], the authors presented proposals inserting an Attention Gate (AG) module, a mechanism inspired by human perception, i.e., more attention is paid to a certain region while less important details demand less attention. Arora et al. [27] modified the U-Net network by inserting Group Normalization (GN) layers in the encoder and decoder paths to normalize the features, which were then activated by a ReLU activation unit. The bottleneck of the proposed architecture is composed of dilated convolutions, and the AG focuses attention on skip-connection details.
Hasan et al. [29] proposed a hybrid CNN that has three feature extractor modules. The segmentation, rebalancing, and augmentation tasks were performed as pre-processing. Finally, DermoExpert's weights were passed to a web application. The authors made the code and segmented masks available on GitHub and evaluated the model's performance on three different bases: ISIC 2016, ISIC 2017, and ISIC 2018. Wu et al. [30] proposed a feature adaptive transformer network named FAT-Net based on the classical encoder-decoder architecture, and also presented an ablation study. The proposal was evaluated on the ISIC 2016, ISIC 2017, ISIC 2018, and PH2 datasets. Khouloud et al. [31] proposed an architecture based on a double encoder-decoder named W-net for the segmentation task. The W-net is the junction of a ResNet encoder-decoder and a ConvNet encoder-decoder. Its performance was also evaluated on the ISIC 2016, ISIC 2017, ISIC 2018, and PH2 datasets. Mohakud et al. [32] proposed a Fully Convolutional Encoder-Decoder Network (FCEDN). Instead of manually defining the parameters, its hyperparameters were optimized by an Exponential Neighborhood Gray Wolf Optimization (EN-GWO) algorithm. Finally, Kaur et al. [33] proposed a two-step system. The first step, pre-processing, consists of a hair removal module, which, according to the authors, improved the general performance. The second step segments the skin lesion using a CNN based on an encoder-decoder architecture.
Based on the excellent performance of the U-Net network in medical image segmentation tasks, this work proposes a convolutional neural network based on U-Net [34]. We apply some modifications to its original architecture, such as increasing the number of layers in the network and adjusting the optimization and activation functions. Furthermore, we insert dropout layers to achieve better generalization. An ablation study is also carried out, presenting some of the experiments performed to justify the use of specific parameters. Finally, the ISIC 2017 [35] and ISIC 2018 [36] databases were used to evaluate the network considering the accuracy and Jaccard metrics, and the results achieved are compared with other research works in the literature.

Proposed model
The architecture proposed in this study is presented in Table 1. Our proposal is based on a U-Net architecture formed by the following blocks: an encoder, a bottleneck, and a decoder. In the encoder path, that is, on the left side of the network, regular convolutions and max-pooling are applied. In this step, the dimensions of the images are reduced (x and y axes) while their depth (z axis) gradually increases. For example, for a 256 × 256 × 3 input image, at the end of the encoding there is an 8 × 8 × 128 feature map. This process enables the network to extract relevant information about the image. The layers that make up each encoder block are described below:
• 2D convolution, with a 3 × 3 kernel and "ReLU" activation function;
• 2D convolution, with a 3 × 3 kernel and "ReLU" activation function;
• Max-pooling 2D with pool size (2, 2) and strides (2, 2).
We add a dropout layer with rate 0.2 in "block 4".
In the decoder path, the image size (x and y axes) increases and the depth (z axis) decreases, both gradually. Through the application of up-sampling and skip connections, concatenating the output of the convolution performed in the encoder with the same level in the decoder, the image information is retrieved and the image is reconstructed. The decoder block consists of the following layers:
• Up-sampling 2D with size 2 × 2;
• Concatenation between the 'us' and 'skip' layers, where 'us' is the output of the up-sampling layer and 'skip' is the output of the convolution at the same level in the encoder;
• 2D convolution, with a 3 × 3 kernel and "ReLU" activation function;
• 2D convolution, with a 3 × 3 kernel and "ReLU" activation function.
Between the encoder and decoder, we have the bottleneck block, formed by two convolutional layers:
• 2D convolution, with a 3 × 3 kernel and "ReLU" activation function;
• 2D convolution, with a 3 × 3 kernel and "ReLU" activation function.
A final convolutional layer is applied over the final feature map, with the following parameters:
• 2D convolution layer with one filter, kernel size 1 × 1, and "Sigmoid" activation function.
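The blocks described above can be sketched in Keras (the framework used in this work). This is a minimal illustration, not the authors' exact implementation: the filter counts are assumptions chosen so that a 256 × 256 × 3 input reaches 8 × 8 × 128 at the end of the encoder, as in the example given, with dropout placed in "block 4" as described.

```python
from tensorflow.keras import layers, Model

def conv_block(x, filters):
    # Two 3x3 convolutions with ReLU, as in each encoder/decoder block
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    return x

def build_unet(input_shape=(256, 256, 3)):
    inputs = layers.Input(shape=input_shape)
    skips, x = [], inputs
    # Encoder: five blocks; depth grows while spatial size halves each block
    for i, f in enumerate([8, 16, 32, 64, 128]):
        x = conv_block(x, f)
        skips.append(x)
        if i == 3:                      # dropout of 0.2 in "block 4"
            x = layers.Dropout(0.2)(x)
        x = layers.MaxPooling2D(pool_size=(2, 2), strides=(2, 2))(x)
    # Bottleneck: two convolutional layers
    x = conv_block(x, 256)
    # Decoder: up-sampling, then concatenation with the skip at the same level
    for f, skip in zip([128, 64, 32, 16, 8], reversed(skips)):
        us = layers.UpSampling2D(size=(2, 2))(x)
        x = layers.Concatenate()([us, skip])
        x = conv_block(x, f)
    # Final 1x1 convolution with sigmoid produces the 256x256x1 mask
    outputs = layers.Conv2D(1, 1, activation="sigmoid")(x)
    return Model(inputs, outputs)
```

Tracing the shapes confirms the example in the text: the input is halved five times (256 → 8) along the encoder, and the decoder restores it to a 256 × 256 mask with one channel.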
In the present work, some modifications to the U-Net network were proposed. One of the modifications is related to its structure. The original U-Net [34] architecture has four blocks in the encoder path and four blocks in the decoder path. Experiments carried out with shallower and deeper networks showed that the network with four blocks is sensitive to noise found in the images. Including one more block in both the encoder and the decoder made the network less sensitive; therefore, the proposed architecture has five blocks on both paths. We use the Sigmoid function as the activation function of the last convolutional layer of the network. The other parameters of the original structure were kept.
Other important changes made to the network concern the optimization and loss functions. For the optimization function, we use Adam [37] with epsilon = 2e−4 and beta_1 = 0.5, and for the loss function, binary cross-entropy. After these adjustments to the network parameters, the proposal significantly improved the segmentation result. In Table 2, the original U-Net is compared to the U-Net with the proposed adjustments. The network with its adjustments was trained for 100 epochs with a batch size of 1. Finally, Table 1 presents, in more detail, the general structure of the proposed network. For each block, the applied operation and the input and output size of the layer in question are shown. In the decoder, the concatenation is represented as a layer, and information about the layers to be concatenated is given in all blocks. At the end of the network, as the output of the last convolutional layer, a mask of size 256 × 256 pixels with a single color channel is generated.
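The training configuration above can be expressed as a short Keras compilation sketch. The stand-in model below is a placeholder for the proposed network; only the optimizer parameters (epsilon = 2e−4, beta_1 = 0.5) and the binary cross-entropy loss come from the text.

```python
from tensorflow.keras import layers, models
from tensorflow.keras.optimizers import Adam

# Tiny stand-in model; in the paper this would be the modified U-Net
model = models.Sequential([
    layers.Input(shape=(256, 256, 3)),
    layers.Conv2D(1, 1, activation="sigmoid"),
])

# Adam with the adjusted parameters reported in the text,
# paired with binary cross-entropy as the loss function
model.compile(optimizer=Adam(epsilon=2e-4, beta_1=0.5),
              loss="binary_crossentropy",
              metrics=["accuracy"])

# Training then runs for 100 epochs with batch size 1, e.g.:
# model.fit(x_train, y_train, epochs=100, batch_size=1)
```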

Dermatoscopic Image Database
In the present work, the images were extracted from the public sets of the ISIC 2017 [35] and ISIC 2018 [38,39] challenges. The ISIC 2017 image set includes 2,750 dermoscopic images, with 2,000 images for the training stage, 150 images for validation, and 600 images for testing. For all these images, the corresponding ground truths are available.
The ISIC 2018 image set contains 3,694 dermatoscopic images: 2,594 skin lesion images are available for training, 100 for validation, and 1,000 for testing. The ground truths of the validation and test images are not available; therefore, in the present work, only the 2,594 training images were used and the others were discarded.
In both sets, the images were collected in leading international clinical centers with a dermoscopic device and submitted to annotations and markings by dermatologists. The markings, which gave rise to the ground truths, include global dermoscopic characteristics that discriminate types of lesions. Some examples of the images, available in both the ISIC 2017 and ISIC 2018 bases, can be seen in Figure 2; below each image, its corresponding ground truth is shown.

Image Pre-processing
Images in their original form are often not ready to be analyzed. The presence of artifacts in the images can harm the performance of the segmentation task, and, therefore, treatment is necessary to improve image quality [40]. The dermoscopic images of the ISIC 2017 and 2018 bases have different sizes and high resolution, which can considerably increase processing time. To decrease processing time, the images were resized to 256 × 256 × 3 pixels and the ground truths to 256 × 256 × 1. Normalization was performed, changing the pixel range from [0, 255] to [0, 1]. The data augmentation technique, which produces alternative data from the existing dataset [41], is essential when training a neural network for the segmentation task on a database containing few images [42]. A 90° rotation and horizontal and vertical flips were applied to the resized images.
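The normalization and augmentation steps above can be sketched with NumPy. Resizing to 256 × 256 is assumed to have been done beforehand with an image library; here a random array stands in for a resized image, and the function names are illustrative.

```python
import numpy as np

def preprocess(image):
    # Normalize pixel values from [0, 255] to [0, 1]
    return image.astype(np.float32) / 255.0

def augment(image):
    # Produce the augmented variants described in the text
    return [np.rot90(image),   # 90-degree rotation
            np.fliplr(image),  # horizontal flip
            np.flipud(image)]  # vertical flip

# Random stand-in for a resized 256x256 RGB dermoscopic image
image = np.random.randint(0, 256, (256, 256, 3), dtype=np.uint8)
normalized = preprocess(image)
variants = augment(normalized)
```

The same flips and rotation would be applied to the corresponding ground-truth mask so that image and mask stay aligned.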

Implementation
The proposed algorithm for the segmentation task was implemented with Keras based on TensorFlow. The experiments were carried out on a Dell computer with an Intel(R) Core™ i7-3770 3.40 GHz CPU, 16 GB of memory, and an Nvidia GeForce GT 640 GPU with 1 GB of memory, running the CentOS Linux 7 operating system.

Evaluation Metrics
In the present work, the accuracy, Jaccard, sensitivity, specificity, and Dice coefficient metrics were adopted, calculated from the following information obtained from the confusion matrix:
• True Positive (TP): lesion pixels correctly predicted as lesion;
• False Positive (FP): skin pixels incorrectly predicted as lesion;
• True Negative (TN): skin pixels correctly predicted as skin;
• False Negative (FN): lesion pixels incorrectly predicted as skin.
Thus, accuracy is the proportion of correctly predicted cases, both true positives and true negatives. Equation (1) represents the accuracy metric:

Accuracy = (TP + TN) / (TP + TN + FP + FN) (1)

The Jaccard index is a metric commonly used to assess the detection of an object. It can be defined as the ratio of the intersection over the union of the segmented image predicted by the network and the ground truth; the larger the overlap, the better the Jaccard value. The Jaccard index is defined by Equation (2):

Jaccard = TP / (TP + FP + FN) (2)

The Dice coefficient is a metric used to measure the similarity of two samples. Equation (3) defines the Dice coefficient:

Dice = 2TP / (2TP + FP + FN) (3)
Sensitivity, also known as the true positive rate, is the proportion of genuinely positive samples (lesion pixels) that are correctly predicted: Sensitivity = TP / (TP + FN).
Specificity, also known as the true negative rate, is the proportion of genuinely negative samples (skin pixels) that are correctly predicted: Specificity = TN / (TN + FP).
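The five metrics follow directly from the four confusion-matrix counts; a minimal sketch:

```python
def metrics(tp, fp, tn, fn):
    # Pixel-wise metrics computed from confusion-matrix counts,
    # matching Equations (1)-(3) and the definitions above
    return {
        "accuracy":    (tp + tn) / (tp + tn + fp + fn),  # Eq. (1)
        "jaccard":     tp / (tp + fp + fn),              # Eq. (2)
        "dice":        2 * tp / (2 * tp + fp + fn),      # Eq. (3)
        "sensitivity": tp / (tp + fn),                   # true positive rate
        "specificity": tn / (tn + fp),                   # true negative rate
    }

# Illustrative counts (not from the paper's experiments)
m = metrics(tp=50, fp=10, tn=30, fn=10)
```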

Experimental Results
A neural network model needs to generalize what was learned in training to the test set. One of the biggest obstacles to generalization was the difficulty of obtaining a sufficiently large image base. The approach usually applied for databases with many images divides the dataset into training, validation, and testing subsets. For databases that contain few images, such as ISIC 2017 and ISIC 2018, the model may not generalize as expected. Thus, to minimize the impact of the small number of images, the data augmentation technique [41] was applied.
From ISIC 2017 and ISIC 2018, we used 2,750 and 2,594 images, respectively. The datasets were divided with 80% of the images for training and 20% for testing. Thus, the ISIC 2017 base was left with 2,200 images for training and 550 for testing, and the ISIC 2018 base with 2,075 images for training and 519 for testing. We ran the experiments five times and report the mean value.
In the proposed model, the number of blocks in the contraction and expansion paths, the optimizer, the loss function, the application of techniques such as data augmentation, dropout, and normalization, and the function used in the activation layer (both in the internal layers of the network and in the last layer) were tested. Input image size, batch size, and base splitting were also evaluated. We finally arrived at the configuration shown in Table 3.
In Table 4, we can see the results achieved by the proposed model on the ISIC 2017 image dataset, along with the results of some methods in the literature that performed the segmentation task, including the U-Net, considering the accuracy and Jaccard index evaluation metrics. Analyzing the accuracy metric, the model obtained 0.949, a better performance compared to Liu et al. [43], which obtained 0.930. Furthermore, the proposed model surpassed the results presented in Goyal et al. [44], Vesal et al. [45], and the original U-Net, implemented in this work according to Ronneberger et al. [34], falling only 0.001 behind the result of Li and Shen [46]. Observing the results for the Jaccard index, our model obtained 0.833, a higher result than those proposed in Liu et al. [43], Goyal et al. [44], Vesal et al. [45], and Li and Shen [46]. (For reference, Table 4 reports Li and Shen [46] at 0.950 accuracy and 0.753 Jaccard, the original U-Net [34] at 0.877 and 0.325, and the proposed model at 0.949 and 0.833.)
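The 80/20 split described above can be checked with a few lines; the helper name is illustrative, and the image totals come from the text.

```python
def train_test_counts(total, train_frac=0.8):
    # Split a dataset of `total` images into train/test counts
    train = round(total * train_frac)
    return train, total - train

isic2017 = train_test_counts(2750)   # (2200, 550)
isic2018 = train_test_counts(2594)   # (2075, 519); 2594 * 0.8 = 2075.2, rounded
```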

Experimental Results for 2018 ISIC Base
In Table 5, we can see the results achieved in the experiments performed, in terms of the accuracy and Jaccard values obtained. The resulting values were compared to methods in the literature that propose a solution to the skin lesion segmentation problem using machine learning techniques and test on the ISIC 2018 database. The proposed model achieved an accuracy of 0.954 and a Jaccard of 0.850. The model proposed in this article, even being simpler, surpassed in accuracy the method proposed by [47], which reached 0.935; the authors in [47] applied the Shades of Gray method in pre-processing and used an ensemble of two neural networks for segmentation, VGG-UNet and DeeplabV3. Our model also presents a better Jaccard index than the model proposed by [48], in which the authors evaluated a proposal also based on U-Net, called C-UNet, applying a Dice fine-tuning technique.
Figures 3 and 4 present some segmentation results produced by the proposed model, containing images with different levels of difficulty for the segmentation task, resulting from the evaluation of the ISIC 2018 and ISIC 2017 bases, respectively. The first column shows the original images acquired by a specialist with a dermatoscope. The second shows the ground truth representing the segmentation defined by specialists in the area. The third column shows the mask predicted by the original U-Net network, and the fourth the mask predicted by the proposed modified U-Net. In some images the network segments very well, but in others it fails, not correctly segmenting the lesion. This is due to the different shapes of the lesions, which may indicate the need for a larger number of training images.

Ablation Study for U-Net Parameters
To verify the effectiveness of the model regarding its parameters, an ablation study was carried out. Initially, the model's performance is compared considering different numbers of epochs. The experimental results are reported in Table 6, where it can be seen that up to epoch 100 there is an improvement in the Jaccard value and the Dice coefficient; after that, the network does not improve significantly on either metric.
The databases used in the experiments are composed of high-resolution images, so resizing to reduce the size of the images to be processed is necessary. Tests were performed with images of 128 × 128, 160 × 160, 256 × 256, and 512 × 512 pixels, as can be seen in Table 7. Based on these experiments, the image size was set to 256 in the final experiments. The experiments show that it is unnecessary to work with larger images, thus reducing the computational cost and the need for powerful computers; we conclude that high-resolution images introduce more noise into the training process.
Table 8 shows the performance of the experiments using various optimization functions. Observing the results, it is clear that the Adam function with the adjustments to its parameters favors the network in reaching the best result found during training. Table 9 presents some tests performed considering a dropout layer; for comparison, the first row of the table shows the performance without dropout. As can be seen, the Jaccard variation across the experiments is minimal, even when testing different values in different blocks. Experiments carried out with different numbers of blocks show that the U-Net with five blocks in the encoder and five blocks in the decoder presents a better result; these data can be seen in Table 10. Finally, we have the experiments carried out to find the best activation and loss functions.
The results regarding the activation functions are shown in Table 11, from which it is concluded that the ReLU function favors the network in the lesion segmentation task. As for the loss function, shown in Table 12, the binary cross-entropy function yields a better result compared to the Poisson loss function. Other loss functions were tested but discarded for presenting unsatisfactory results.

Conclusion
We presented a U-Net-based approach for skin lesion segmentation in this work. The model was evaluated on the ISIC 2017 and ISIC 2018 dermoscopic image databases. Modifications were proposed to improve generalizability, and some of the tests performed are demonstrated in an ablation study. Several optimization functions were tested; the Adam function, with some parameter adjustments, enabled the network to obtain a better result. The loss function that achieved the best result was binary cross-entropy; other functions were also tested but showed lower performance. Furthermore, tests were performed on the network with variations in its number of layers, concluding that five blocks in the encoding path and five blocks in the decoding path made the network less sensitive to noise. The impact of the dropout layer in different situations was also evaluated. Comparing the proposed model with models found in the literature, it is recognized that there are still adjustments to be made, but the results demonstrate the feasibility and potential of the proposed model. The small number of images in the datasets was a limitation; as these are medical images, the difficulty of building a sufficiently large database is understandable. Another parameter that impacts this problem is the image size: higher-resolution images introduce noise into the training process, since minor variations become more evident.