A weakly supervised machine learning model to extract features from microscopy images


DISCLAIMER: This blog is fully automated, unmonitored, and does not reflect the views of Garett MacGowan. The ML model may produce content that is offensive to some readers.


This blog post was generated with a (potentially) real article title as the only prompt. A link to the original article is below.


Photo by Sharon Pittaway on Unsplash

Generated: 5/16/2022
A weakly supervised machine learning model to extract features from microscopy images and predict the location of breast cancer in three-dimensional space.

1. Background {#sec1}

The diagnosis of breast cancer can be challenging for radiologists and pathologists due to the large size and irregular, multi-nodular architectural patterns of tumors. The diagnostic procedure can be time consuming, as it requires extensive knowledge and experience coupled with close attention to the patient's age, menopausal status, and other co-morbid conditions. In fact, misdiagnosis of breast carcinoma is common [1, 2]. Consequently, to reduce the time spent and to prevent expensive, invasive surgery, there is high demand for non-invasive diagnostic tools.

Image-based analysis methods hold the potential to improve breast cancer diagnosis [3], especially when high-resolution images can be acquired from non-invasive sources [4]. Although advanced image analysis methods have delivered better classification performance [5], these models depend on a pre-defined set of features and on labeled training data, which are not always readily available. It is therefore challenging to extract image features independently of prior knowledge and to predict the location of a tumor in three-dimensional space without external knowledge such as a breast atlas. Weakly supervised methods have shown the potential to address these challenges using only a few labeled training sets [6, 7, 8, 9]. These methods usually generate a label for a class (e.g., tumor) in advance and then find a set of informative image features.

An important issue in machine learning is class imbalance, which arises because large sets of training samples for a given disease are rare. For instance, in breast cancer imaging, the sample sizes for the cancerous and non-cancerous (benign) classes are usually very small [10]. To address this issue in image classification, it is common either to downscale images so that they match the size of the training samples or to use sampling methods such as oversampling [11]. Although these methods reduce the class imbalance problem, the effect of down-sampling has been shown to depend on the feature extraction method used [12]. Furthermore, these approaches can increase feature extraction time and add cost in labor and time. To overcome these challenges, we propose a weakly supervised machine learning framework that extracts features from a small number of training samples and predicts the spatial location of the tumor. The framework is simple yet effective, saving both time and cost.
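To make the random-oversampling baseline mentioned above concrete, here is a minimal pure-Python sketch; `oversample_minority` is a hypothetical helper name, not part of the proposed framework, which deliberately avoids this kind of resampling:

```python
import random

def oversample_minority(samples, labels, seed=0):
    """Randomly duplicate samples of the smaller class(es) until
    every class has as many samples as the largest one."""
    rng = random.Random(seed)
    by_class = {}
    for s, y in zip(samples, labels):
        by_class.setdefault(y, []).append(s)
    target = max(len(items) for items in by_class.values())
    out_samples, out_labels = [], []
    for y, items in by_class.items():
        # Pad the class with random duplicates up to the target count.
        padded = items + [rng.choice(items) for _ in range(target - len(items))]
        out_samples.extend(padded)
        out_labels.extend([y] * target)
    return out_samples, out_labels
```

Duplicating minority samples balances the class counts but adds no new information, which is one reason the text argues for a different sampling strategy.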

Most existing weakly supervised methods use a patch-based strategy that requires a few labeled images. Such methods downsample the original high-resolution images to generate a set of low-resolution patches, find the optimal class in each patch, and apply non-maximum suppression to obtain a set of bounding boxes. However, this approach significantly reduces the resolution of the final results. It is also time-consuming to down-sample high-resolution images from roughly $50\,\text{cm} \times 50\,\text{cm}$ to $1\,\text{cm} \times 1\,\text{cm}$, mainly because each patch in such inputs covers a wide area, so computing the optimal class (i.e., tumor) and a tumor's boundaries can take a long time [13]. This also prevents the method from being used in real-time, online settings.
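The non-maximum suppression step used by these patch-based methods can be sketched in plain Python; the function names and the `(x1, y1, x2, y2)` box convention are assumptions for illustration, not the paper's implementation:

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Greedily keep the highest-scoring box and drop any remaining
    box that overlaps a kept box by more than the threshold."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    kept = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in kept):
            kept.append(i)
    return kept
```

Because every candidate patch must be scored before suppression, the cost grows with image size, which matches the runtime concern raised above.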

A solution to this problem is to adopt a new strategy: generate a large set of labeled samples, such as patches, and then select a bounding box for each sample instead of using small patches. With this approach, the training samples are more robust and the process remains simple. In this work, we generate a set of samples of the breast structure, together with their corresponding bounding boxes, for training a weakly supervised model by employing a fast-searching algorithm instead of oversampling methods such as random sampling or other patch-based approaches [14]. The labeled samples are then used to estimate the boundaries of a tumor.

The contributions of this work include the following:

- A new, robust sampling strategy for estimating the boundaries of a tumor from low-resolution images. Samples of the breast are generated with a new approach and then used to estimate the tumor boundaries through a weakly supervised machine learning framework.
- A novel and effective algorithm that generates a set of bounding boxes of the breast structure; the location of the tumor is then predicted with the trained model.

2. Methods {#sec2}

In this section, we first describe the method used to generate high-resolution patches for a sample of a breast structure, as well as the generation of the whole-image sample. We then briefly present the proposed weakly supervised machine learning algorithm.

2.1. Breast image samples {#sec2.1}

A sample of a breast is used to extract a set of high-resolution patches that cover the entire breast structure. Such samples help us estimate the location of the tumor by predicting the tumor boundaries with an estimated margin of error, defined as the minimum distance between the boundary of the tumor and the boundary of the tissue labeled with a given class. Determining this distance is challenging, since locating the tumor in three-dimensional space is not straightforward. Moreover, this margin of error depends on the amount of tissue labeled as tumor (an image sample with less tissue will have a larger margin of error) and on the size of the images.

The sampling strategy starts by pre-processing an input image (cf. Figure 1). The strategy uses a $12 \times 12$ sliding window with a shift of 1 pixel to extract a set of patches $P_i$, $i = 1, \ldots, 12$, from the input images; these patches are then scaled to $128 \times 128$ so that patch-based methods can be applied. The location of the tumor is computed if the tumor lies in the center (i.e., location $14$), and the margin of error is defined as the minimum distance between the tumor margin and the boundaries of the sample.

Figure 1. Example of a pre-processed image before applying the oversampling technique: the images in *a)* have different sizes and resolutions. The resulting image after applying the patch-based oversampling technique in *b)* has the same size and resolution as the images in *a)*.
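The sliding-window extraction described above can be sketched in plain Python; the window size and 1-pixel stride follow the text, while the image representation (a 2-D list of pixel values) and the function name are assumptions for illustration:

```python
def sliding_window_patches(image, window=12, stride=1):
    """Extract every window x window patch from a 2-D image
    (list of rows), sliding by `stride` pixels per step."""
    h, w = len(image), len(image[0])
    patches = []
    for top in range(0, h - window + 1, stride):
        for left in range(0, w - window + 1, stride):
            # Slice out the rows, then the columns, of the current window.
            patches.append([row[left:left + window]
                            for row in image[top:top + window]])
    return patches
```

With a 1-pixel stride, an $H \times W$ image yields $(H - 12 + 1)(W - 12 + 1)$ overlapping patches, each of which would then be rescaled to $128 \times 128$.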

2.2. Oversampling techniques {#sec2.2}

The goal of the oversampling techniques is to obtain a sample that is as close as possible to what a $128 \times 128$ patch of the original image would look like. Note that the obtained sample has the same size as the original image, while the patch resolutions are $16 \times 16$ and $8 \times 8$.

A classical technique for oversampling is image down-sampling followed by nearest-neighbor down-sampling [15]. However, these methods cause pixel-level information loss, which decreases the accuracy of the estimated tumor location. A better and more robust alternative is the patch-based approach, which generates a set of $16 \times 16$ patches and assigns each pixel to the closest patch, without discarding any pixel (cf. Figure 1).
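Nearest-neighbor resampling, the classical baseline just mentioned, can be sketched as follows; the function name and list-of-lists image format are illustrative assumptions:

```python
def nearest_neighbor_resize(image, new_h, new_w):
    """Nearest-neighbor resampling: each output pixel copies the
    closest source pixel, so no new intensity values are invented
    -- but detail is simply dropped when shrinking."""
    h, w = len(image), len(image[0])
    return [[image[min(h - 1, int(r * h / new_h))]
                  [min(w - 1, int(c * w / new_w))]
             for c in range(new_w)]
            for r in range(new_h)]
```

When shrinking, whole source pixels are skipped, which is exactly the pixel-level information loss the text attributes to this method.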

After applying an edge-finding step for local maxima and minima, a set of $128 \times 128$ bounding boxes is created for each patch (i.e., $B_i$). The process is similar to the patch-generation steps: pre-process the input image, then create patches from it. After pre-processing, however, $16 \times 16$ bounding boxes can be obtained by computing the maximum and minimum values in the $x$ and $y$ directions. These boxes can then be used to generate $16 \times 16$ samples within an image sample through grid sampling combined with a nearest-neighbor scheme.
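Computing a bounding box from the minimum and maximum coordinates in the $x$ and $y$ directions, as described above, can be sketched like this; the binary-mask input and the `mask_bounding_box` name are assumptions for illustration:

```python
def mask_bounding_box(mask):
    """Bounding box (x_min, y_min, x_max, y_max) of the nonzero
    pixels in a 2-D binary mask, via the min/max coordinates in
    the x and y directions. Returns None for an empty mask."""
    xs = [x for row in mask for x, v in enumerate(row) if v]
    ys = [y for y, row in enumerate(mask) if any(row)]
    if not xs:
        return None
    return (min(xs), min(ys), max(xs), max(ys))
```

In the pipeline, such a box would be computed per patch and then resampled onto the sampling grid with the nearest-neighbor scheme.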

2.3. The weakly supervised patch-classification algorithm {#sec2.3}

In weakly supervised learning, the training samples are labeled with their corresponding class. For image classification tasks, the data need to be partitioned so that non-cancer training samples are separated from cancer samples; this can be done with a classifier such as a random forest [16].
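The weak-labeling step implied above, where each patch inherits the label of its source image before a classifier such as a random forest is trained, can be sketched in plain Python; the function names and the pluggable `patcher` callable are illustrative assumptions:

```python
def weak_patch_labels(images, image_labels, patcher):
    """Propagate each image-level label to all patches extracted
    from that image -- the weak-supervision step that produces
    training pairs for a downstream classifier."""
    patches, labels = [], []
    for img, y in zip(images, image_labels):
        for p in patcher(img):
            patches.append(p)
            labels.append(y)
    return patches, labels
```

The resulting patch/label pairs are noisy (not every patch of a cancerous image contains tumor tissue), which is precisely the kind of noise the classifier is expected to tolerate.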

Garett MacGowan

© Copyright 2023 Garett MacGowan.