Skip to content

ACM40960/project-shivsucd

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

44 Commits
 
 
 
 
 
 
 
 

Repository files navigation

Prediction of Breast Cancer Incidence

Python TensorFlow Contributions welcome

Welcome to the Prediction of Breast Cancer Incidence project repository! This project leverages machine learning and deep learning techniques to enhance the accuracy of breast cancer detection. By analyzing two distinct datasets— the Wisconsin Breast Cancer dataset and the BreakHis400x Image Dataset, this project aims to classify breast tumor images and predict malignancy based on cell nuclei characteristics.

Note: It is recommended to view this README in light mode for better graph visibility.

Table of Contents 📑

Project Overview

Breast cancer remains a leading cause of mortality among women globally, making early detection crucial for effective treatment. This project is designed to develop and evaluate machine learning models that accurately classify breast cancer images and predict malignancy using the Wisconsin Breast Cancer and BreaKHis 400X datasets.

By analyzing the visual characteristics of cell structures and applying deep learning techniques, this project aims to enhance early detection capabilities and support medical diagnosis.

Project Highlights

  • Multi-Model Approach: Various machine learning models including Random Forest, Logistic Regression, K-Nearest Neighbors (KNN), and Neural Networks on the Wisconsin dataset have been evaluated and compared to obtain the best performing model.
  • Deep Learning with DenseNet: The DenseNet CNN architecture has been used for classifying images from the BreakHis400x dataset as benign or malignant.

Exploratory Data Analysis:

  • Detailed Exploratory data analysis on the Wisconsin Dataset has been performed as shown below:

    Correlation Heatmap

    Scatter Plots of Symmetry, Fractal Dimension, and Concavity

    Perimeter, Smoothness, and Texture measures by Diagnosis

    Distribution Plot of the Target Variable

  • Detailed Exploratory data analysis on the BreaHis400x Dataset has been performed as shown below:

    Non-Cancerous Cell Image

    Non-Cancerous Cell Image

    Cancerous Cell Image

    Cancerous Cell image

    Distribution plot for the Images

    Distribution of Images Before data Augmentation

  • Data Augmentation : As observed in the above picture, the benign class is heavily imbalanced, which could create a bias in training of the images towards malignancy. In an effort to avoid this, data augmentation is peformed on the benign images by flipping and rotating the benign images existing, and increase the number of images to ensure that the augmented images can aid in better classification. The dataset after the balancing of the dataset is as follows:

Distribution plot for the Images after Data Augmentation

Distribution of Images After data Augmentation

  • Scalability: The project is designed to be scalable for future improvements, including the integration of more diverse datasets and the exploration of ensemble methods.

Installation

To get started with this project, clone the repository with the following code:

git clone https://github.com/ACM40960/project-shivsucd.git
cd Prediction of Breast Cancer Incidence- Shivani-23200782

Create a Virtual Environment with Python verison 3.12.

To create a virtual environment on Windows OS in Anaconda Command Prompt, you can use the following commands:

conda create -n breast_cancer_pred python==3.12
conda activate breast_cancer_pred

Install the dependencies with the following command:

pip install -r requirements.txt

Usage

After installation, you can run the python notebooks for each dataset:

  1. Wisconsin Dataset:

    • Open the Breast_Cancer_classification_wisconsin.ipynb notebook.
    • Execute all the cells to analyze the Wisconsin dataset using multiple machine learning models.
  2. BreaKHis 400X Dataset:

    • Open the Breast_cancer_Classification_DenseNet.ipynb notebook.
    • Execute all the cells to use DenseNet201 to classify breast cancer images.

You can run these notebooks in Jupyter Notebook or any other compatible environment.

Datasets

  • Wisconsin Breast Cancer Dataset: Contains numeric features derived from fine needle aspirates of breast masses, used for predicting breast cancer diagnosis.
    • Features include radius, texture, and perimeter of cell nuclei.
  • BreaKHis 400X Dataset: High-resolution microscopic images of breast tumor tissues categorized into benign and malignant classes.
    • Used to train deep learning models for distinguishing between benign and malignant breast cancer based on visual characteristics.

Model Architecture

  • DenseNet201: A deep convolutional neural network used for classifying images in the BreakHis400x dataset. The model achieved a training accuracy of 96% and a test accuracy of 87.45%.

    Training and Validation Accuracy

  • Random Forest, Logistic Regression, KNN, Neural Networks: Various models evaluated on the Wisconsin dataset, with Random Forest emerging as the best performer with an accuracy of 96.49%.

Results

  • DenseNet201: Demonstrated strong classification capabilities on the BreaKHis dataset with a training accuracy of 96%, precision of 0.91, recall of 0.94, and an F1 score of 0.97.

    Confusion Matrix:

    Confusion Matrix

Evaluation figures:

Accuracy Plot

Loss Accuracy

ROC-AUC Plot

Classification Report

  • Random Forest: Outperformed other models on the Wisconsin dataset with an accuracy of 96.49%, F1 score of 0.952, Precision of 97.56%, Recall of 93.02%, and a balanced accuracy of 95.8%.

Comparative Model Metrics:

Radar Chart of all the metrics

Evaluation Metrics Table:

Styled Table of all metrics

Conclusion and Future Work

This project demonstrates the potential of machine learning and deep learning models in accurately predicting breast cancer. The DenseNet-based CNN achieved a commendable training accuracy of 96.4% on the BreaKHis dataset, and the Random Forest model emerged as the best-performing model on the Wisconsin dataset with an accuracy of 96.49%. These results highlight the efficacy of advanced algorithms in distinguishing between benign and malignant cases, providing valuable support in medical diagnosis.

However, the variability in model performance across different datasets emphasizes the importance of dataset-specific model selection and further model refinement for clinical applications.

Future work could focus on integrating more diverse datasets and exploring ensemble methods to enhance model robustness and generalization.

Contributing

Contributions are welcome! If you have ideas for improving this project or want to add more models or datasets, feel free to submit a pull request or open an issue.

License

This project is licensed under the MIT License - see the LICENSE file for details.

References

  1. Arnold, M., Morgan, E., Rumgay, H., Mafra, A., Singh, D., Laversanne, M., Vignat, J., Gralow, J. R., Cardoso, F., Siesling, S. et al. (2022). Current and future burden of breast cancer: Global statistics for 2020 and 2040, The Breast 66: 15–23.

  2. Huang, J., Chan, P. S., Lok, V., Chen, X., Ding, H., Jin, Y., Yuan, J., Lao, X.-q., Zheng, Z.-J. and Wong, M. C. (2021). Global incidence and mortality of breast cancer: a trend analysis, Aging (Albany NY) 13(4): 5748.

  3. Hussain, S., Ali, M., Naseem, U., Nezhadmoghadam, F., Jatoi, M. A., Gulliver, T. A. and Tamez-Pe˜na, J. G. (2024). Breast cancer risk prediction using machine learning: a systematic review, Frontiers in Oncology 14: 1343627.

  4. Muller, F. M., Li, E. J., Daube-Witherspoon, M. E., Vanhove, C., Vandenberghe, S., Pantel, A. R. and Karp, J. S. (2024). Deep learning denoising for low-dose dual-tracer protocol with 18f-fgln and 18f-fdg in breast cancer imaging, Annual Meeting of the Society of Nuclear Medicine and Molecular Imaging.

  5. FA Spanhol, LS Oliveira, C. Petitjean and L. Heutte, "A Dataset for Breast Cancer Histopathological Image Classification," in IEEE Transactions on Biomedical Engineering, vol. 63, no. 7, pp. 1455-1462, July 2016, doi: 10.1109 / TBME.2015.2496264.

  6. Machine Learning Algorithms For Breast Cancer Prediction And Diagnosis: Mohammed Amine Naji , Sanaa El Filali , Kawtar Aarika , EL Habib Benlahmar , Rachida Ait Abdelouhahid , Olivier Debauche.

Data Sources

Authors

  • Shivani - 23200782: MSc Data and Computational Science, UCD, Dublin

Demonstration

  • A Streamlit application is deployed that allows you to perform Image Analysis and Cell Nuclei Analysis.

    • To perform Image Analysis, double click on Image Analysis Button and it will take you to the Image upload section.
    • Download test images from : Images.
    • Upload the benign or malignant image from the Test folder above, and you will see the prediction at the bottom of the page.
    • Scroll back up to the page, and click on the X button just below the upload button to erase the uploaded image, and upload more images, or double click on the home button to return home.
    • To Perform Cell Nuclei Measurements predictions on the Wisconsin dataset, double click on Cell Nuclei Analysis. Use the slider to slide and adjust the value of the measurements shown, and observe the change in the prediction on the page. The model being used in the Random Forest Model, thus selected as the best model, upon analysis.
  • Please find the link for the Streamlit Application to test the functionality here: Prediction of Breast Cancer Incidence

Poster

Please review the poster presented at the University for the Poster Presentation here:Poster.

About

project-shivsucd created by GitHub Classroom

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published