Spark Based Framework for Breast Cancer Analysis
10 Pages Posted: 27 Feb 2018
Date Written: November 15, 2017
Breast cancer is the second most common cancer worldwide and accounts for one-fourth of all cancers in women. In many countries, breast cancer causes more deaths than other kinds of disease. Early identification of a breast tumor improves the chance of a cure; therefore, extensive research is under way to find techniques that can detect breast cancer in its initial phases. The healthcare sector holds a tremendous amount of important data about patients and their health conditions. Hence, there is a pressing need to put that data to use so that medical practitioners can predict the disease. Numerous researchers have approached this problem with Machine Learning (ML) techniques, improving the prediction process by applying different tree-based classifiers. However, most tree-based ML algorithms cannot handle huge amounts of complex data. This work addresses that issue by running efficient tree-based classifiers (Decision Tree, Random Forest classifier, Gradient Boosting classifier) on the Apache Spark framework. The experiments use the Wisconsin Breast Cancer Dataset (WBCD) from the UCI repository. The experimental results show that the Random Forest classifier outperformed the other two tree-based classification algorithms in most cases in this study.
Keywords: Breast Cancer Wisconsin diagnostic dataset, Apache Spark, Big data in Healthcare, Decision Tree, Random Forest, Gradient Boosting Classifier
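For readers unfamiliar with tree-based classifiers, the sketch below illustrates the split-selection step at the heart of a decision tree, the building block of the ensembles named in the abstract. It is a minimal, single-machine illustration in plain Python, not the paper's Spark implementation: the function names `gini` and `best_threshold` are hypothetical, the toy values are made up and not taken from WBCD, and Gini impurity is only one common criterion (the abstract does not state which one the experiments used).

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum(p_k ** 2)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_threshold(values, labels):
    """Find the threshold t for the split `x <= t` with the lowest
    size-weighted Gini impurity. Returns (threshold, weighted_impurity);
    threshold is None if no split improves on the unsplit data."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (None, gini(labels))
    for i in range(1, n):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # identical values cannot be separated by a threshold
        t = (lo + hi) / 2  # candidate threshold: midpoint of neighbours
        left = [y for x, y in pairs if x <= t]
        right = [y for x, y in pairs if x > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = (t, score)
    return best


# Made-up toy feature values and labels for illustration only (NOT WBCD data).
sizes = [1.0, 2.0, 3.0, 7.0, 8.0, 9.0]
diagnosis = ["benign", "benign", "benign",
             "malignant", "malignant", "malignant"]

threshold, impurity = best_threshold(sizes, diagnosis)
print(threshold, impurity)
```

A full decision tree applies this search recursively over every feature, and a Random Forest or Gradient Boosting classifier combines many such trees. A distributed framework such as Apache Spark is relevant because this search becomes expensive as the number of rows and features grows.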