Project Report Major Project
PROJECT REPORT
ON
SUBMITTED BY
PRIYANSHI HEMRAJANI | NAMAN POKHARNA
MAYANK MEHRANIYA | PIYUSH SONI | DAKSH SONI
UNDER GUIDANCE
OF
PROF. DR. R.K. SOMANI
DEAN
SCHOOL OF ENGINEERING AND TECHNOLOGY
SESSION: (2023-24)
AUTHOR’S DECLARATION
I hereby declare that the work presented in the Project Report entitled
“Deeplung: Predictive Modelling for Lung Cancer Risk Assessment”, in partial fulfillment
of the requirements for the award of the degree of Bachelor of Technology in Computer
Science Engineering, submitted to the Department of Computer Science Engineering, Sangam
University, is a record of my own investigations, carried out under the guidance of
Prof. Dr. R.K. Somani, Dean of the School of Engineering and Technology, Sangam University,
Bhilwara, Rajasthan, India.
I have not submitted the matter presented in this dissertation anywhere for the award of
any other degree.
Countersigned by
Prof. Dr. R.K. Somani
Dean
Department of Computer Science & Engineering
School of Engineering & Technology
Sangam University, Bhilwara (Raj.)
CERTIFICATE
I feel great pleasure in certifying the project entitled “Deeplung: Predictive Modelling
for Lung Cancer Risk Assessment”, carried out by Priyanshi Hemrajani, Naman Pokharna,
Mayank Mehraniya, Piyush Soni, and Daksh Soni under the supervision of Prof. Dr.
R.K. Somani. I recommend the submission of the project.
Date: ………………….
Sign ………………………………………
(Dr. Vikas Somani)
Head of Department of Computer Science & Engineering,
School of Engineering & Technology,
Sangam University, Bhilwara
ACKNOWLEDGEMENT
This Dissertation would not have been successful without the guidance and support of a large
number of individuals.
Prof. Dr. R.K. Somani, Dean of the School of Engineering and Technology and my
dissertation supervisor, believed in me from the initial stages of my dissertation work.
Over a long period, he provided insightful commentary during our regular meetings and was
consistently supportive of my proposed research directions. I am honored to be his first
graduate student. His consistent support and intellectual guidance inspired me to pursue
innovative new ideas. I am glad to have worked under his supervision.
I am grateful to Dr. Vikas Somani, Head of the Department of Computer Science Engineering,
for his excellent support during my dissertation work. Dr. Awanit Kumar, Coordinator of
the Bachelor of Technology (Computer Science) program, spent many hours listening to my
concerns, working with me to navigate the bureaucracy, and assisting me with my most
important decisions.
Thanks to all my friends and classmates for their love and support; I have enjoyed their
company so much during my stay at the university. I would like to thank all those who made
my stay an unforgettable and rewarding experience.
Last, but not least, I would like to thank my parents for supporting me in every way to
complete my degree.
ABSTRACT
Lung cancer remains a significant global health challenge, often diagnosed at advanced
stages with limited treatment options, leading to high morbidity and mortality rates. This
project focuses on developing a machine learning-based predictive model to assess an
individual's risk of developing lung cancer. Leveraging diverse input factors including
smoking habits, environmental pollutants, genetic predisposition, occupational hazards, and
health parameters, the aim is to create a robust and accurate predictive tool.
The project objectives encompass comprehensive data collection, preprocessing, and feature
engineering to extract crucial insights from datasets sourced from reliable medical records
and research databases. Various machine learning algorithms, including logistic regression,
decision trees, random forests, and neural networks, are employed to build predictive models.
Hyperparameter tuning and ensemble methods are utilized to enhance model performance
and robustness.
The models are rigorously evaluated using cross-validation techniques and diverse evaluation
metrics to ensure reliability and generalizability. Interpretability techniques are applied to
explain model predictions, facilitating user trust and understanding, particularly among
healthcare professionals. Ethical considerations regarding patient data privacy and
compliance with regulations are strictly adhered to throughout the project lifecycle.
The ultimate goal is to create a user-friendly predictive tool that aids in early detection,
personalized risk assessment, and targeted interventions for lung cancer. This model has the
potential to significantly impact public health initiatives by informing preventive measures,
policy changes, and resource allocation strategies to mitigate the burden of lung cancer. The
project aims to contribute to advancements in predictive analytics applied to healthcare while
striving to improve patient outcomes and reduce the societal and economic impact of this
devastating disease.
Author’s Declaration
Certificate
Acknowledgements
Abstract
Abbreviations
Contents
List of Figures
2. Review of Related Literature
2.1. Existing Research and Studies
2.2. Research Gap
6. References
7. Appendix
7.1. Technical Details and Additional Graphs/Charts
7.2. Supplementary Information
1.1. Objectives
2. Feature Engineering:
Feature Selection: Identifying the most relevant features that significantly
contribute to lung cancer risk using techniques like correlation analysis,
feature importance ranking, etc.
Feature Transformation: Normalizing, scaling, or transforming features to
ensure uniformity and enhance model performance.
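As a sketch of these two steps (correlation-based ranking and scaling) in Python — the helper names and synthetic data are illustrative assumptions, not the project's actual code:

```python
import numpy as np

def rank_features_by_correlation(X, y):
    """Rank feature indices by absolute Pearson correlation with the target."""
    corrs = [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])]
    return sorted(range(X.shape[1]), key=lambda j: corrs[j], reverse=True)

def min_max_scale(X):
    """Scale each feature to [0, 1] so all inputs share a uniform range."""
    mins, maxs = X.min(axis=0), X.max(axis=0)
    span = np.where(maxs > mins, maxs - mins, 1.0)  # avoid divide-by-zero
    return (X - mins) / span
```

In practice the top-ranked indices would be kept as the selected features; scikit-learn's `SelectKBest` and `MinMaxScaler` offer equivalent, battle-tested versions of the same ideas.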
3. Model Development:
Machine Learning Algorithms: Exploring various algorithms like logistic
regression, decision trees, random forests, support vector machines, neural
networks, etc., to build and compare predictive models.
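A comparison of this kind might be sketched as follows with scikit-learn; the synthetic dataset stands in for the real risk-factor data, and nothing here is the project's actual code:

```python
# Compare several candidate classifiers on the same train/test split.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the real risk-factor dataset.
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=42),
    "random_forest": RandomForestClassifier(random_state=42),
}
# Fit each model and record its held-out accuracy for comparison.
scores = {name: m.fit(X_train, y_train).score(X_test, y_test)
          for name, m in models.items()}
```

The same loop extends naturally to SVMs and neural networks; only the dictionary entries change.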
8. Continuous Improvement:
Model Updating: Establishing a framework for continuous model
improvement with new data and emerging research to enhance accuracy and
relevance over time.
Feedback Mechanism: Creating a mechanism to receive feedback from
healthcare professionals and users for ongoing refinement.
Conclusion:
The scope of this study encompasses a comprehensive and interdisciplinary approach
involving data collection, preprocessing, model development, evaluation, ethical
considerations, deployment, and continuous improvement. The ultimate goal is to
create a reliable, accurate, and user-friendly predictive tool for assessing an
individual's risk of developing lung cancer.
Personalized Healthcare: Each individual's risk factors for lung cancer can
vary significantly. By considering diverse input factors such as smoking
habits, environmental exposures, genetic predisposition, and health history, the
model aims to provide personalized risk assessments. This personalized
approach allows for tailored interventions and recommendations specific to an
individual's risk profile, enhancing the effectiveness of preventive measures.
In summary, the motivation behind creating a lung cancer prediction model lies in its
potential to revolutionize early detection, personalize healthcare interventions,
positively impact public health policies, advance research methodologies, enhance
patient outcomes, and alleviate the societal and economic burdens associated with
lung cancer.
8. Resource Constraints:
Computational Resources: Availability of computational power and
infrastructure required for processing large datasets and training complex
models.
Budget and Time Constraints: Limitations in funding and time could affect
the extent of data collection, model development, and validation processes.
Understanding and addressing these limitations and constraints are crucial for
managing expectations, ensuring ethical compliance, and developing a model that is
both effective and practical for real-world application.
Data Privacy and Ethics: Ensure compliance with ethical standards, patient
confidentiality, and data protection regulations throughout the project.
Model Deployment: Deploy the finalized model in a suitable environment,
considering integration into healthcare systems or making it accessible
through a user-friendly interface for healthcare professionals.
Genetic Predisposition: Certain genetic factors and family history play a role
in predisposing individuals to lung cancer. Genetic markers and family history
data will be considered to assess genetic susceptibility.
Objectives:
Outcome:
“An evaluation of machine learning classifiers and ensembles for early stage prediction of
lung cancer” (M.I. Faisal): This research paper delves into the realm of predictive modeling
using statistical and machine learning techniques, emphasizing their significance across
various domains like software fault prediction, spam detection, disease diagnosis, and
financial fraud identification. Recognizing the critical role of predicting lung cancer
susceptibility in guiding effective treatments, the study aims to assess different predictors'
effectiveness in enhancing lung cancer detection efficiency based on symptomatic data.
Multiple classifiers—such as Support Vector Machine (SVM), C4.5 Decision Tree, Multi-
Layer Perceptron, Neural Network, and Naïve Bayes (NB)—are rigorously evaluated using a
benchmark dataset sourced from the UCI repository.[1]
"Lung cancer classification tool using microarray data and support vector machines" (G.
Salano): This study introduces an innovative system that harnesses gene expression data from
oligonucleotide microarrays. Its primary goal is threefold: predict the presence or absence of
lung cancer, identify the specific subtype if present, and pinpoint marker genes linked to the
particular lung cancer type. The proposed system serves as a promising tool for expedited
diagnosis and complements existing lung cancer classification methods.[2]
S. Jondhale, “Lung cancer detection using image processing and machine learning in
healthcare”: Lung cancer remains a leading cause of mortality in India, necessitating
advanced diagnosis and detection methods. With the elusive nature of its causes, early
detection becomes paramount for successful treatment. This research focuses on a lung
cancer detection system employing image processing and machine learning techniques to
classify the presence of lung cancer in CT images and blood samples. CT scan images,
known for their efficacy compared to Mammography, are used to classify patients' images as
normal or abnormal. [6]
M. A. Yousuf, "Detection of Lung cancer from CT image using Image Processing and Neural
network": Lung cancer detection in its premature stages is a focal point of research due to its
critical impact on patient outcomes. The proposed system is designed as a two-stage process
aimed at detecting lung cancer in its early phases, employing a series of steps encompassing
image acquisition, preprocessing, binarization, thresholding, segmentation, feature extraction,
and neural network-based detection. The system begins by inputting lung CT images,
subsequently undergoing preprocessing via various image processing techniques. In the first
stage, a Binarization technique is applied to convert the image into a binary format, followed
by comparison with a predefined threshold value to identify potential lung cancer regions.
The second stage involves segmentation to isolate the lung CT image, and a robust feature
extraction method is employed to capture critical features from the segmented images. [7]
J. M. Hollywood, "A new technique for improving the sharpness of pictures": This research
focuses on a technique known as "crispening" designed to enhance the apparent picture
definition in the CBS color-television system. The method utilizes nonlinear circuitry to
modify the apparent rise time of an isolated step input applied to a bandwidth-limited system.
The principle behind crispening involves adding a second waveform, representing the
difference between the desired and original waveforms, to a slow transition waveform. This
addition aims to create a narrower "spike" shape, superimposed on the original waveform,
effectively reducing the rise time by about half.[11]
“The creation of a method for the early detection and accurate diagnosis of lung cancer that
makes use of CT, PET, and X-ray images” by Manasee Kurkure and Anuradha Thakare in
2016 has garnered a significant amount of attention and enthusiasm. A genetic algorithm that
permits the early identification of lung cancer nodules allows the diagnostic findings to be
optimized. Both Naive Bayes and a genetic algorithm were employed in order to classify the
various stages of cancer images properly and swiftly while circumventing the intricacy of
the generation process. The categorization has an accuracy rate of up to eighty percent [18].
Ashwini Kumar Saini et al. (2016) provide a summary of the types of noise that can corrupt
lung cancer imagery and the strategies for removing them. Because lung cancer is considered
one of the most life-threatening kinds of cancer, it is essential that it be detected in its
earlier stages; its high incidence and mortality rates are further indications of how
dangerous the disease is. The quality of digital X-ray image analysis must be significantly
improved for such studies to be successful. A pathology diagnosis in a clinic continues to
be the gold standard for detecting lung cancer, even though one of the primary focuses of
current research is on reducing the amount of image noise. Chest X-rays, cytological
examinations of sputum samples, optical fiber investigations of the bronchial airways, and
finally computed tomography (CT) and magnetic resonance imaging (MRI) scans are the
diagnostic tools utilized most frequently in the detection of lung malignancies. Despite
the availability of screening methods like CT
Neural ensemble-based detection (NED) is the name given to the automated method of illness
diagnosis suggested in Kureshi et al.'s research [21]. The suggested approach has three main
components: feature extraction, classification, and diagnosis. In this experiment, the chest
X-ray films taken at Bayi Hospital were utilized. This method is recommended because it has
a high identification rate for needle biopsies in addition to a decreased number of false
negative identifications. As a result, the accuracy is improved automatically, and lives
are saved [22].
Kulkarni and Panditrao [23] have created a novel algorithm for early-stage cancer
identification that is more accurate than previous methods. The program makes use of
image-processing technology, and elapsed time is one of the factors considered while looking
for anomalies in the target images. The position of the tumor can be seen quite clearly in
the original image. To obtain improved outcomes, watershed segmentation and Gabor filtering
are applied at the preprocessing stage. The extracted region of interest yields three
features that are helpful in recognizing the various stages of lung cancer: eccentricity,
area, and perimeter. It has been revealed that the tumors come in a variety of dimensions,
and the proposed method is capable of providing precise measurements of the size of the
tumor at an early stage [21].
Westaway et al. [24] used a radiomic approach to extract three-dimensional properties from
lung cancer images in order to provide predictive information. Classifiers were devised to
estimate how long a patient is likely to survive. The CT scan images for the experiment were
obtained from the Moffitt Cancer Center in Tampa, Florida. Based on the properties of the CT
images, which may suggest phenotypes, human analysis may be able to generate more accurate
predictions. When a decision tree was used to make the survival predictions, seventy-five
percent of the outcomes could be accurately forecast [23].
Chaudhary and Singh [25] described a lung cancer detection method that uses image processing
to categorize CT (computed tomography) images of lung cancer. Several approaches, including
segmentation, preprocessing, and feature extraction, have been investigated, and the authors
treat segmentation, augmentation, and feature extraction each in its own section. In Stages
I, II, and III, the cancer is contained within the chest and manifests as larger, more
invasive tumors; by Stage IV, the cancer has spread to other parts of the body [24].
From the provided research summaries, several potential research gaps or areas for
further exploration might be identified:
Noise Reduction and Image Enhancement Techniques: While these studies touch
upon noise reduction in medical imaging, there might be room to delve deeper into
advanced noise reduction and image enhancement techniques specifically tailored for
dynamic medical images like cineangiograms. Investigating more robust algorithms
could lead to better image quality and more accurate boundary detection.
Automated Boundary Detection: Despite the sophisticated edge detection methods
mentioned, there could be scope for developing more automated and efficient
algorithms to detect boundaries accurately, particularly in cases of low-contrast
regions or images affected by noise. This could involve exploring machine learning or
deep learning techniques for improved segmentation and boundary detection.
Real-time Processing and Analysis: Expanding research on real-time processing of
dynamic medical images, such as cineangiograms, might be valuable. Developing
systems that can process and analyze images in near-real-time during medical
procedures could aid clinicians by providing immediate feedback and guidance.
Clinical Validation and Standardization: While the mentioned research shows
promising results compared to radiologist-detected boundaries, further clinical
validation across a larger and more diverse dataset could be beneficial. Additionally,
establishing standardized protocols and benchmarks for evaluating the accuracy and
reliability of image processing systems in clinical settings could enhance their
adoption.
Integration of Multiple Imaging Modalities: Exploring the integration of data from
various imaging modalities (e.g., MRI, CT scans) alongside cineangiograms could
provide a more comprehensive understanding of cardiac structures and functions. This
integration might offer richer diagnostic insights and improve the accuracy of disease
detection.
User Interface and Clinical Adoption: Investigating user-friendly interfaces and
system integration into clinical workflows could bridge the gap between research and
practical clinical application. Ensuring ease of use and seamless integration of these
systems into existing medical practices is crucial for their widespread adoption.
Designing the system architecture for a machine learning-based predictive model for lung
cancer risk assessment involves several components and considerations. Here's a high-
level overview of the system architecture and design model for such a project:
System Architecture:
1. Data Collection and Preprocessing:
Data Sources: Gather data from diverse sources such as medical records, surveys,
public databases, and research studies containing information on smoking habits,
environmental pollutants, genetic predisposition, occupational hazards, and other
relevant parameters.
Data Preprocessing Pipeline: Develop a robust pipeline for cleaning, formatting,
encoding, and standardizing data. This includes handling missing values, outlier
detection, and feature scaling.
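Such a pipeline might be sketched with scikit-learn as follows; the column contents and imputation strategy are illustrative choices, not the project's actual configuration:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# A minimal cleaning pipeline: fill missing values, then scale features.
preprocess = Pipeline([
    ("impute", SimpleImputer(strategy="median")),  # handle missing values
    ("scale", StandardScaler()),                   # feature scaling
])

# Tiny illustrative matrix: two features, with some values missing.
X_raw = np.array([[1.0, 200.0],
                  [np.nan, 220.0],
                  [3.0, np.nan]])
X_clean = preprocess.fit_transform(X_raw)
```

Outlier handling (e.g. clipping extreme values) would be added as a further pipeline step in the same style.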
Design Model:
1. Sequential Model:
The process flow might follow a sequential pattern, starting from data collection,
preprocessing, feature engineering, model development, validation, interpretation, and
finally, deployment.
2. Modular Design:
Modularize different components of the system architecture for easier maintenance and
scalability. Modules might include data ingestion, preprocessing, feature engineering,
model training, validation, and deployment.
3. Feedback Loop:
Conclusion:
The system architecture and design model for a machine learning-based predictive model
for lung cancer risk assessment should emphasize data quality, model performance,
interpretability, scalability, security, and ethical considerations. It should be flexible
enough to adapt to evolving data and healthcare needs while delivering accurate risk
assessments and actionable insights for early intervention and personalized preventive
measures.
Prediction Models
The prediction problem is formulated as binary classification. The hospitalization at which
cancer occurred was used as the class label: if diagnosed with cancer, a patient was
assigned to the positive class ('1'); otherwise, the patient was placed in the negative
class ('0'). We experimented with two different RNN models. These models are advantageous
for sequence data, especially when one data point depends on the preceding one, as in our
case, because they keep a memory that stores the states of previous inputs in order to
construct the sequence's subsequent output. This mechanism is also known as a hidden state.
The following equations explain the learning process:
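The equations themselves did not survive in this copy of the report; the standard recurrent hidden-state update that the surrounding text describes (a reconstruction, with $x_t$ the input at step $t$, $h_t$ the hidden state, $f$ and $g$ activation functions, and the $W$ matrices and $b$ vectors learned parameters) has the form:

```latex
h_t = f\left(W_{xh} x_t + W_{hh} h_{t-1} + b_h\right), \qquad
\hat{y}_t = g\left(W_{hy} h_t + b_y\right)
```

The LSTM and GRU variants discussed below replace the plain update $f$ with gated combinations of the previous state and a candidate state.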
The first model contains layers with LSTM units capable of learning long-term
dependencies in sequential data. Remembering information for long periods is practically
their default behavior. The second model has layers with GRUs. Unlike the LSTM unit,
the GRU has gating units that modulate information flow without having separate memory
cells [38]. This structure allows the model to adaptively capture dependencies from long
data sequences without discarding information from earlier parts of the sequence.
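The gating described above can be sketched as a single GRU step in NumPy; the weight names and shapes are illustrative assumptions (bias terms omitted for brevity), not the experiment's implementation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, W_z, U_z, W_r, U_r, W_h, U_h):
    """One GRU update: gates decide how much of the old state survives."""
    z = sigmoid(x_t @ W_z + h_prev @ U_z)              # update gate
    r = sigmoid(x_t @ W_r + h_prev @ U_r)              # reset gate
    h_tilde = np.tanh(x_t @ W_h + (r * h_prev) @ U_h)  # candidate state
    return (1.0 - z) * h_prev + z * h_tilde            # blended new state
```

Because the update gate `z` interpolates between the previous and candidate states, information from early in a sequence can persist across many steps without a separate memory cell.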
The architectures of both models are identical, with one hidden layer of 64 neurons (Fig.
2). Empirical evaluation of RNN models showed that both the LSTM and GRU
demonstrated superiority over traditional ML models [39]. Since LSTM and GRU
architectures have shown superior results in various applications, we compared both in
our experiments.
The SVD and embedding layers were tested separately with both RNN methods. The output
layer contains a single neuron with the sigmoid activation function. The adaptive
learning rate optimization algorithm Adam was used to train the RNN models [40].
A potential problem with training neural networks could be the number of epochs. A large
number of epochs could lead to overfitting, whereas an insufficient number of epochs
may result in an underfit model. That is why in our application, sequential learning
models used the early stopping method, which monitored the model's performance during
training. The objective of the method is to stop the training when the validation loss
(binary cross-entropy loss) starts to increase consistently. As a result, both RNN models
were trained for up to 20 epochs unless stopped earlier by the method mentioned above.
We used a batch size of 64 since, in such a way, the overall training procedure required
less memory. Furthermore, a smaller size was chosen because it is reported across many
applications that using such small batch sizes achieves training stability and improved
generalization performance [41].
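The early-stopping rule described above can be expressed as a small function. This is a simplified sketch; the patience threshold is an assumption, since the text does not state the exact stopping criterion:

```python
def train_with_early_stopping(val_losses, patience=3, max_epochs=20):
    """Return (epoch stopped at, best validation loss).

    Training stops early when the validation loss has not improved for
    `patience` consecutive epochs; otherwise it runs through all epochs.
    """
    losses = val_losses[:max_epochs]
    best = float("inf")
    wait = 0
    stopped_at = len(losses)
    for epoch, loss in enumerate(losses, start=1):
        if loss < best:
            best, wait = loss, 0  # improvement: reset the patience counter
        else:
            wait += 1
            if wait >= patience:  # no improvement for `patience` epochs
                stopped_at = epoch
                break
    return stopped_at, best
```

Deep learning frameworks provide the same behavior out of the box (e.g. Keras's `EarlyStopping` callback monitoring `val_loss`).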
To compare the performance of the proposed sequence learning models, we also trained
four standard machine learning models: DT, MLP, RF, and KNN. Only default settings were
used for these models.
Creating a use case diagram for the machine learning-based predictive model for lung
cancer risk assessment involves identifying the primary actors interacting with the system
and illustrating their interactions. Here's a simplified representation of the use case
diagram:
Use Cases:
Collect Data:
Description: The system collects diverse data sources related to patients' smoking
habits, environmental exposure, genetics, etc.
Actors: System
Preprocess Data:
Description: The system cleans, preprocesses, and prepares the collected data for
model training.
Actors: System
Train Model:
Description: The system utilizes machine learning algorithms to train the
predictive model based on the preprocessed data.
Actors: System
Validate Model:
Description: The system evaluates the trained model's performance using
validation techniques.
Actors: System
Provide Risk Assessment:
Description: Healthcare professionals interact with the system to obtain
personalized lung cancer risk assessments for patients.
Actors: Healthcare Professional, System
Present Recommendations:
Relationships:
Healthcare Professional --> Provide Risk Assessment --> System: Initiates the
request for patient-specific risk assessment.
Healthcare Professional --> Present Recommendations --> System: Receives
personalized recommendations based on the risk assessment.
System --> Collect Data --> System: Collects diverse data sources for model
training.
System --> Preprocess Data --> System: Cleans and prepares collected data for
training.
System --> Train Model --> System: Utilizes data to train the predictive model.
This use case diagram outlines the primary interactions between the actors (healthcare
professionals and the system) and the key functionalities involved in the development and
utilization of the predictive model for lung cancer risk assessment.
A use case is a representation of interactions between an actor (an external entity, which
can be a user or another system) and a system. It describes the functionality or behavior
of a system from the perspective of its users. Each use case represents a specific goal or
action that an actor wants to achieve when interacting with the system.
Actors: Represent entities interacting with the system. They can be users, external
systems, or any other role that engages with the system to achieve specific tasks.
Description: Details the specific functionality or behavior associated with the use
case.
Trigger: Describes the event or condition that initiates the use case.
Preconditions: Specifies any conditions that must be true for the use case to start.
Postconditions: States the expected outcome or state of the system after the use
case is completed successfully.
Flow of Events: Describes the sequence of steps or actions that occur when the
use case is executed. It typically includes the main flow (basic course of actions)
and alternative flows (exceptions or variations).
Preconditions:
The system has collected and preprocessed relevant patient data.
The machine learning model for lung cancer risk assessment is trained and
validated.
Postconditions:
The healthcare professional receives the personalized risk assessment for the
patient(s).
The system maintains the confidentiality and security of patient data.
Flow of Events:
Healthcare Professional requests risk assessment: The healthcare professional
logs into the system and provides patient-specific information required for the risk
assessment.
System processes the request: The system utilizes the trained predictive model to
analyze the provided data and generates a personalized risk assessment.
System presents risk assessment: The system displays the risk assessment
results to the healthcare professional, providing insights into the patient's
likelihood of developing lung cancer.
Healthcare Professional reviews and interprets the assessment: The healthcare
professional interprets the risk assessment and uses it to inform further medical
decisions or interventions.
Exceptions:
If the system encounters errors in data processing or model failure, it notifies the
healthcare professional and prompts appropriate actions or troubleshooting steps.
Unified Modeling Language (UML) diagram for the machine learning-based predictive
model for lung cancer risk assessment involves various components such as class
diagrams, activity diagrams, sequence diagrams, and more. For the purposes of this
project, let's create a high-level UML diagram outlining the main components and their
interactions:
Classes:
Data Collector: Responsible for collecting diverse data sources.
Data Preprocessor: Handles data cleaning, formatting, and preprocessing tasks.
Model Trainer: Utilizes machine learning algorithms to train the predictive
model.
Activities:
Collect Data: Data Collector gathers data from various sources.
Preprocess Data: Data Preprocessor cleans and prepares the collected data.
Train Model: Model Trainer uses data to train the predictive model.
Validate Model: Model Validator assesses the model's performance.
Provide Risk Assessment: Interaction between Healthcare Professional and
Predictive Model to obtain risk assessments.
Present Recommendations: Predictive Model presents actionable
recommendations based on risk assessments.
Sequence Diagram:
A sequence diagram shows the interactions between objects in a specific scenario or use
case.
Sequence:
Healthcare Professional -> Provide Risk Assessment -> Predictive Model:
Healthcare Professional initiates a request for risk assessment.
Predictive Model -> Provide Risk Assessment -> Healthcare Professional:
Predictive Model generates and provides risk assessment to Healthcare
Professional.
Healthcare Professional -> Present Recommendations -> Predictive Model:
Healthcare Professional receives and interprets the recommendations.
This UML diagram provides a high-level overview of the system's components (classes),
their relationships, and the flow of activities (activity and sequence diagrams) involved in
the development and utilization of the predictive model for lung cancer risk assessment. It
serves as a visual representation to understand the system's structure and behavior at a
conceptual level.
State:
In the context of the lung cancer risk assessment model:
Action:
Actions refer to the transformation of the system's state. In this context:
Collect Data Action: Involves the collection of diverse data sources related to lung
cancer risk factors.
Preprocess Data Action: Cleansing, formatting, and preparing the collected data
for model training.
Train Model Action: Utilizes the preprocessed data to train the machine learning
model.
Validate Model Action: Evaluates and validates the trained model's performance
using cross-validation or other techniques.
Predict Risk Action: Involves using the trained and validated model to predict
lung cancer risk for individuals.
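The Validate Model action above can be sketched with k-fold cross-validation; the synthetic data is a stand-in for the preprocessed patient dataset:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the preprocessed patient dataset.
X, y = make_classification(n_samples=200, n_features=8, random_state=1)

# 5-fold cross-validation: each fold serves once as the held-out validation set.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
```

The mean and spread of `scores` indicate how well the model is likely to generalize to unseen patients.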
Model:
The model here represents the machine learning model itself, developed to predict lung
cancer risk based on various input factors.
Conclusion:
The system algorithm for developing a machine learning-based predictive model for lung
cancer risk assessment involves a diverse set of algorithms encompassing data collection,
preprocessing, model development, evaluation, interpretability, and deployment. The
choice of algorithms depends on factors like data characteristics, model complexity,
interpretability requirements, and deployment environments, among others. These
algorithms collectively contribute to creating a robust and accurate predictive tool for
lung cancer risk assessment.
Decision Tree
Role in Lung Cancer Risk Assessment:
Feature Importance:
Decision trees help identify the most crucial features influencing lung cancer risk by
assessing feature importance. Attributes like smoking habits, environmental pollutants,
genetic predisposition, etc., are ranked based on their contribution to classification.
Interpretability:
Decision trees are easy to interpret: each prediction can be traced through a sequence of human-readable decision rules. At the same time, trees can capture non-linear relationships between input factors and lung cancer risk, which is important because certain risk factors may not affect the risk linearly.
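As an illustration of feature-importance ranking, a decision tree can be fitted and its `feature_importances_` attribute inspected. The feature names and data below are synthetic placeholders, not the project's dataset:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
features = ['Smoking', 'Air Pollution', 'Genetic Risk', 'Age']
X = rng.integers(0, 8, size=(200, 4)).astype(float)
# Make the synthetic label depend mostly on the first two columns
y = (X[:, 0] + X[:, 1] > 7).astype(int)

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
# Rank the attributes by their contribution to the classification
ranking = sorted(zip(features, tree.feature_importances_),
                 key=lambda p: p[1], reverse=True)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Here the two informative columns dominate the ranking, mirroring how attributes such as smoking habits would rank above weaker factors in a real model.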
Latency:
Training Latency:
Training a machine learning model involves processing the collected data, feature
engineering, algorithm execution, hyperparameter tuning, and model validation. The
duration can range from minutes to several hours or even days, depending on the dataset
size, algorithm complexity, and available computational resources.
Prediction Latency:
Once the model is trained and deployed, the time taken to predict lung cancer risk for an
individual depends on:
Model Complexity: Simple models like logistic regression might have lower
prediction times compared to complex models like deep neural networks.
Size of Input Data: Larger input data or higher dimensionality may increase
prediction time.
Hardware and Software Infrastructure: Utilization of powerful hardware
(GPUs/TPUs) and optimized software frameworks can reduce prediction latency.
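A minimal sketch of measuring prediction latency, assuming a simple logistic regression model on synthetic data (actual timings depend entirely on hardware and model complexity):

```python
import time
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X_train = rng.normal(size=(1000, 20))
y_train = (X_train[:, 0] > 0).astype(int)
model = LogisticRegression().fit(X_train, y_train)

X_new = rng.normal(size=(1, 20))  # one individual's risk factors
start = time.perf_counter()
risk = model.predict_proba(X_new)[0, 1]
latency_ms = (time.perf_counter() - start) * 1000
print(f"Predicted risk {risk:.3f} in {latency_ms:.2f} ms")
```

The same timing pattern applies to any deployed model; a deep network would simply show a larger `latency_ms`.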
Performance:
Model Performance Metrics:
Accuracy: The ability of the model to correctly predict lung cancer risk.
Precision: Proportion of correctly predicted positive instances (lung cancer cases)
among all instances predicted as positive.
Recall: Proportion of correctly predicted positive instances among all actual
positive instances.
F1-score: Harmonic mean of precision and recall, balancing both metrics.
ROC-AUC: Area under the Receiver Operating Characteristic curve, assessing the
model's ability to distinguish between classes.
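All of these metrics are available in scikit-learn; the labels and scores below are hypothetical values used only to illustrate the calls:

```python
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score, roc_auc_score)

y_true  = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]   # 1 = high lung cancer risk (hypothetical)
y_pred  = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]   # hard class predictions
y_score = [0.9, 0.2, 0.8, 0.4, 0.1, 0.7, 0.6, 0.3, 0.95, 0.05]  # predicted probabilities

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1-score :", f1_score(y_true, y_pred))
print("ROC-AUC  :", roc_auc_score(y_true, y_score))   # uses scores, not hard labels
```

Note that ROC-AUC is computed from the predicted probabilities rather than the thresholded class labels.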
Validation and Testing:
Background: Lung cancer is the second most common cancer in incidence and the leading cause of cancer deaths worldwide. Meanwhile, lung cancer screening with low-dose CT can reduce mortality. The UK National Screening Committee recommended targeted lung cancer screening on Sept 29, 2022, and asked for more modelling work to be done to help refine the recommendation. This study aims to develop and validate a risk prediction model, the CanPredict (lung) model, for lung cancer screening in the UK and to compare its performance against seven other risk prediction models.
Methods: For this retrospective, population-based cohort study, we used linked electronic health records from two English primary care databases: QResearch (Jan 1, 2005–March 31, 2020) and the Clinical Practice Research Datalink (CPRD). D statistics were 2.8 in the QResearch (validation) cohort and 2.4 in the CPRD cohort. Compared with seven other lung cancer prediction models, the CanPredict (lung) model had the best performance in discrimination and calibration.
Fig 8: ROC curves for risk prediction models in the MOLTEST BIS cohort. ROC, receiver operating characteristic curve; LLP, Liverpool Lung Project; AUC, area under the receiver operating characteristic curve. [30]
Fig 9: Graphs
Moreover, emphasis has been placed on interpretability and explainability, incorporating methods for understanding feature importance and visualizing the model's decision-making process. This strategic approach aims to enhance the model's transparency and facilitate the comprehension of predictions by healthcare professionals.
Lung cancer is the major cause of cancer-related death today, and it is expected to remain so for the foreseeable future. Lung cancer is treatable if the symptoms of the disease are detected early. Using current developments in computational intelligence, it is possible to construct a sustainable prototype model for the treatment of lung cancer without negatively impacting the environment. Because it reduces the resources wasted and the manual work required, it saves both time and money. To optimise detection from the lung cancer dataset, a machine learning model based on support vector machines (SVMs) was used. Lung cancer patients are classified by an SVM classifier based on their symptoms, with the Python programming language used for the model implementation. The effectiveness of the SVM model was evaluated against several different criteria. Several cancer datasets from the University of California, Irvine (UCI) repository were utilised to evaluate the proposed model. The favourable findings of this research would enable smart cities to deliver better healthcare to their citizens: patients with lung cancer can obtain real-time treatment cost-effectively, with minimal effort and latency, from any location and at any time. The proposed model was compared with the existing SVM and SMOTE methods and achieves an accuracy of 98.8%, outperforming the existing methods.
The data was obtained from the UCI machine learning repository; the dataset contains 32 examples, each with 57 features, and all predictive attributes take values in the nominal range 0–3. The nominal attribute and class label data are translated into binary form, which makes the data easier to analyse; conversion from nominal to binary form is the most widely used and standardized method in data analysis. There are some missing values in the dataset, which affect the performance of the algorithm, so caution should be exercised when analyzing the data. The label has three levels of severity: high, medium, and low. Because a significant amount of the input data is missing, it is important to prepare the data by replacing each missing value with the value that occurs most frequently in its column. Following that, the newly processed data is subjected to further analysis.
The proposed method is efficient because of the computations performed in the system. Once the input data is provided, it is compared against data held in several formats and analysed. These analysis steps compute the structure and dimensions of the input by comparing it with the many records in the attached datasets, and the data involved in these calculations defines the decision boundaries. Tracking how these boundaries change as related records are grouped together allows the various shape models to be analysed more accurately, which is why the accuracy is high.
As demonstrated by the evaluation findings, SVM with SMOTE resampling (Figures 3–8) applied over two iterations of the lung cancer dataset produced the best performance on the dataset. Compared to earlier methods, this approach achieves the highest value for all of the parameters investigated. The lung cancer dataset contains two minority classes, so after two rounds of SMOTE the two classes are equally distributed. A third run of SMOTE generates synthetic samples for class B, which had been the majority class in the previous steps; however, these additional samples do not improve classification performance. The best way to combine SVM and SMOTE is therefore to apply both twice on the same dataset.
The development of a machine learning-based predictive model for lung cancer risk
assessment holds several practical uses and significant implications within healthcare and
beyond:
Practical Uses:
Implications:
Early Detection and Improved Outcomes:
Early identification of individuals at risk may lead to early detection of lung cancer,
potentially improving treatment outcomes by enabling timely intervention and
management.
Ethical Considerations:
Handling sensitive health-related data and making predictions about an individual's health
condition raises ethical concerns regarding patient privacy, data security, informed
consent, and fair use of predictive analytics in healthcare.
Health Equity and Accessibility:
Ensuring equitable access to risk assessment tools and interventions is crucial to prevent
exacerbating health disparities among different socioeconomic groups or regions.
Continuous Improvement and Validation:
Ongoing validation, refinement, and improvement of the model are critical to maintaining
accuracy, especially considering the evolving nature of medical data and healthcare
practices.
Overall Impact:
Roadblocks
Developing a machine learning-based predictive model for lung cancer risk assessment
involves several challenges and roadblocks that can hinder the project's progress. Some of
the key roadblocks include:
Feasibility Analysis
A feasibility analysis for a machine learning-based predictive model for lung cancer risk
assessment involves evaluating various aspects to determine the project's viability, including
technical, economic, operational, and scheduling feasibility.
Technical Feasibility:
Data Availability and Quality: Assess the availability of diverse data sources
containing relevant factors like smoking habits, environmental exposure, genetic
predisposition, etc. Evaluate data quality, considering completeness, consistency, and
potential biases.
Technology and Tools: Determine the feasibility of employing suitable technologies,
algorithms, and tools for data preprocessing, model development, validation, and
deployment. Consider hardware and software requirements for computational
resources.
Economic Feasibility:
Operational Feasibility:
Resource Availability: Assess the availability of skilled personnel, domain experts
(healthcare professionals), and IT infrastructure needed for model development,
implementation, and ongoing maintenance.
Integration with Healthcare Systems: Determine the feasibility of integrating the
predictive model into existing healthcare systems or workflows while ensuring
compatibility and acceptance by healthcare professionals.
Scheduling Feasibility:
Timeline and Milestones: Evaluate the feasibility of meeting project deadlines,
considering the complexities involved in data collection, preprocessing, model
development, validation, and deployment.
Risk Assessment and Mitigation: Identify potential risks (e.g., data quality issues,
model performance limitations, regulatory hurdles) and develop mitigation strategies
to address them.
Despite the value of lung cancer screenings, only 2-4 percent of eligible patients in the
U.S. are screened today. This work demonstrates the potential for AI to increase both
accuracy and consistency, which could help accelerate adoption of lung cancer screening
worldwide.
These initial results are encouraging, but further studies will assess the impact and utility in clinical practice. We are collaborating with the Google Cloud Healthcare and Life Sciences team to serve this model through the Cloud Healthcare API, and are in early conversations with partners around the world to continue additional clinical validation research and deployment.
[1] M. I. Faisal, S. Bashir, Z. S. Khan, and F. H. Khan, "An evaluation of machine learning classifiers and ensembles for early stage prediction of lung cancer," in Proc. 3rd Int. Conf. on Emerging Trends in Engineering, Sciences and Technology (ICEEST), IEEE, Dec. 2018, pp. 1-4.
[2] J. Cabrera, A. Dionisio, and G. Solano, "Lung cancer classification tool using microarray data and support vector machines," in Proc. Information, Intelligence, Systems and Applications (IISA), Jul. 2015.
[3] Z. Yu, X. Z. Chen, L. H. Cui, H. Z. Si, H. J. Lu, and S. H. Liu, "Prediction of lung cancer based on serum biomarkers by gene expression programming methods," Asian Pacific Journal of Cancer Prevention, vol. 15, no. 21, pp. 9367-9373, 2014.
[4] H. Shin, S. Oh, S. Hong, M. Kang, D. Kang, Y. G. Ji, and Y. Choi, "Early-stage lung cancer diagnosis by deep learning-based spectroscopic analysis of circulating exosomes," ACS Nano, vol. 14, no. 5, pp. 5435-5444, 2020.
[5] S. H. Hyun, M. S. Ahn, Y. W. Koh, and S. J. Lee, "A machine-learning approach using PET-based radiomics to predict the histological subtypes of lung cancer," Clinical Nuclear Medicine, vol. 44, no. 12, pp. 956-960, 2019.
[6] W. Rahane, H. Dalvi, Y. Magar, A. Kalane, and S. Jondhale, "Lung cancer detection using image processing and machine learning healthcare," in Proc. Int. Conf. on Current Trends towards Converging Technologies (ICCTCT), IEEE, Mar. 2018, pp. 1-5.
[7] B. A. Miah and M. A. Yousuf, "Detection of lung cancer from CT image using image processing and neural network," in Proc. 2nd Int. Conf. on Electrical Engineering and Information and Communication Technology (ICEEICT), May 2015.
[8] B. V. Ginneken, B. M. Romeny, and M. A. Viergever, "Computer-aided diagnosis in chest radiography: a survey," IEEE Transactions on Medical Imaging, vol. 20, no. 12, 2001.
[9] H. Becker, W. Nettleton, P. Meyers, J. Sweeney, and C. Nice, Jr., "Digital computer determination of a medical diagnostic index directly from chest X-ray images," IEEE Transactions on Biomedical Engineering, vol. BME-11, pp. 67-72, 1964.
[10] L. S. Kovasznay and H. M. Joseph, "Image processing," Proceedings of the IRE, vol. 43, pp. 560-570, May 1955.
[11] P. C. Goldmark and J. M. Hollywood, "A new technique for improving the sharpness of pictures," Proceedings of the IRE, vol. 39, p. 1314, Oct. 1951.
[12] Bedford and Fredendall, "Analysis, synthesis, and evaluation of the transient response of television apparatus," Proceedings of the IRE, vol. 30, pp. 453-455, Oct. 1942.
[13] J. Duncan and N. Ayache, "Medical image analysis: progress over two decades and the challenges ahead," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, pp. 85-106, Jan. 2000.
Import Libraries
!pip install dtreeviz
import os
import warnings
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# List the input files available in the Kaggle environment
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
warnings.filterwarnings("ignore")
Load Data
df = pd.read_csv("/kaggle/input/cancer-patients-and-air-pollution-a-new-link/"
                 "cancer patient data sets.csv")
df
df.dtypes

# Correlation heatmap of the numeric features
df_corr = df.corr(numeric_only=True)
plt.title("Correlation Matrix")
sns.heatmap(df_corr, cmap='viridis')

# Age distribution for each risk level
sea = sns.FacetGrid(df, col="Level", height=4)
sea.map(sns.histplot, "Age", kde=True)  # distplot is deprecated in recent seaborn

df['Level'].value_counts()
sns.displot(df['Level'])  # the original loop drew this same plot 24 times

y = df.Level.values
y
Random Forest
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report
import dtreeviz

x = df.drop(columns=['Level'])  # assumes ID-like columns were dropped earlier
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
model_rf = RandomForestClassifier()
model_rf.fit(x_train, y_train)
y_pred_rf = model_rf.predict(x_test)
print(classification_report(y_test, y_pred_rf))  # stands in for the custom perform() helper

# model_dt: a single decision tree to visualise with dtreeviz
model_dt = DecisionTreeClassifier(max_depth=4).fit(x_train, y_train)
viz_model = dtreeviz.model(model_dt,
                           X_train=x_train, y_train=y_train,
                           feature_names=list(x.columns),
                           target_name='Lung Cancer',
                           class_names=['Low', 'Medium', 'High'])
viz_model.view()