
The Impacts of Advanced Categorical Data Analysis on the Healthcare Industry

Sean Jordan
Beth Dungey
Philip Graff
9 May 2018
Intern/Mentor I G/T

Introduction

In earlier eras, transactions and records of all kinds were kept on paper and stored in large data banks because this was the best method available at the time. These banks grew to massive sizes to accommodate entire cities and civilizations (Dumon, 2017). Fast forward to the mid-20th century and the computer was invented. Scientists found that it offered a much faster and more reliable way to keep records, since computer circuits could process information at a far higher rate than humans could ever manage themselves (Mahoney, 1988). Data was put into organized tables called spreadsheets. These spreadsheets, paired with simple and eventually more complex sorting algorithms, were enough to satisfy the world's data-processing needs. But then things started to change. The population continued to grow, and the data analysis algorithms in common use were simply not powerful enough to sort through millions upon billions of data values and draw meaningful conclusions from them (Dean & Ghemawat, 2004). Scientists quickly realized that a new solution to this growing dilemma was in order. Thus, the big data era began.

One industry particularly affected by the increase in population and the big data revolution is the medical field. Healthcare providers need to keep accurate and up-to-date records for each patient in their care. The problem is that they care for so many patients that the datasets holding all of this valuable patient data get very big, very fast (Dinov, 2016). Medical professionals need a reliable way to analyze such sets to identify specific subpopulations and risk factors. Within the past decade, remarkable improvements have been made in the field of big data analysis, specifically in computer programming. While standard libraries have given positive results in the past when analyzing datasets, new algorithms can build upon these results and provide a more accurate and more effective measurement of the given data.



Literature Review

A major innovation in the field of data analysis was machine learning, which allows algorithms to "learn" relationships within datasets. The first type, unsupervised machine learning, involves inferring relationships from unlabeled data, that is, data that has no categorization or output label. This type of learning mainly seeks to analyze data and form subpopulations that shed light on different inferences. Supervised machine learning, on the other hand, uses labeled data, in which the set of input values in each data point is linked to a known output (Béjar, 2014). This type of learning mainly seeks to predict the categorizations of unlabeled data points by studying the data points that do carry output labels. Both processes have been implemented in the Python programming language through the scikit-learn library, which provides a variety of algorithms supporting supervised and unsupervised machine learning (Geron, 2017). Machine learning can be applied to the healthcare industry by helping to predict a patient's chances of having a particular disease based on his or her symptoms and risk factors. However, there is a major problem: the aforementioned algorithms mainly work with numerical datasets. When categorical attributes appear in the datasets that need to be analyzed, these quick and effective functions cannot be used. When categorical analysis is used instead, the process takes significantly longer and yields less useful information, making data analysis a hassle for large organizations (He, Xu, & Deng, 2007). Many medical institutions have categorical attributes in their patient datasets, which has led to the constantly growing problem of how to gather useful information from categorical and mixed datasets in hospitals.



Innovative libraries such as NumPy, SciPy, and Pandas have revolutionized the way the Python programming language works with numerical datasets, allowing quick and meaningful analysis of large collections of numbers with tools such as matrices, arrays, and dataframes.

NumPy is a fundamental Python library that adds support for multidimensional arrays and matrices. It is used extensively by other libraries as a dependency due to its universal applications in mathematics. While NumPy can do many different things, at its core it excels at creating n-dimensional arrays that are scalable and easy to access ("What is NumPy", 2015). These arrays, which vanilla Python lacks support for, are widely used to organize large collections of data into lists that can be easily modified. However, the machinery behind these specialized arrays does not come from Python itself; it is implemented in the C programming language, which gives the arrays speed that pure Python code cannot match (2015). NumPy is especially useful for medical institutions because these organizations often have a large amount of patient data to store. The specialized multidimensional arrays that NumPy creates can help organize that data in a way that can be easily read and, most importantly, modified by other tools such as those in SciPy and Pandas (Paruchuri, 2016).
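
As a brief illustration of those multidimensional arrays, consider the minimal sketch below. The patient values are hypothetical, invented purely for demonstration:

```python
import numpy as np

# A hypothetical table of patient measurements: each row is a patient,
# and the columns are (age, systolic blood pressure, cholesterol).
patients = np.array([
    [54, 130, 246],
    [61, 145, 282],
    [47, 120, 204],
])

print(patients.shape)         # (3, 3): rows x columns
print(patients[:, 2])         # the cholesterol column
print(patients.mean(axis=0))  # column-wise averages across patients
```

Operations like the column slice and the column-wise mean run in compiled C code, which is where NumPy's speed advantage over plain Python lists comes from.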

SciPy, which is built upon the core methods added by NumPy, adds programming functionality for many of the processes seen in the science, technology, and engineering fields. The library encompasses the advanced mathematical computations seen in scientific work, such as special functions, integration, differential equations, and more ("Frequently Asked Questions", n.d.). Like NumPy, much of SciPy is not actually written in Python; it is built on a combination of C and Fortran code, which helps streamline the algorithms and, ultimately, makes them faster and more efficient. This high efficiency is key, which is why the library is so popular among machine learning and data analysis engineers (Akinfaderin, 2017). In the medical field, determining rates of false positives and false negatives is vitally important. These calculations are typically done by integrating the Receiver Operating Characteristic (ROC) curve, which requires mathematical calculations like those provided by SciPy (Tsanas, Little, & McSharry, 2013).
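
To make the ROC calculation concrete, the sketch below numerically integrates a handful of hypothetical ROC curve points (invented for illustration) to obtain the area under the curve:

```python
import numpy as np
from scipy import integrate

# Hypothetical points along a ROC curve: false positive rate (x) and
# true positive rate (y) at a series of decision thresholds.
fpr = np.array([0.0, 0.10, 0.25, 0.50, 1.0])
tpr = np.array([0.0, 0.60, 0.80, 0.90, 1.0])

# The area under the ROC curve (AUC) summarizes the trade-off between
# false positives and false negatives; here it is found by trapezoidal
# numerical integration.
auc = integrate.trapezoid(tpr, fpr)
print(f"AUC = {auc:.3f}")
```

An AUC near 1.0 indicates a test that separates positive and negative cases well, while an AUC near 0.5 is no better than chance.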

Pandas is another popular data analysis library built extensively on top of NumPy. This library offers support for organizing large collections of data into visually appealing datasets that can be easily modified for almost any use. This aids in performing mathematical calculations on the values within a dataset as well as presenting the dataset to the programmer or others. The central addition of Pandas is the DataFrame, an object that contains all of the data and neatly sorts it into a table when called upon. These DataFrames can be created, sliced, rearranged, cut, appended to, and more with just a quick function call, and these functions are all designed to benefit the person analyzing the data ("Pandas," n.d.). Pandas, unlike NumPy and SciPy, has a good portion of its code written in Python (the rest is in C). This matters because executing a DataFrame function involves many transformations between DataFrames and Python lists/arrays, and handling Python's native structures in Python code helps the process go as quickly and as cleanly as possible. As mentioned earlier, healthcare institutions quickly accrue very large patient datasets that they need to store. Pandas is useful because it allows data analysts to easily sort that data and extract meaningful information from it in a way that serves the institution.
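
A minimal sketch of that DataFrame workflow follows; the column names and values form a small hypothetical patient table, not data from any real dataset:

```python
import pandas as pd

# A hypothetical patient dataset with mixed attribute types.
df = pd.DataFrame({
    "patient_id":  [101, 102, 103, 104],
    "age":         [54, 61, 47, 69],
    "sex":         ["M", "F", "F", "M"],
    "cholesterol": [246, 282, 204, 311],
})

# Slice, sort, and summarize with single function calls.
high_risk = df[df["cholesterol"] > 240].sort_values("age")
print(high_risk)
print(df.groupby("sex")["cholesterol"].mean())
```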

Scikit-learn, another Python library vital to GALILEO, is designed exclusively for machine learning. It is packed full of algorithms devoted to both unsupervised and supervised machine learning. Since Python is an object-oriented programming language widely considered adept at efficient data manipulation, the combination of Python with a library such as scikit-learn allows for fast, efficient, and accurate machine learning, which is quickly becoming more and more necessary in today's digitally-enhanced world.
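
The sketch below shows both learning types in miniature, again with hypothetical patient features and labels invented for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

# Hypothetical numeric patient features: (age, cholesterol).
X = np.array([[54, 246], [61, 282], [47, 204], [69, 311], [39, 180]])
y = np.array([1, 1, 0, 1, 0])  # hypothetical labels: 1 = disease present

# Unsupervised learning: cluster the patients without using any labels.
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(clusters)

# Supervised learning: fit a classifier to the labeled points, then
# predict the label of a new, unseen patient.
clf = LogisticRegression().fit(X, y)
print(clf.predict([[58, 250]]))
```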

These four Python libraries - NumPy, SciPy, Pandas, and Scikit-learn - have revolutionized the way that programmers work with datasets. The algorithms that these libraries contain can be readily applied to the medical field to extract useful information that can help save both money and lives.



The field of statistics and probability analysis as it relates to computer science has also revolutionized the way researchers examine big data. Statistical values such as the mean, range, and standard deviation allow researchers to analyze numerical datasets quickly. The mean, or average, of a dataset represents its central tendency and helps summarize a population with a single value. The range of a dataset gives the lowest and highest endpoints of a set of data points, helping to establish bounds for a cluster of points or for the dataset as a whole. The standard deviation of a dataset represents how far its values are dispersed from the group as a whole, denoting how concentrated the dataset is around its mean. In addition to these statistical values, there are also probability analysis techniques that help researchers better understand a dataset and its values.
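
All three of these summary statistics are single function calls in NumPy, as the sketch below shows with a handful of hypothetical readings:

```python
import numpy as np

# Hypothetical cholesterol readings from a patient dataset.
values = np.array([246, 282, 204, 311, 180, 265])

print("mean:", values.mean())                      # central tendency
print("range:", values.min(), "to", values.max())  # bounds of the data
print("standard deviation:", values.std())         # dispersion about the mean
```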



The first major probability theory is the Bayesian model. Unlike other probability models, the Bayesian model focuses on the reasonably expected. It uses random variables (thus making it entropy-based) to model sources of uncertainty in a dataset, in hopes of predicting future points accurately and efficiently. To do this, Bayes' theorem is used. It states that the probability of one event, A, being true given that a second event, B, is true equals the probability of B being true given that A is true, multiplied by the probability of A being true independent of B, all divided by the probability of B being true independent of A. In symbols, with the bar (|) read as "given":

P(A|B) = P(B|A) * P(A) / P(B)

The whole purpose of Bayesian probability analysis is to test the truthfulness of a specified hypothesis against the given data and produce a conditional probability value as a result (LaMorte, 2016, "Bayes's theorem").
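
A short worked example of the theorem follows, with A as "patient has the disease" and B as "patient tests positive." The prevalence, sensitivity, and false positive rate are hypothetical numbers chosen for illustration:

```python
# Hypothetical inputs for a diagnostic test.
p_a = 0.01              # P(A): prevalence of the disease
p_b_given_a = 0.95      # P(B|A): test sensitivity
p_b_given_not_a = 0.05  # P(B|not A): false positive rate

# P(B) via the law of total probability.
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
p_a_given_b = p_b_given_a * p_a / p_b
print(f"P(disease | positive test) = {p_a_given_b:.3f}")  # about 0.161
```

Note how a positive result from a fairly accurate test still yields only about a 16% chance of disease when the disease is rare, which is exactly the kind of conditional reasoning Bayesian analysis captures.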

Another major probability theory used extensively in computer-based statistics is the Gaussian model. The Gaussian model, also called the normal model, is a very common distribution for continuous outcomes. Its main purpose is to let statisticians predict the probability of a certain value occurring in a dataset, information that is used to create better models of datasets that are applicable to the real world (LaMorte, 2016, "The normal distribution"). The Gaussian model is essentially a continuous curve. The x-axis shows the value of the variable being measured; the y-axis shows the frequency at which that value occurs in a hypothetical dataset. The Gaussian model assumes that the dataset is normal, meaning that equal proportions of data points lie below and above the mean value, and the median and mode also sit at the center of the graph. In a skewed dataset, the graph is weighted toward either the left or the right side, which places the mean, median, and mode at different parts of the curve.
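
A minimal sketch of the normal model with SciPy follows, using a hypothetical mean and standard deviation for, say, systolic blood pressure in a population:

```python
from scipy import stats

# A normal (Gaussian) model with hypothetical parameters.
mu, sigma = 120, 15
dist = stats.norm(loc=mu, scale=sigma)

# Height of the curve (probability density) at a particular value.
print(dist.pdf(130))

# Probability that a reading falls between 110 and 140.
print(dist.cdf(140) - dist.cdf(110))
```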

While libraries such as NumPy, SciPy, Pandas, and Scikit-learn have revolutionized the way people handle and predict big data, these libraries are currently only effective with quantitative data, not categorical or mixed data. Healthcare providers are overwhelmed by the number of patients they need to treat, and much of the data they store on patients' behalf is categorical or mixed. Hospitals are having a hard time digesting this data and extracting useful information from it. Without a reliable and efficient method to organize and analyze large patient datasets with categorical or mixed attributes, the accuracy of physicians' diagnoses will start to falter under long hours of work. GALILEO, the categorical data analysis tool from the JHU/APL, will help solve this problem by creating clusters of subpopulations based on their attributes and values. This will allow medical personnel to easily extract extremely useful information from complex datasets, lowering the time and cost of diagnosis and care while improving accuracy and precision. This includes helping to reduce the chances of a misdiagnosis, such as a false positive or false negative. GALILEO can revolutionize the way the world handles data, especially in the medical field.

Data Collection/Methods

In order to answer the research question of how best to gather valuable knowledge from large patient datasets with categorical and mixed attributes, GALILEO, a new unsupervised learning algorithm developed at the JHU/APL, was used to draw relationships between patient data for the diagnostic process. It was hypothesized that GALILEO would more effectively draw relationships and decrease the costs and time involved in the diagnostic process. GALILEO uses entropy-based data metrics to cluster datasets into pools of data points that share similar attributes. A general mixture model is used to cluster the data into groups of similar data points (Graff, Savkli, Lin, & Kinsey, 2017). This finite mixture model uses algorithms derived from both Naïve Bayes and Gaussian mixture models, making it a universal solution for complex data problems. The model is further described by the superposition of probability distributions for the clusters in the dataset.

Standard and random variables are present throughout GALILEO, so the analysis can differ each time it is run. The k-value determines the starting and maximum number of clusters in the dataset, and random points are selected as the seeds for the initial clusters.
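
GALILEO itself is not publicly available, so the sketch below is only a rough analogue of the procedure described above: it fits scikit-learn's GaussianMixture (a general mixture model) over a range of candidate cluster counts k and keeps the count with the lowest Bayesian Information Criterion. The data is synthetic, and GALILEO's actual entropy-based metrics are not reproduced here:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic numeric data drawn from two overlapping groups.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(4, 1, (100, 2))])

# Fit a mixture model for each candidate k and keep the k with the
# lowest Bayesian Information Criterion (BIC).
best_k, best_bic = None, np.inf
for k in range(2, 8):
    gm = GaussianMixture(n_components=k, random_state=0).fit(X)
    bic = gm.bic(X)
    if bic < best_bic:
        best_k, best_bic = k, bic
print("best k by BIC:", best_k)
```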

Results and Data Analysis

The Mushrooms Dataset



The Mushrooms Dataset was pulled from the University of California - Irvine (UCI) Machine Learning Repository. It consisted of 23 attributes (columns) and 8124 rows of data. The dataset contained different letters describing different characteristics of mushrooms.

Results

When GALILEO was run on the Mushrooms Dataset, the optimal number of clusters found was 23. This was the same value for the Akaike, Bayesian, and Density Information Criteria, and the consistency seen in this result is promising. The resulting confusion matrix shows intense but precise areas of concentration, which denotes distinct clusters with well-defined boundaries. Additionally, even though this information was not included in the dataset in any way, GALILEO found that there were 23 different species of mushrooms in the dataset. This shows that the algorithm really does have potential for unsupervised machine learning on large datasets such as this one. This was a benchmark test that gave proof of what GALILEO could do.
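
For reference, the confusion matrices cited throughout these results can be produced with a single scikit-learn call; the sketch below uses hypothetical species labels and cluster assignments:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true species labels versus the cluster assignments
# produced by an unsupervised algorithm.
true_species = [0, 0, 0, 1, 1, 2, 2, 2]
cluster_ids  = [0, 0, 1, 1, 1, 2, 2, 0]

# Each row is a true class and each column a cluster; values that are
# concentrated in a diagonal-like pattern indicate distinct clusters
# with well-defined boundaries.
print(confusion_matrix(true_species, cluster_ids))
```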

The Cleveland Heart Disease Dataset

The Heart Disease Dataset was pulled from the University of California - Irvine (UCI) Machine Learning Repository. It consisted of 14 attributes (columns) and 313 rows of data. The dataset contained different words and numbers representing different symptoms of heart disease, such as cholesterol level or the body parts affected by pain.

Results

The results generated by GALILEO for the Heart Disease Dataset were less than satisfactory. GALILEO found the best number of clusters to be 2, the absolute minimum value allowed in the program code. Additionally, the resulting confusion matrix generated by Scikit-learn shows intense population over a wide spread of area, denoting badly blurred lines between clusters; the algorithm did not really know where to put certain points. It was concluded that this unsatisfactory result was due to two main issues. The first was the presence of too many unique values in the attribute space. For example, the "cholesterol" attribute had numbers such as '291.4' and '291.5'. GALILEO treats these as completely different values, and the presence of 200+ unique values in each attribute caused GALILEO to fail. Additionally, there were only 313 instances of data in this set, compared to the 8000+ seen in the Mushrooms Dataset. With any type of machine learning, the more instances of data a set has, the better the results of its analysis. These two problems caused GALILEO to perform very poorly when analyzing the Cleveland Heart Disease Dataset.

The CASP Protein Dataset

The CASP Protein Dataset was pulled from the University of California - Irvine (UCI) Machine Learning Repository. It consisted of 9 attributes (columns) and 46000 rows of binned data. The dataset contained different numerical values representing different physicochemical characteristics of certain species of proteins found in the human body.

Results

For this dataset, GALILEO performed noticeably better than on the Cleveland Heart Disease Dataset. The experiment was designed to fix the problems found with the Heart Disease dataset. The dataset was much larger - around 46000 instances of data were available (this was cut to 10000 due to a memory overload error). This gave the algorithm far more instances to work with, which greatly helped the unsupervised and supervised machine learning processes. Additionally, the data values within the attributes were binned into groups of numbers (which depended on the range of values in each attribute). In the end, each attribute had only 10 unique values instead of the 200-300 seen without binning. This decrease in unique values helped GALILEO connect the dots and perform better in the end. The GALILEO algorithm found the best number of clusters to be 23. It turns out that the original dataset had 21 different species of proteins cataloged (this was NOT exhibited in the dataset at all). The resulting confusion matrix was much better; even though it showed some error and spread, it displayed a few distinct clusters with fairly well-defined boundaries. So, GALILEO was not far off in its machine learning estimations, which shows a lot of promise for analyzing datasets. However, it is worth noting that this dataset was only quantitative in nature, not mixed or categorical.
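
The binning step described here can be reproduced in one line with Pandas; the sketch below bins a synthetic continuous attribute into 10 equal-width groups:

```python
import numpy as np
import pandas as pd

# A synthetic continuous attribute with hundreds of unique values.
values = pd.Series(np.random.default_rng(0).normal(250, 40, 1000))

# Bin into 10 equal-width groups so the attribute is left with only
# 10 unique values, mirroring the preprocessing used for CASP.
binned = pd.cut(values, bins=10, labels=False)
print(binned.nunique())  # 10
```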



Conclusion

The results from the three experiments clearly show that GALILEO has an enormous amount of potential in the unsupervised machine learning field. It excels at finding information about datasets that was not explicitly stated within the dataset itself. It found that there were 23 species of mushrooms without knowing beforehand that there were 23 species, and it nearly recovered the 21 species in the CASP Protein Dataset. Based on these results, GALILEO can have an enormous impact on not just the medical field but other fields such as archaeology and economics as well. The possibilities are endless. With that being said, GALILEO does have a small set of requirements for it to work as intended. First, there must either be a small set of unique values for each attribute or binning put in place by the researcher; this enables connections to be made between clusters in the dataset. Second, the dataset must be large enough to provide the information needed for unsupervised machine learning to actually take place. With more work and testing, GALILEO can be fine-tuned to be more universally applicable to all kinds of datasets. From there, it can start making a lasting positive impact on the field of big data analysis.



References

Akinfaderin, W. (2017, March 23). The mathematics of machine learning. Retrieved from https://towardsdatascience.com/the-mathematics-of-machine-learning-894f046c568

Béjar, J. (2014, September). Unsupervised machine learning and data mining. Retrieved from http://www.cs.upc.edu/~bejar/amlt/material/AMLTTransBook.pdf

Brownlee, J. (2016, May 16). Visualize machine learning data in Python with Pandas. Retrieved from https://machinelearningmastery.com/visualize-machine-learning-data-python-pandas/

Dean, J., & Ghemawat, S. (2004). MapReduce: Simplified data processing on large clusters. Retrieved from http://web.mit.edu/6.033/www/papers/mapreduce-osdi04.pdf

Dinov, I. D. (2016, March 17). Volume and value of big healthcare data. Retrieved from HHS Public Access website: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4795481/pdf/nihms-766954.pdf

Dumon, M. (2017, December 14). Ancient accounting systems. Retrieved from https://www.investopedia.com/articles/financialcareers/09/ancient-accounting.asp

Frequently asked questions. (n.d.). Retrieved from https://www.scipy.org/scipylib/faq.html

Geron, A. (2017). The fundamentals of machine learning. In Hands-on machine learning with Scikit-Learn and TensorFlow: Concepts, tools, and techniques to build intelligent systems (pp. 3-228). Sebastopol, CA: O'Reilly.

Graff, P., Savkli, C., Lin, J., & Kinsey, M. (2017, July 18). GALILEO: Generalized low-entropy mixture model [Microsoft PowerPoint].

He, Z., Xu, X., & Deng, S. (2007). Attribute value weighting in k-modes clustering. Retrieved from Harbin Institute of Technology, Department of Computer Science and Engineering website: https://arxiv.org/pdf/cs/0701013.pdf

LaMorte, W. W. (2016, July 24). Bayes's theorem. Retrieved from The Role of Probability website: http://sphweb.bumc.bu.edu/otlt/MPH-Modules/BS/BS704_Probability/BS704_Probability6.html

LaMorte, W. W. (2016, July 24). The normal distribution: A probability model for a continuous outcome. Retrieved from The Role of Probability website: http://sphweb.bumc.bu.edu/otlt/MPH-Modules/BS/BS704_Probability/BS704_Probability8.html

Mahoney, M. S. (1988). The history of computing in the history of technology. Retrieved from https://www.princeton.edu/~hos/mike/articles/hcht.pdf

Pandas: A Python data analysis library. (n.d.). Retrieved from https://pandas.pydata.org/index.html

Paruchuri, V. (2016, October 18). NumPy tutorial: Data analysis with Python. Retrieved from https://www.dataquest.io/blog/numpy-tutorial-python/

Tsanas, A., Little, M. A., & McSharry, P. E. (2013). A methodology for the analysis of medical data. In J. P. Sturmberg & C. M. Martin (Eds.), Handbook of systems and complexity in health (pp. 113-125). Retrieved from https://people.maths.ox.ac.uk/tsanas/Preprints/A%20methodology%20for%20the%20analysis%20of%20medical%20data_website.pdf

What is NumPy? (2015). Retrieved from SciPy Foundation website: https://docs.scipy.org/doc/numpy-1.10.0/user/whatisnumpy.html
