Sean Jordan Synthesis Paper
Sean Jordan
Beth Dungey
Philip Graff
9 May 2018
Intern/Mentor I G/T
Introduction
For most of recorded history, transactions and records were kept on paper and
stored in large data banks because this was the best method available at the time. These banks
would grow to massive sizes to accommodate entire cities and civilizations (Dumon, 2017). Fast
forward to the mid-20th century, when the computer was invented. Scientists found that this was a
much faster and more reliable way to keep track of things since computer circuits could process
information at a much higher rate than humans could ever dream of doing themselves (Mahoney,
1988). The data was put into organized data tables called spreadsheets. These spreadsheets of
data, accompanied by simple and eventually more complex sorting algorithms, were enough to
satisfy all of the world’s needs. But then, things started to change. The population continued to
grow, and the data analysis algorithms that the world used were just not powerful enough to sort
through millions upon billions of data values and draw meaningful conclusions from them (Dean
& Ghemawat, 2004). Scientists quickly realized that a new solution to this growing dilemma was needed.
One industry particularly affected by the increase in population and the big data
revolution is the medical field. Healthcare providers need to keep accurate and up-to-date records
of each patient that they take care of. The only problem is that they take care of so many patients,
and their datasets containing all of the patients’ valuable data get really big, really fast (Dinov,
2016). Medical professionals need a reliable way to analyze such sets to identify specific
subpopulations and risk factors. Within the past decade, remarkable improvements have been
made in the field of big data analysis, specifically in the field of computer programming. While
standard libraries have given positive results in the past when analyzing datasets, new algorithms
can be used to build upon these results and give a more accurate and more effective measurement of the data.
Literature Review
A major innovation in the field of data analysis was machine learning, which allows
algorithms to “learn” certain relationships within datasets. The first type of machine learning,
unsupervised machine learning, involves inferring relationships from unlabeled data, data that
does not have a categorization or output label. This type of learning mainly seeks to analyze data
and form subpopulations that shed light on different inferences. Supervised machine learning, on
the other hand, uses data in which each data point is labeled with an output value (Béjar, 2014). This means
that a set of inputs in a data point are linked to an output. This type of learning mainly seeks to
make predictions about data points with unknown categorizations from studying data points that
do have output labels. This process has been implemented in the Python programming
language through the scikit-learn library, which provides a variety of algorithms to support both
supervised and unsupervised machine learning (Geron, 2017). Machine learning can be applied
to the healthcare industry by helping to predict a patient’s chances of having a particular disease
based on his or her symptoms and risk factors. However, there is a major problem. The
aforementioned algorithms mainly work with numerical datasets. When categorical attributes
start showing up in datasets that need to be analyzed, these quick and effective functions cannot
be used. When categorical analysis is used, the process takes significantly longer and less useful
information is discovered, making data analysis a hassle for large corporations (He, Xu, & Deng,
2007). Many medical institutions have categorical attributes in their patient datasets, which has
led to the constantly growing problem of how to gather useful information from categorical and mixed datasets.
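To make the distinction between the two styles of learning concrete, the short sketch below is purely illustrative: the patient values, the choice of a k-nearest-neighbors classifier, and the use of k-means clustering are assumptions for demonstration, not part of the original study.

```python
# Illustrative sketch: supervised vs. unsupervised learning with scikit-learn.
# The tiny dataset below is invented purely for demonstration.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cluster import KMeans

# Each row is a hypothetical patient: [age, cholesterol]
X = np.array([[34, 180], [40, 195], [62, 260], [70, 280], [29, 170], [65, 255]])
y = np.array([0, 0, 1, 1, 0, 1])  # labels: 0 = low risk, 1 = high risk

# Supervised: learn the mapping from inputs to known labels, then predict.
clf = KNeighborsClassifier(n_neighbors=3).fit(X, y)
print(clf.predict([[55, 240]]))   # predicted risk label for a new patient

# Unsupervised: no labels are given; the algorithm forms subpopulations itself.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)                 # cluster assignment for each patient
```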
Innovative libraries such as NumPy, SciPy, and Pandas have revolutionized the way that
the Python programming language works with numerical data sets, allowing quick and efficient manipulation of arrays and dataframes. NumPy in particular has become a core dependency due to its universal applications in mathematics. While NumPy can do many
different things, at its core it excels at creating n-dimensional arrays that are scalable and easy to
access (“What is NumPy”, 2015). These arrays, which the vanilla Python programming language
lacks support for, are widely used for organizing large clumps of data into lists that can be easily
modified. However, the method for creating these specialized arrays does not come from Python
itself; the underlying array implementation can be traced back to the programming language C, which gives NumPy its speed (2015). NumPy is especially useful for medical institutions because these
companies oftentimes have a large amount of patient data that they need to store. The specialized
multi-dimensional arrays that NumPy creates can help organize the data in a way that can be
easily read and, most importantly, modified by other algorithms such as those in SciPy and Pandas.
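As a minimal, hypothetical illustration of these multi-dimensional arrays (the measurements and column meanings below are invented), NumPy can store a block of patient readings and let other code inspect or modify it in place:

```python
import numpy as np

# Hypothetical 2-D array: rows are patients, columns are measurements
# (age, systolic blood pressure, cholesterol).
patients = np.array([
    [34, 120, 180],
    [62, 145, 260],
    [70, 150, 280],
])

print(patients.shape)         # (3, 3) - rows x columns
print(patients[:, 2])         # the cholesterol column for every patient
patients[1, 1] = 140          # modify a single reading in place
print(patients.mean(axis=0))  # column-wise averages across patients
```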
SciPy, which is built upon the core methods added by NumPy, adds programming
functionality to many of the processes seen in the science, technology, and engineering fields.
The library encompasses all of the advanced mathematical computations that are seen in
scientific calculations, such as special functions, integration, differential equations, and more
(“Frequently Asked Questions”, n.d.). Like NumPy, SciPy is not actually written in Python. It is largely written in lower-level languages such as C and Fortran, which keeps its computations fast. SciPy is valuable in medicine because, when evaluating diagnostic tests, the rates of false
positives and false negatives are vitally important. These calculations are typically done by
integrating the Receiver Operating Characteristic (ROC) curve, which requires mathematical
calculations like those provided by SciPy (Tsanas, Little, & McSharry, 2013).
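A minimal sketch of that kind of calculation appears below; the ROC points are made up, and trapezoidal integration is only one simple way to approximate the area under the curve, not necessarily the exact method used by Tsanas et al.

```python
import numpy as np
from scipy import integrate

# Hypothetical ROC points (false-positive rate, true-positive rate) for a
# diagnostic test; real values would come from a classifier's predictions.
fpr = np.array([0.0, 0.1, 0.25, 0.5, 1.0])
tpr = np.array([0.0, 0.6, 0.80, 0.9, 1.0])

# Integrate the ROC curve to obtain the area under the curve (AUC);
# integrate.trapezoid requires a reasonably recent SciPy (>= 1.6).
auc = integrate.trapezoid(tpr, fpr)
print(f"Area under the ROC curve: {auc:.3f}")
```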
Pandas is another popular data analysis library that is built extensively off of NumPy.
This library offers support for organizing large amounts of data into visually appealing datasets
that can be easily modified for almost any use. This aids in the task of performing mathematical
calculations on the values within the dataset as well as presenting the dataset to the programmer
or others. The central addition of Pandas is the DataFrame, which acts as an object that contains
all of the data in a way that neatly sorts it into a table upon function call. These DataFrames can
be created, sliced, rearranged, cut, appended to, and more with just a quick function call. These
functions are all designed to benefit the person analyzing the data (“Pandas,” n.d.). Pandas,
unlike NumPy and SciPy, has a good portion of its code written in Python (the rest is in C). This
is essential because, while working on DataFrames, there are many transformations happening
between DataFrames and Python lists/arrays in order to execute the called function. This requires
enough support for Python arrays, which mandates that they be accessed with Python code to work correctly. Like NumPy, Pandas is valuable to medical institutions, which quickly accrue very large patient datasets that they need to store. Pandas is useful because it allows data analysts to easily sort the data and extract meaningful information from it in a way that is useful to the institution.
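For example, a hypothetical patient table (the column names and values below are invented for illustration) can be built, filtered, and summarized with a few Pandas calls:

```python
import pandas as pd

# Hypothetical patient records; column names and values are invented.
df = pd.DataFrame({
    "patient_id":  [101, 102, 103, 104],
    "age":         [34, 62, 70, 29],
    "sex":         ["F", "M", "M", "F"],
    "cholesterol": [180, 260, 280, 170],
})

# Slice, sort, and summarize with single function calls.
high_risk = df[df["cholesterol"] > 240]          # select a subpopulation
by_age = df.sort_values("age", ascending=False)  # reorder the table
print(high_risk)
print(by_age)
print(df.groupby("sex")["cholesterol"].mean())   # average cholesterol by sex
```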
Another Python library that is vital to GALILEO is Scikit-learn, which provides algorithms for both supervised and unsupervised machine learning. Since Python is a programming language that is widely considered to be adept at data manipulation and efficiency,
the combination of Python with a library such as scikit-learn allows for fast, efficient, and
accurate machine learning, which is quickly becoming more and more necessary in today’s
digitally-enhanced world.
These four Python libraries - NumPy, SciPy, Pandas, and Scikit-Learn - have
revolutionized the way that programmers work with datasets. The algorithms that these libraries
contain can be easily applied to the medical field in order to create useful data that can help save lives.
The field of statistics and probability analysis as it relates to computer science has also
revolutionized the way that big data is examined by researchers. Statistical values such as the
mean, range, and standard deviation have allowed researchers to analyze numerical data sets
quickly. The mean, or average, of a dataset is a representation of the central tendency of that
dataset and helps collectively represent a population with one value. The range of a dataset
exhibits the lowest and highest endpoints of a particular set of data points, and helps set up
bounds for a cluster of data points and/or a dataset. The standard deviation of a dataset represents
the dispersion or deviation of a set of values from the group as a whole and helps denote the variability within the dataset. In addition to these statistical values, there are also probability analysis techniques that help statisticians better understand a dataset.
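As a quick illustration of these descriptive statistics (using invented cholesterol readings), each value can be computed with a single NumPy call:

```python
import numpy as np

# Hypothetical cholesterol readings for a small group of patients.
values = np.array([170, 180, 195, 255, 260, 280])

mean = values.mean()                         # central tendency of the dataset
value_range = (values.min(), values.max())   # lowest and highest endpoints
std = values.std(ddof=1)                     # sample standard deviation (dispersion)

print(mean, value_range, round(std, 1))
```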
The first major probability theory is the Bayesian model. Unlike other probability
models, the Bayesian model focuses on reasonable expectation. It uses random variables and conditional probabilities to analyze data points accurately and efficiently. In order to do this, Bayes’ theorem is used, which says that the probability of one event, A, being true given that a second event, B, is true is equal to the probability of B being true given that A is true, multiplied by the probability of A being true independent of B, divided by the probability of B being true. Written mathematically, with the bar (|) representing “given,” this is P(A|B) = P(B|A) × P(A) / P(B). The whole purpose of Bayesian probability analysis is to test the truthfulness of a specified hypothesis against the given data and create a conditional probability value as a result (LaMorte, 2016, “Bayes’s Theorem”).
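A small worked example, with invented prevalence and test-accuracy numbers, shows how the theorem turns a test result into a conditional probability of disease:

```python
# Worked Bayes' theorem example with invented numbers: the probability a
# patient has a disease (A) given a positive test result (B).
p_disease = 0.01            # P(A): prevalence of the disease
p_pos_given_disease = 0.95  # P(B|A): test sensitivity
p_pos_given_healthy = 0.05  # false-positive rate

# Total probability of a positive test, P(B)
p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)

# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(round(p_disease_given_pos, 3))  # roughly 0.161
```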
The second major probability model is the Gaussian model. The Gaussian model, also called the normal model, is a very common
distribution that happens to be a continuous outcome model. The main point of the model is to
allow statisticians to predict the probability of a certain value occurring in a dataset. This
information is used to create better models of datasets that are applicable to the real world
(LaMorte, 2016, “The Normal Distribution”). The Gaussian model is essentially a continuous
graph. On the x-axis, it graphs the value of the variable that it is measuring. On the y-axis, the
frequency at which that value occurs in a hypothetical dataset is plotted. The Gaussian model is
based on the assumption that the dataset is normally distributed. This means that there is an equal proportion of
data points both below and above the mean value. The median and the mode are also in the
center of the graph. In a skewed dataset, the graph is weighted towards either the left or right
sides, which causes the mean, median, and mode to be in different parts of the curve.
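A brief sketch using SciPy's normal distribution (the mean and standard deviation below are assumed values for a hypothetical blood-pressure attribute) illustrates how the model assigns probabilities:

```python
from scipy.stats import norm

# Hypothetical normally distributed attribute: systolic blood pressure
# with an assumed mean of 120 and standard deviation of 15.
mu, sigma = 120, 15

# Probability density at the mean (the peak of the bell curve).
print(norm.pdf(120, loc=mu, scale=sigma))

# Probability that a randomly drawn value falls between 105 and 135,
# i.e. within one standard deviation of the mean (about 0.68).
print(norm.cdf(135, mu, sigma) - norm.cdf(105, mu, sigma))
```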
While libraries such as NumPy, SciPy, Pandas, and Scikit-learn have revolutionized the
way that people handle and predict big data, these libraries are currently only effective with numerical data. When medical institutions cannot organize and analyze large patient datasets with categorical or mixed attributes, the accuracy of
their physicians’ diagnoses will start to falter due to long hours of work. GALILEO, the categorical
data analysis tool from the JHU/APL, will help to solve this problem by creating various clusters
of subpopulations based on their attributes and values. This will allow medical personnel to
easily extrapolate extremely useful information from complex datasets, which lowers the time
and cost of diagnosis and care while improving the accuracy and precision. This includes helping
to reduce the chances of a misdiagnosis of a disease, such as a false positive or false negative.
GALILEO can revolutionize the way the world handles data, especially in the medical field.
Data Collection/Methods
In order to answer the research question of how to best gather valuable knowledge on
large patient datasets with categorical and mixed attributes, GALILEO, a new unsupervised learning tool from the JHU/APL, was tested as a possible aid for the diagnostic process. It was hypothesized that GALILEO would more effectively draw relationships and decrease the costs and time involved in the diagnostic process. GALILEO uses
entropy-based data metrics in order to cluster datasets into pools of data points that share similar
attributes. A general mixture model is used in order to cluster the data into groups that have
similar data points (Graff, Savkli, Lin, & Kinsey, 2017). This finite mixture model uses
algorithms derived from both Naïve Bayes and Gaussian mixture models, making it a universal approach that can be applied to nearly any type of dataset.
A seed value makes the analysis random each time it is run. The k-value determines the starting and maximum number of clusters in the dataset. Additionally, random points are selected as the seeds for the initial clusters in GALILEO.
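GALILEO itself is not reproduced here, but the sketch below illustrates the same model-selection idea on numerical data: fit a mixture model for a range of cluster counts bounded by a k-value, score each fit with an information criterion, and keep the best one. It uses scikit-learn's GaussianMixture as a stand-in for GALILEO's entropy-based categorical mixture model, so it is an analogy rather than the actual algorithm.

```python
# Analogous sketch only: GALILEO's entropy-based categorical clustering is not
# reproduced here. This shows the model-selection idea with scikit-learn's
# GaussianMixture on synthetic numerical data.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(5, 1, (100, 2))])

best_k, best_bic = None, np.inf
for k in range(2, 11):                      # k bounds the number of clusters tried
    gm = GaussianMixture(n_components=k, random_state=0).fit(X)
    bic = gm.bic(X)                         # Bayesian Information Criterion
    if bic < best_bic:
        best_k, best_bic = k, bic

print("Best number of clusters by BIC:", best_k)
```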
The Mushrooms Dataset was pulled from the University of California - Irvine (UCI) Machine
Learning Repository. It consisted of 23 attributes (columns) and 8124 rows of data. The dataset contained categorical values describing the physical characteristics of each mushroom.
Results
When GALILEO was run on the Mushrooms Dataset, the optimal number of clusters found was 23. This was the same value given by the Akaike, Bayesian, and Density Information Criteria. The
consistency seen in this result is promising. The resulting confusion matrix shows intense but
precise areas of concentration, which denotes distinct clusters with well-defined boundaries.
Additionally, even though it was not included in the dataset in any way, GALILEO found that
there were 23 different species of mushrooms used in the dataset. This shows that the algorithm
really does have potential for unsupervised machine learning on large data sets such as this one.
This was a benchmark test that gave proof of what GALILEO could do.
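The confusion matrices described in these results can be produced with scikit-learn by comparing known labels against cluster assignments; the tiny arrays below are hypothetical and only show the mechanics:

```python
# Hypothetical sketch: comparing known labels against cluster assignments
# with scikit-learn's confusion matrix.
from sklearn.metrics import confusion_matrix

true_labels = [0, 0, 0, 1, 1, 2, 2, 2]   # e.g. species recorded in the data
cluster_ids = [1, 1, 1, 0, 0, 2, 2, 0]   # e.g. clusters found by the algorithm

cm = confusion_matrix(true_labels, cluster_ids)
print(cm)  # each row concentrated in one column indicates well-separated clusters
```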
The Heart Disease Dataset was pulled from the University of California - Irvine (UCI) Machine
Learning Repository. It consisted of 14 attributes (columns) and 313 rows of data. The dataset
contained different words and numbers representing different symptoms and risk factors of heart disease, such as chest pain type and cholesterol level.
Results
The results generated by GALILEO for the Heart Disease Dataset were less than satisfactory.
GALILEO found the best number of clusters to be 2, which is the absolute minimum value
allowed in the program code. Additionally, the resulting confusion matrix generated by Scikit-
learn shows intense population spread over a wide area. This denotes that the boundaries between clusters are blurred and that the algorithm cannot confidently decide where to place certain
points. It was concluded that this unsatisfactory result was due to two main issues. The first
problem was the presence of too many unique values in the attribute space. For example, the
“cholesterol” attribute had numbers such as ‘291.4’ and ‘291.5’. GALILEO treats these as
completely different numerical values, and the presence of 200+ unique data points in each
attribute caused GALILEO to fail. Additionally, there were only 313 instances of data in this set,
compared to the 8000+ seen in the Mushrooms dataset. With any type of machine learning, the
more instances of data a set has, the better the results are when it is analyzed. These
two problems caused GALILEO to perform very poorly when trying to analyze the Cleveland Heart Disease Dataset.
The CASP Protein Dataset was pulled from the University of California - Irvine (UCI) Machine
Learning Repository. It consisted of 9 attributes (columns) and 46000 rows of binned data. The dataset contained numerical values describing the physical properties of protein structures.
Results
For this dataset, GALILEO performed noticeably better than on the Cleveland Heart Disease
Data Set. This data set was designed to fix the problems found in the Heart Disease data set. It
was a lot larger - around 46000 instances of data were available (the set was cut to 10000 due to a memory
overload error). This gave it a lot more instances to work with, which helped the unsupervised
and supervised machine learning processes greatly. Additionally, the data values within the
attributes were binned by groups of numbers (which were dependent on the range of values in
that attribute). In the end, each attribute only had 10 unique values, instead of the 200-300
without binning. This decrease in unique values helped GALILEO connect the dots more and
perform better in the end. The GALILEO algorithm found the best number of clusters to be 23. It
turns out that the original dataset had 21 different species of proteins cataloged (this was NOT
exhibited in the dataset at all). The resulting confusion matrix was a lot better. Even though it
showed some error/spread, it showed a few distinct clusters with pretty well-defined boundaries.
So, GALILEO was not far off in its machine learning estimations, which shows a lot of promise
when it comes to analyzing datasets. However, it is worth noting that only a portion of this data set could be analyzed because of the memory limitation mentioned above.
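The binning step described above can be sketched with Pandas; the column name and values below are hypothetical, and pd.cut is one straightforward way to reduce a numerical attribute to ten discrete bins:

```python
import pandas as pd

# Sketch of the binning step described above: reduce a numerical attribute
# to 10 discrete bins so the clustering sees few unique values per column.
# "cholesterol" is a hypothetical column name used for illustration.
df = pd.DataFrame({"cholesterol": [170.2, 291.4, 291.5, 245.0, 199.9, 310.7]})

df["cholesterol_binned"] = pd.cut(df["cholesterol"], bins=10, labels=False)
print(df["cholesterol_binned"].nunique())  # at most 10 unique values remain
print(df)
```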
Conclusion
The results from the three experiments clearly show that GALILEO has an enormous
amount of potential in the unsupervised machine learning field. It excels in finding new
information about datasets that wasn’t explicitly stated within the dataset itself. It found that
there were 23 species of mushrooms without it knowing that there were 23 species to begin with.
It came close to identifying the 21 species of proteins in the CASP Protein Data Set. Based on these results, GALILEO can have
an enormous impact on not just the medical field, but other fields such as archaeology and
economics as well. The possibilities are endless. With that being said, GALILEO does have a
small set of requirements for it to work as intended. The first is that there needs to be either a
small set of unique values for each attribute OR there needs to be binning put into place by the
researcher. This enables connections to be made between clusters in the data set. Secondly, the
dataset must be large enough to gain enough information for the unsupervised machine learning
to actually take place. With more work and testing, GALILEO can be fine-tuned to make it more universally applicable to all kinds of datasets. From there, it can start making a lasting positive impact.
References
Akinfaderin, W. (2017, March 23). The mathematics of machine learning. Retrieved from
https://towardsdatascience.com/the-mathematics-of-machine-learning-894f046c568
Béjar, J. (2014, September). Unsupervised machine learning and data mining. Retrieved
from http://www.cs.upc.edu/~bejar/amlt/material/AMLTTransBook.pdf
Brownlee, J. (2016, May 16). Visualize machine learning data in Python with Pandas.
python-pandas/
Dean, J., & Ghemawat, S. (2004). MapReduce: Simplified data processing on large clusters.
Dinov, I. D. (2016, March 17). Volume and value of big healthcare data. Retrieved from
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4795481/pdf/nihms-766954.pdf
https://www.investopedia.com/articles/financialcareers/09/ancient-accounting.asp
https://www.scipy.org/scipylib/faq.html
Géron, A. (2017). Hands-on machine learning with Scikit-Learn and TensorFlow: Concepts, tools, and techniques to build intelligent systems. Sebastopol, CA: O'Reilly Media.
Graff, P., Savkli, C., Lin, J., & Kinsey, M. (2017, July 18). GALILEO: Generalized low-entropy mixture model.
He, Z., Xu, X., & Deng, S. (2007). Attribute value weighting in k-modes clustering.
LaMorte, W. W. (2016, July 24). Bayes's theorem. Retrieved from http://sphweb.bumc.bu.edu/otlt/MPH-Modules/BS/BS704_Probability/BS704_Probability6.html
LaMorte, W. W. (2016, July 24). The normal distribution: A probability model for a continuous outcome. Retrieved from http://sphweb.bumc.bu.edu/otlt/MPH-Modules/BS/BS704_Probability/BS704_Probability8.html
from https://www.princeton.edu/~hos/mike/articles/hcht.pdf
https://pandas.pydata.org/index.html
Paruchuri, V. (2016, October 18). NumPy tutorial: Data analysis with Python. Retrieved
from https://www.dataquest.io/blog/numpy-tutorial-python/
Tsanas, A., Little, M. A., & McSharry, P. E. (2013). A methodology for the analysis of medical data. Retrieved from https://people.maths.ox.ac.uk/tsanas/Preprints/A%20methodology%20for%20the%20analysis%20of%20medical%20data_website.pdf
https://docs.scipy.org/doc/numpy-1.10.0/user/whatisnumpy.html