0% found this document useful (0 votes)

94 views

Sean Jordan Synthesis Paper

The document discusses how advanced categorical data analysis can impact the healthcare industry. It describes how large datasets in healthcare require powerful analysis methods beyond traditional sorting algorithms. Machine learning algorithms were developed to learn relationships in datasets, but they primarily work with numerical data and struggle with categorical attributes common in medical data. New Python libraries like NumPy, SciPy, Pandas, and Scikit-learn have improved capabilities for numerical analysis and machine learning, but still have limitations for categorical data analysis that is needed in healthcare.

Uploaded by

api-306603109

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as DOCX, PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

94 views

Sean Jordan Synthesis Paper

Uploaded by

api-306603109

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as DOCX, PDF, TXT or read online on Scribd

You are on page 1/ 18

The Impacts of Advanced Categorical Data Analysis on the Healthcare Industry

Sean Jordan
Beth Dungey
Philip Graff
9 May 2018
Intern/Mentor I G/T

Introduction
Jordan 1

Back in the old times of the world, all transactions and all records were kept on paper and

stored in large data banks because this was the best method available at the time. These banks

would grow to massive sizes to accommodate entire cities and civilizations (Dumon, 2017). Fast

forward to the mid 20th-century and the computer was invented. Scientists found that this was a

much faster and more reliable way to keep track of things since computer circuits could process

information at a much higher rate than humans could ever dream of doing themselves (Mahoney

1988). The data was put into organized data tables called spreadsheets. These spreadsheets of

data, accompanied with simple and eventually more complex sorting algorithms, were enough to

satisfy all of the world’s needs. But then, things started to change. The population continued to

grow, and the data analysis algorithms that the world used were just not powerful enough to sort

through millions upon billions of data values and draw meaningful conclusions from them (Dean

& Ghemawat, 2004). Scientists quickly realized that a new solution to this growing dilemma was

in order. Thus, the big data era began.

One industry particularly affected by the increase in population and the big data

revolution is the medical field. Healthcare providers need to keep accurate and up-to-date records

of each patient that they take care of. The only problem is that they take care of so many patients,

and their datasets containing all of the patients’ valuable data gets really big, really fast (Dinov,

2016). Medical professionals need a reliable way to analyze such sets to identify specific

subpopulations and risk factors. Within the past decade, remarkable improvements have been

made in the field of big data analysis, specifically in the field of computer programming. While

standard libraries have given positive results in the past when analyzing datasets, new algorithms

can be used to build upon these results and give a more accurate and more effective measurement

of the given data.

Jordan 2

Literature Review

A major innovation in the field of data analysis was machine learning, which allows

algorithms to “learn” certain relationships within datasets. The first type of machine learning,

unsupervised machine learning, involves inferring relationships from unlabeled data, data that

does not have a categorization or output label. This type of learning mainly seeks to analyze data

and form subpopulations that give light to different inferences. Supervised machine learning, on

the other hand, uses data that is categorized based on its input values (Bejar, 2014). This means

that a set of inputs in a data point are linked to an output. This type of learning mainly seeks to

make predictions about data points with unknown categorizations from studying data points that

do have labels to output values. This process has been implemented in the Python programming

language through the scikit-learn library, which provides a variety of algorithms to support both

supervised and unsupervised machine learning (Geron, 2017). Machine learning can be applied

to the healthcare industry by helping to predict a patient’s chances of having a particular disease

based on his or her symptoms and risk factors. However, there is a major problem. The

aforementioned algorithms mainly work with numerical datasets. When categorical attributes

start showing up in datasets that need to be analyzed, these quick and effective functions cannot

be used. When categorical analysis is used, the process takes significantly longer and less useful

information is discovered, making data analysis a hassle for large corporations (He, Xu, & Deng,

2007). Many medical institutions have categorical attributes in their patient datasets, which has

led to the constantly growing problem of how to gather useful information from categorical and

mixed datasets in hospitals.

Jordan 3

Innovative libraries such as NumPy, SciPy, and Pandas have revolutionized the way that

the Python programming language works with numerical data sets, allowing quick and

meaningful analysis of large sets of

numbers with tools such as matrices, arrays,

and dataframes.

NumPy is a fundamental Python library that

adds support for multidimensional arrays

and matrices. This library is used

extensively with other libraries as a

dependency due to its universal applications in mathematics. While NumPy can do many

different things, at its core it excels at creating n-dimensional arrays that are scalable and easy to

access (“What is NumPy”, 2015). These arrays, which the vanilla Python programming language

lacks support for, are widely used for organizing large clumps of data into lists that can be easily

modified. However, the method for creating these specialized arrays does not come from Python

itself - it can be traced back to the programming language C, which has a similar object-oriented

design (2015). Numpy is efficient to use specifically for medical institutions because these

companies oftentimes have a large amount of patient data that they need to store. The specialized

multi-dimensional arrays that NumPy creates can help organize the data in a way that can be

easily read and, most importantly, modified, by other algorithms such as those in SciPy and

Pandas (Paruchuri, 2016).

SciPy, which is built upon the core methods added by NumPy, adds programming

functionality to many of the processes seen in the science, technology, and engineering fields.

The library encompasses all of the advanced mathematical computations that are seen in
Jordan 4

scientific calculations, such as special functions, integration, differential equations, and more

(“Frequently Asked Questions”, n.d.). Like NumPy, SciPy is not actually written in Python. It is

built on a combination of the C and Fortran

programming languages, which help streamline

the algorithm design and, ultimately, make it

faster and more efficient. This high efficiency is

key, which is why it is so popular among machine

learning and data analysis engineers (Akinfaderin,

2017). In the medical field, determining false

positives and false negatives are vitally important. These calculations are typically done by

integrating the Receiver Operating Characteristic (ROC) curve, which requires mathematical

calculations like those provided by SciPy (Tsanas, Little, & McSharry, 2013).

Pandas is another popular data analysis library that is built extensively off of NumPy.

This dataset offers support for organizing large clumps of data into visually appealing datasets

that can be easily modified for almost any use. This aides in the task of performing mathematical

calculations on the values within the dataset as well as presenting the dataset to the programmer

or others. The central addition of Python is the DataFrame, which acts as an object that contains

all of the data in a way that neatly sorts it into a table upon function call. These DataFrames can

be created, spliced, rearranged, cut, appended to, and more with just a quick function call. These

functions are all designed to benefit the person analyzing the data (“Pandas,” n.d.). Pandas,

unlike NumPy and SciPy, has a good portion of its code written in Python (the rest is in C). This

is essential because, while working on DataFrames, there are many transformations happening

between DataFrames and Python lists/arrays in order to execute the called function. This requires
Jordan 5

enough support for Python arrays, which mandates that they be accessed with Python code to

make the process go as quickly and as cleanly as possible. As aforementioned, healthcare

institutions quickly accrue very large patient datasets that they need to store. Pandas is useful

because it would allow data analysts to easily sort the data and extrapolate meaningful data from

it in a way that is useful to the company. Another Python library that is vital to GALILEO is

Scikit-learn.

Scikit-learn is a python library designed exclusively for machine learning. It is packed

full of algorithms devoted to both

unsupervised machine learning as well as

sup

ervi

sed

chi

ne learning. Since Python is an object-oriented

programming language that is widely considered to be adept at data manipulation and efficiency,

the combination of Python with a library such as scikit-learn allows for fast, efficient, and

accurate machine learning, which is quickly becoming more and more necessary in today’s

digitally-enhanced world.

These four Python libraries - NumPy, SciPy, Pandas, and Scikit-Learn - have

revolutionized the way that programmers work with datasets. The algorithms that these libraries

contain can be easily applied to the medical field in order to create useful data that can help save

both money and lives.

Jordan 6

The field of statistics and probability analysis as it relates to computer science has also

revolutionized the way that big data is examined by researchers. Statistical values such as the

mean, range, and standard deviation have allowed researchers to analyze numerical data sets

quickly. The mean, or average, of a dataset is a representation of the central tendency of that

dataset and helps collectively represent a population with one value. The range of a dataset

exhibits the lowest and highest endpoints of a particular set of data points, and helps set up

bounds for a cluster of data points and/or a dataset. The standard deviation of a dataset represents

the dispersion or deviation of a set of values from the group as a whole and helps denote the

accuracy or concentration of a dataset in relation to that dataset’s mean. In addition to these

statistical values, there are also probability analysis techniques to help better understand a dataset

and its values.

Jordan 7

The first major probability theory is the Bayesian model. Unlike other probability

models, the Bayesian model focuses on the reasonably expected. It uses random variables (thus

making it entropy-based) to model sources of uncertainty in a dataset in hopes to predict future

points accurately and efficiently. In order to do this, Bayes’ theorem is used, which says that the

probability of one event, A, being true given that a second event, B, is true is equal to the

probability of the B being true given that A is true multiplied by the probability of A being true

independent of B. This

whole product is then

divided by the probability of

B being true independent of

A. The image to the right

shows a mathematical representation, with the bar (|) representing given. The whole purpose of

Bayesian probability analysis is to test the truthfulness of a specified hypothesis versus the given

data and create a conditional probability value as a result ("LaMorte, 2016, “Baye's Theorem”).

Another major probability theory used extensively in computer-based statistics is the

Gaussian model. The Gaussian model, also called the normal model, is a very common

distribution that happens to be a continuous outcome model. The main point of the model is to

allow statisticians to predict the probability of a certain value occuring in a dataset. This

information is used to create better models of datasets that are applicable to the real world

(LaMorte, 2017, “The Normal Distribution”). The Gaussian model is essentially a continuous

graph. On the x-axis, it graphs the value of the variable that it is measuring. On the y-axis, the

frequency at which that value occurs in a hypothetical dataset is plotted. The Gaussian model is

based off of the fact that the dataset is normal. This means that there is an equal proportion of
Jordan 8

data points both below and above the mean value. The median and the mode are also in the

center of the graph. In a skewed dataset, the graph is weighted towards either the left or right

sides, which causes the mean, median, and mode to be in different parts of the curve.

While libraries such as NumPy, SciPy, Pandas, and Scikit-learn have revolutionized the

way that people handle and predict big data, these libraries are currently only effective with

quantitative data, not categorical or mixed data.

Healthcare providers are getting overwhelmed

with the amount of patients that they need to

treat, and much of the data that they store on the

patient’s behalf is categorical or mixed data.

Hospitals are having a hard time digesting this

data and extrapolating useful information from

it. Without a reliable and efficient method to

organize and analyze large patient datasets with categorical or mixed attributes, the accuracy of

their physicians’ diagnosis will start to falter due to long hours of work. Galileo, the categorical

data analysis tool from the JHU/APL, will help to solve this problem by creating various clusters

of subpopulations based on their attributes and values. This will allow medical personnel to

easily extrapolate extremely useful information from complex datasets, which lowers the time

and cost of diagnosis and care while improving the accuracy and precision. This includes helping

to reduce the chances of a misdiagnosis of a disease, such as a false positive or false negative.

Galileo can revolutionize the way the world handles data, especially in the medical field.

Data Collection/Methods

In order to answer the research question of how to best gather valuable knowledge on
Jordan 9

large patient datasets with categorical and mixed attributes, Galileo, a new unsupervised learning

algorithm developed at the

JHU/APL was used to draw

relationships between patient data

for the diagnostic process. It was hypothesized that Galileo would more effectively draw

relationships and decrease the costs and time involved in the diagnostic process. Galileo uses

entropy-based data metrics in order to cluster datasets into pools of data points that share similar

attributes. A general mixture model is used in order to cluster the data into groups that have

similar data points (Graff, Savkli, Lin, & Kinsey, 2017). This finite mixture model uses

algorithms derived from both Naïve Bayes and Gaussian mixture models, making it a universal

solution for solving complex data problems. This

model is further described by the superposition

of probability distributions for clusters in the

dataset.

Standard and random variables are

present throughout the Galileo, making data

analysis random each time it is run. The k-value determines the starting and maximum number of

clusters in the dataset. Additionally, random points are selected as the seeds for the initial

clusters in Galileo.

Results and Data Analysis

The Mushrooms Dataset

Jordan 10

The Mushrooms Dataset was pulled from the University of California - Irvine (UCI) Machine

Learning Repository. It consisted of 23 attributes (columns) and 8124 rows of data. The dataset

contained different letters describing different characteristics of mushrooms.

Results

When GALILEO was run on the Mushrooms Dataset, the optimal clusters found was 23. This

was the same value for the Akaike, Bayesian, and Density Information Criterion. The

consistency seen in this result is promising. The resulting confusion matrix shows intense but

precise areas of concentration, which denotes distinct clusters with well-defined boundaries.

Additionally, even though it was not included in the dataset in any way, GALILEO found that

there were 23 different species of mushrooms used in the dataset. This shows that the algorithm

really does have potential for unsupervised machine learning on large data sets such as this one.

This was a benchmark test that gave proof of what GALILEO could do.
Jordan 11

The Cleveland Heart Disease Dataset

The Heart Disease Dataset was pulled from the University of California - Irvine (UCI) Machine

Learning Repository. It consisted of 14 attributes (columns) and 313 rows of data. The dataset

contained different words and numbers representing different symptoms of heart disease, like

amount of cholesterol or body parts affected by pain.

Results

The results generated by GALILEO for the Heart Disease Dataset were less than satisfactory.

GALILEO found the best number of clusters to be 2, which is the absolute minimum value

allowed in the program code. Additionally, the resulting confusion matrix generated by Scikit-

learn shows a lot of intense population over a wide spread of area. This denotes that there is a lot
Jordan 12

of blurred lines between clusters, and the algorithm doesn’t really know where to put certain

points. It was concluded that this unsatisfactory result was due to two main issues. The first

problem was the presence of too many unique values in the attribute space. For example, the

“cholesterol” attribute had numbers such as ‘291.4’ and ‘291.5’. GALILEO treats these as

completely different numerical values, and the presence of 200+ unique data points in each

attribute caused GALILEO to fail. Additionally, there were only 313 instances of data in this set,

compared to the 8000+ seen in the Mushrooms dataset. With any type of machine learning, the

more instances of data a set has, the better results that become of it when it is analyzed. These

two problems caused GALILEO to perform very poorly when trying to analyze the Cleveland

Heart Disease Data Set.

The CASP Protein Dataset

The CASP Protein Dataset was pulled from the University of California - Irvine (UCI) Machine

Learning Repository. It consisted of 9 attributes (columns) and 46000 rows of binned data. The

dataset contained different numerical values representing different physicochemical

characteristics of certain species of proteins found in the human body.

Results
Jordan 13

For this dataset, GALILEO performed noticeably better than on the Cleveland Heart Disease

Data Set. This data set was designed to fix the problems found in the Heart Disease data set. It

was a lot larger - around 46000 instances of data were used (it was cut to 10000 due to a memory

overload error). This gave it a lot more instances to work with, which helped the unsupervised

and supervised machine learning processes greatly. Additionally, the data values within the

attributes were binned by groups of numbers (which were dependent on the range of values in

that attribute). In the end, each attribute only had 10 unique values, instead of the 200-300

without binning. This decrease in unique values helped GALILEO connect the dots more and

perform better in the end. The GALILEO algorithm found the best number of clusters to be 23. It

turns out that the original dataset had 21 different species of proteins cataloged (this was NOT

exhibited in the dataset at all). The resulting confusion matrix was a lot better. Even though it

showed some error/spread, it showed a few distinct clusters with pretty well-defined boundaries.

So, GALILEO was not far off in its machine learning estimations, which shows a lot of promise

when it comes to analyzing datasets. However, it is worth noting that this data set was only

quantitative in nature, not mixed or categorical.

Jordan 14

Conclusion

The results from the three experiments clearly show that GALILEO has an enormous

amount of potential in the unsupervised machine learning field. It excels in finding new

information about datasets that wasn’t explicitly stated within the dataset itself. It found that

there were 23 species of mushrooms without it knowing that there were 23 species to begin with.

It nearly found 21 species for the CASP Protein Data Set. Based off these, GALILEO can have

an enormous impact on not just the medical field, but other fields such as archaeology and

economics as well. The possibilities are endless. With that being said, GALILEO does have a

small set of requirements for it to work as intended. The first is that there needs to be either a

small set of unique values for each attribute OR there needs to be binning put into place by the

researcher. This enables connections to be made between clusters in the data set. Secondly, the

dataset must be large enough to gain enough information for the unsupervised machine learning

to actually take place. With more work and testing, GALILEo can be fine tuned to make it more

universally applicable to all kinds of datasets. From there, ti can start making a lasting positive

impact on the field of big data analysis.

Jordan 15

References

Akinfaderin, W. (2017, March 23). The mathematics of machine learning. Retrieved from

https://towardsdatascience.com/the-mathematics-of-machine-learning-894f046c568

Béjar, J. (2014, September). Unsupervised machine learning and data mining. Retrieved

from http://www.cs.upc.edu/~bejar/amlt/material/AMLTTransBook.pdf

Brownlee, J. (2016, May 16). Visualize machine learning data in Python with Pandas.

Retrieved from https://machinelearningmastery.com/visualize-machine-learning-data-

python-pandas/

Dean, J., & Ghemawat, S. (2004). MapReduce: Simplified data processing on large

clusters. Retrieved from http://web.mit.edu/6.033/www/papers/mapreduce-osdi04.pdf

Jordan 16

Dinov, I. D. (2016, March 17). Volume and value of big healthcare data. Retrieved from

HHS Public Access website:

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4795481/pdf/nihms-766954.pdf

Dumon, M. (2017, December 14). Ancient accounting systems. Retrieved from

https://www.investopedia.com/articles/financialcareers/09/ancient-accounting.asp

Frequently asked questions. (n.d.). Retrieved from

https://www.scipy.org/scipylib/faq.html

Geron, A. (2017). The fundamentals of machine learning. In Hands-on machine learning

with Scikit-Learn and TensorFlow: Concepts, tools, and techniques to build intelligent

systems (pp. 3-228). Sebastopol, CA: O’Reilly.

Graff, P., Savkli, C., Lin, J., & Kinsey, M. (2017, July 18). GALILEO:

Generalized low-entropy

mixture model [Microsoft Powerpoint].

He, Z., Xu, X., & Deng, S. (2007). Attribute value weighting in k-modes clustering.

Retrieved from Harbin Institute of Technology - Department of Computer Science and

Engineering website: https://arxiv.org/pdf/cs/0701013.pdf

LaMorte, W. W. (2016, July 24). Bayes’s Theorem. Retrieved from The Role of

Probability website: http://sphweb.bumc.bu.edu/otlt/MPH-

Modules/BS/BS704_Probability/BS704_Probability6.html

LaMorte, W. W. (2016, July 24). The normal distribution: A probability model for a

continuous outcome. Retrieved from The Role of Probability website:

http://sphweb.bumc.bu.edu/otlt/MPH-

Modules/BS/BS704_Probability/BS704_Probability8.html
Jordan 17

Mahoney, M. S. (1988). The history of computing in the history of technology. Retrieved

from https://www.princeton.edu/~hos/mike/articles/hcht.pdf

Pandas: A Python data analysis library. (n.d.). Retrieved from

https://pandas.pydata.org/index.html

Paruchuri, V. (2016, October 18). NumPy tutorial: Data analysis with Python. Retrieved

from https://www.dataquest.io/blog/numpy-tutorial-python/

Tsanas, A., Little, M. A., & McSharry, P. E. (2013). [A methodology for the analysis of

medical data]. In J. P. Sturmberg & C. M. Martin (Authors), Handbook of systems and

complexity in health (pp. 113-125). Retrieved from

https://people.maths.ox.ac.uk/tsanas/Preprints/A%20methodology%20for%20the%20anal

ysis%20of%20medical%20data_website.pdf

What is Numpy? (2015). Retrieved from Scipy Foundation website:

https://docs.scipy.org/doc/numpy-1.10.0/user/whatisnumpy.html

Python Crash Course: The Complete Step-By-Step Guide On How to Come Up Easily With Your First Data Science Project From Scratch In Less Than 7 Days
From Everand
Python Crash Course: The Complete Step-By-Step Guide On How to Come Up Easily With Your First Data Science Project From Scratch In Less Than 7 Days
Simon Tallman
No ratings yet
Python Libraries and Packages For Data Science
100% (1)
Python Libraries and Packages For Data Science
5 pages
Practical Data Analysis
From Everand
Practical Data Analysis
Hector Cuesta
4.5/5 (14)
Mastering Data Science with Python: The Ultimate Guide: Unlock the Power of Data Analysis and Visualization with Python's Cutting-Edge Tools and Techniques
From Everand
Mastering Data Science with Python: The Ultimate Guide: Unlock the Power of Data Analysis and Visualization with Python's Cutting-Edge Tools and Techniques
daniel Huston
No ratings yet
PYTHON DATA ANALYTICS: Mastering Python for Effective Data Analysis and Visualization (2024 Beginner Guide)
From Everand
PYTHON DATA ANALYTICS: Mastering Python for Effective Data Analysis and Visualization (2024 Beginner Guide)
FLOYD BAX
No ratings yet
Data Science with Python: Unlocking the Power of Pandas and Numpy
From Everand
Data Science with Python: Unlocking the Power of Pandas and Numpy
Robert Johnson
No ratings yet
Data Mining: Concepts, Fundamentals And Applications
From Everand
Data Mining: Concepts, Fundamentals And Applications
Enrico Guardelli
No ratings yet
School of Computing and Creative Media XBIS 2023 Data Science Assignment Report
No ratings yet
School of Computing and Creative Media XBIS 2023 Data Science Assignment Report
21 pages
Python Data Wrangling for Business Analytics: Python for Business Analytics Series
From Everand
Python Data Wrangling for Business Analytics: Python for Business Analytics Series
George Snypes
2/5 (1)
Python for Data Science: A Practical Approach to Machine Learning
From Everand
Python for Data Science: A Practical Approach to Machine Learning
Jarrel E.
No ratings yet
Data Science Mastery: From Beginner to Expert in Big Data Analytics
From Everand
Data Science Mastery: From Beginner to Expert in Big Data Analytics
Kameron Hussain
No ratings yet
Hands-on Data Analysis and Visualization with Pandas: Engineer, Analyse and Visualize Data, Using Powerful Python Libraries
From Everand
Hands-on Data Analysis and Visualization with Pandas: Engineer, Analyse and Visualize Data, Using Powerful Python Libraries
PURNA CHANDER RAO. KATHULA
5/5 (1)
Graph Data Science with Python and Neo4j: Hands-on Projects on Python and Neo4j Integration for Data Visualization and Analysis Using Graph Data Science for Building Enterprise Strategies
From Everand
Graph Data Science with Python and Neo4j: Hands-on Projects on Python and Neo4j Integration for Data Visualization and Analysis Using Graph Data Science for Building Enterprise Strategies
Timothy Eastridge
No ratings yet
Graph Data Science with Python and Neo4j: Hands-on Projects on Python and Neo4j Integration for Data Visualization and Analysis Using Graph Data Science for Building Enterprise Strategies (English Edition)
From Everand
Graph Data Science with Python and Neo4j: Hands-on Projects on Python and Neo4j Integration for Data Visualization and Analysis Using Graph Data Science for Building Enterprise Strategies (English Edition)
Timothy Eastridge
No ratings yet
Data Science Essentials: Machine Learning and Natural Language Processing
From Everand
Data Science Essentials: Machine Learning and Natural Language Processing
Angel Gabaldon
No ratings yet
Python-2
No ratings yet
Python-2
18 pages
Saurabh mgnm801 Ca2
No ratings yet
Saurabh mgnm801 Ca2
13 pages
Designing Machine Learning Systems with Python
From Everand
Designing Machine Learning Systems with Python
David Julian
No ratings yet
Python For Data Analysis
No ratings yet
Python For Data Analysis
49 pages
PYTHON FOR DATA ANALYTICS: Mastering Python for Comprehensive Data Analysis and Insights (2023 Guide for Beginners)
From Everand
PYTHON FOR DATA ANALYTICS: Mastering Python for Comprehensive Data Analysis and Insights (2023 Guide for Beginners)
Waldo Todd
No ratings yet
Introduction to Data Science Using R
From Everand
Introduction to Data Science Using R
Prema Alla
No ratings yet
unit 4
No ratings yet
unit 4
105 pages
dsbda Unit4
No ratings yet
dsbda Unit4
110 pages
Data Science with Jupyter: Master Data Science skills with easy-to-follow Python examples
From Everand
Data Science with Jupyter: Master Data Science skills with easy-to-follow Python examples
Prateek Gupta
No ratings yet
Data Science Fusion: Integrating Maths, Python, and Machine Learning
From Everand
Data Science Fusion: Integrating Maths, Python, and Machine Learning
NIBEDITA Sahu
No ratings yet
Hands-on NumPy for Numerical Analysis
From Everand
Hands-on NumPy for Numerical Analysis
Rituraj Dixit
No ratings yet
Anurag008python
No ratings yet
Anurag008python
39 pages
Big Data and Data Science: Analytics for the Future
From Everand
Big Data and Data Science: Analytics for the Future
Dhaanyalakshmi Ahuja
No ratings yet
"Big Data Science" Basic Concepts and Applications
From Everand
"Big Data Science" Basic Concepts and Applications
Sukanta Bhattacharya
No ratings yet
Data Analytics with Python: Data Analytics in Python Using Pandas
From Everand
Data Analytics with Python: Data Analytics in Python Using Pandas
Frank Millstein
3/5 (1)
Python Libraries
No ratings yet
Python Libraries
17 pages
sample project
No ratings yet
sample project
12 pages
Pyhpc2011 Submission 9
No ratings yet
Pyhpc2011 Submission 9
9 pages
Pandas - Data Analysis Paper
No ratings yet
Pandas - Data Analysis Paper
9 pages
Data Science 1
No ratings yet
Data Science 1
3 pages
Data Science Basics
From Everand
Data Science Basics
Zoe Codewell
No ratings yet
Python For Data Science
No ratings yet
Python For Data Science
8 pages
Data Preprocessing and Data Analysis using Python
No ratings yet
Data Preprocessing and Data Analysis using Python
32 pages
Part3 ML
No ratings yet
Part3 ML
201 pages
Active Machine Learning with Python: Refine and elevate data quality over quantity with active learning
From Everand
Active Machine Learning with Python: Refine and elevate data quality over quantity with active learning
Margaux Masson-Forsythe
No ratings yet
Synthetic Data Generation: A Beginner’s Guide
From Everand
Synthetic Data Generation: A Beginner’s Guide
Robert Johnson
No ratings yet
Python Ca22
No ratings yet
Python Ca22
14 pages
Big Data: Statistics, Data Mining, Analytics, And Pattern Learning
From Everand
Big Data: Statistics, Data Mining, Analytics, And Pattern Learning
Rob Botwright
No ratings yet
PYTHON
No ratings yet
PYTHON
11 pages
Auditing The Data Using Python
No ratings yet
Auditing The Data Using Python
4 pages
Bernd Klein Python Data Analysis Letter
No ratings yet
Bernd Klein Python Data Analysis Letter
514 pages
Learn Python: Get Started Now with Our Beginner’s Guide to Coding, Programming, and Understanding Artificial Intelligence in the Fastest-Growing Machine Learning Language
From Everand
Learn Python: Get Started Now with Our Beginner’s Guide to Coding, Programming, and Understanding Artificial Intelligence in the Fastest-Growing Machine Learning Language
Anthony Adams
5/5 (3)
Data Driven Guide for Python Programming : Master Essentials to Advanced Data Structures
From Everand
Data Driven Guide for Python Programming : Master Essentials to Advanced Data Structures
Younes Hamdani
No ratings yet
Smarter Data Science: Succeeding with Enterprise-Grade Data and AI Projects
From Everand
Smarter Data Science: Succeeding with Enterprise-Grade Data and AI Projects
Neal Fishman
No ratings yet
Data Science I: Charles C.N. Wang
No ratings yet
Data Science I: Charles C.N. Wang
68 pages
Building and Operating Data Hubs: Using a practical Framework as Toolset
From Everand
Building and Operating Data Hubs: Using a practical Framework as Toolset
Georg Graner
No ratings yet
Data Analysis With Pandas
No ratings yet
Data Analysis With Pandas
7 pages
Data Science Fundamentals and Practical Approaches: Understand Why Data Science Is the Next (English Edition)
From Everand
Data Science Fundamentals and Practical Approaches: Understand Why Data Science Is the Next (English Edition)
Dr. Gypsy Nandi
No ratings yet
Data Science and Analytics: Transforming Raw Data into Actionable Insights: A Comprehensive Guide
From Everand
Data Science and Analytics: Transforming Raw Data into Actionable Insights: A Comprehensive Guide
Marlowe Reyes
No ratings yet
Data Science Project Ideas for Thesis, Term Paper, and Portfolio
From Everand
Data Science Project Ideas for Thesis, Term Paper, and Portfolio
Zemelak Goraga
No ratings yet
Enterprise Data Science: Smarter Decisions with Big Data
From Everand
Enterprise Data Science: Smarter Decisions with Big Data
Vidhur Gupta
No ratings yet
Data Science Unveiled: A Practical Guide to Key Techniques
From Everand
Data Science Unveiled: A Practical Guide to Key Techniques
Ed A Norex
No ratings yet
ERPANET Case Study: Project Gutenberg
From Everand
ERPANET Case Study: Project Gutenberg
ERPANET
No ratings yet
Knowledge Reasoning: Fundamentals and Applications
From Everand
Knowledge Reasoning: Fundamentals and Applications
Fouad Sabry
No ratings yet
Data Visualization
No ratings yet
Data Visualization
25 pages
Qaslunchpresentation
No ratings yet
Qaslunchpresentation
25 pages
Seanjordan Philipgraff Posterboard
No ratings yet
Seanjordan Philipgraff Posterboard
1 page
Research Proposal and Timeline of PD
No ratings yet
Research Proposal and Timeline of PD
11 pages
Decemberpresentation
No ratings yet
Decemberpresentation
9 pages
Researchproposal
No ratings yet
Researchproposal
6 pages
Mini Interviews
No ratings yet
Mini Interviews
5 pages
Transcribedinterview
No ratings yet
Transcribedinterview
11 pages
Dataanalysis
No ratings yet
Dataanalysis
2 pages
Synthesispaper
No ratings yet
Synthesispaper
19 pages
Datacollection
No ratings yet
Datacollection
4 pages
4th Quarter Presentation
No ratings yet
4th Quarter Presentation
19 pages
Marketing Presentation
No ratings yet
Marketing Presentation
13 pages
Hypothesisassignment
No ratings yet
Hypothesisassignment
2 pages
Practical aspect of Robot Design, Control and Application of AI
No ratings yet
Practical aspect of Robot Design, Control and Application of AI
68 pages
Get Artificial Intelligence in Urban Planning and Design: Technologies, Implementation, and Impacts 1st Edition Imdat As (Editor) - eBook PDF free all chapters
100% (4)
Get Artificial Intelligence in Urban Planning and Design: Technologies, Implementation, and Impacts 1st Edition Imdat As (Editor) - eBook PDF free all chapters
59 pages
Okok Projects 2023
No ratings yet
Okok Projects 2023
45 pages
Machine and Deep Learning Algorithms and Applications
No ratings yet
Machine and Deep Learning Algorithms and Applications
123 pages
Machine Learning And Deep Learning In Efficacy Improvement Of Healthcare Systems Om Prakash Jena pdf download
100% (1)
Machine Learning And Deep Learning In Efficacy Improvement Of Healthcare Systems Om Prakash Jena pdf download
83 pages
Machine Learning Foundations: Supervised, Unsupervised, and Advanced Learning Taeho Jo - Download the ebook now to never miss important information
100% (2)
Machine Learning Foundations: Supervised, Unsupervised, and Advanced Learning Taeho Jo - Download the ebook now to never miss important information
70 pages
Ai Board Paper
No ratings yet
Ai Board Paper
6 pages
JzA1O0_56. Nasscom Mega Quiz PDF 02_NASSCOM Skill Enhancement Course_class Note PDF
No ratings yet
JzA1O0_56. Nasscom Mega Quiz PDF 02_NASSCOM Skill Enhancement Course_class Note PDF
7 pages
Module 2
No ratings yet
Module 2
28 pages
Soft Computing
No ratings yet
Soft Computing
9 pages
Dav Report
No ratings yet
Dav Report
17 pages
Thesis Seismic Interpretation
100% (3)
Thesis Seismic Interpretation
6 pages
(Ebook) Deep Generative Modeling by Jakub M. Tomczak ISBN 9783030931575, 3030931579 2024 Scribd Download
100% (8)
(Ebook) Deep Generative Modeling by Jakub M. Tomczak ISBN 9783030931575, 3030931579 2024 Scribd Download
81 pages
Zeroshot Fewshot (Concepts)
No ratings yet
Zeroshot Fewshot (Concepts)
5 pages
Unit-5 DS Notes
No ratings yet
Unit-5 DS Notes
19 pages
Part B Unit 2 AI Project Cycle
No ratings yet
Part B Unit 2 AI Project Cycle
25 pages
Machine Learning Foundations: Supervised, Unsupervised, and Advanced Learning Taeho Jo instant download
No ratings yet
Machine Learning Foundations: Supervised, Unsupervised, and Advanced Learning Taeho Jo instant download
55 pages
BigData&Analytics Module5
No ratings yet
BigData&Analytics Module5
21 pages
Understanding Deep Learning
No ratings yet
Understanding Deep Learning
782 pages
IoT module 5 Notes
No ratings yet
IoT module 5 Notes
6 pages
6 - CSE3013 - Learning Systems
No ratings yet
6 - CSE3013 - Learning Systems
42 pages
ML Lab Manual CSE (1)
No ratings yet
ML Lab Manual CSE (1)
50 pages
Unit III
No ratings yet
Unit III
19 pages
Theobald O. Machine Learning With Python 2024
No ratings yet
Theobald O. Machine Learning With Python 2024
146 pages
Lec 1: DNSC 6314
No ratings yet
Lec 1: DNSC 6314
47 pages
Aiml Mca
100% (1)
Aiml Mca
38 pages
Final Doc Genomics
No ratings yet
Final Doc Genomics
30 pages
Ml Lab Manual(Vim)
No ratings yet
Ml Lab Manual(Vim)
13 pages
Sat - 90.Pdf - Prediction of Bank Customer Churn Using Machine Learning Technique
No ratings yet
Sat - 90.Pdf - Prediction of Bank Customer Churn Using Machine Learning Technique
11 pages
7th_Semester_Progress_presentation
No ratings yet
7th_Semester_Progress_presentation
16 pages