Introduction to Data Science
Hendro Margono
Data science
• Data Science is a multi-disciplinary field that uses scientific methods,
processes, algorithms and systems to extract knowledge and insights from
structured and unstructured data.
• Data science is the same concept as data mining and big data: "use the most
powerful hardware, the most powerful programming systems, and the most
efficient algorithms to solve problems."
• Data science is a "concept to unify statistics, data analysis, machine learning
and their related methods" in order to "understand and analyze actual
phenomena" with data.
• It employs techniques and theories drawn from many fields within the
context of mathematics, statistics, computer science, and information
science.
Big Data
• Big data is a term applied to data sets whose size or type is beyond
the ability of traditional relational databases to capture, manage and
process the data with low latency.
• Big data has one or more of the following characteristics: high
volume, high velocity or high variety. Artificial intelligence (AI),
mobile, social and the Internet of Things (IoT) are driving data
complexity through new forms and sources of data.
• For example, big data comes from sensors, devices, video/audio,
networks, log files, transactional applications, web, and social media
— much of it generated in real time and at a very large scale.
Big Data
• Big data analytics is the use of advanced analytic techniques against very
large, diverse data sets that include structured, semi-structured and
unstructured data, from different sources, and in different sizes from
terabytes to zettabytes.
• Analysis of big data allows analysts, researchers and business users to
make better and faster decisions using data that was previously
inaccessible or unusable.
• Businesses can use advanced analytics techniques such as text analytics,
machine learning, predictive analytics, data mining, statistics and natural
language processing to gain new insights from previously untapped data
sources independently or together with existing enterprise data.
Data Mining
• Data mining (knowledge discovery from data)
• Extraction of interesting (non-trivial, implicit, previously
unknown and potentially useful) patterns or knowledge from
huge amounts of data
• Data mining: a misnomer?
• Alternative names
• Knowledge discovery (mining) in databases (KDD), knowledge
extraction, data/pattern analysis, data archeology, data
dredging, information harvesting, business intelligence, etc.
Why Data Mining?
• The Explosive Growth of Data: from terabytes to petabytes
• Data collection and data availability
• Automated data collection tools, database systems, Web, computerized
society
• Major sources of abundant data
• Business: Web, e-commerce, transactions, stocks, …
• Science: Remote sensing, bioinformatics, scientific simulation, …
• Society and everyone: news, digital cameras, YouTube
• We are drowning in data, but starving for knowledge!
• “Necessity is the mother of invention”: data mining is the
automated analysis of massive data sets
Why Data Mining?
[Figures: data mining in social media; the term "mining" in journals]
Data Mining Techniques
Association Rule Discovery
• Marketing and sales promotion:
• Let the discovered rule be
{Bagels, … } --> {Potato Chips}
• Potato Chips as the consequent => can be used to determine what
should be done to boost their sales.
• Bagels in the antecedent => can be used to see which products
would be affected if the store stops selling bagels.
• Bagels in the antecedent and Potato Chips in the consequent => can
be used to see which products should be sold with bagels to
promote sales of potato chips.
• Supermarket shelf management
• Inventory management
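A rule such as {Bagels, … } --> {Potato Chips} is usually judged by its support, confidence, and lift. The following is a minimal sketch of those three measures, assuming a small hypothetical list of transactions; the baskets and the function names are illustrative only and do not come from the slides.

```python
# Hypothetical market-basket transactions (illustrative data only).
transactions = [
    {"bagels", "potato chips", "milk"},
    {"bagels", "potato chips"},
    {"bagels", "cream cheese"},
    {"potato chips", "soda"},
    {"bagels", "potato chips", "soda"},
]


def count(itemset, transactions):
    """Number of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t)


def support(itemset, transactions):
    """Fraction of all transactions that contain the itemset."""
    return count(itemset, transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Share of transactions with the antecedent that also hold the consequent."""
    both = set(antecedent) | set(consequent)
    return count(both, transactions) / count(antecedent, transactions)


def lift(antecedent, consequent, transactions):
    """Confidence relative to the consequent's baseline support."""
    return confidence(antecedent, consequent, transactions) / support(
        consequent, transactions
    )


rule_support = support({"bagels", "potato chips"}, transactions)
rule_confidence = confidence({"bagels"}, {"potato chips"}, transactions)
rule_lift = lift({"bagels"}, {"potato chips"}, transactions)
print(rule_support, rule_confidence, rule_lift)
```

A lift above 1 suggests that the antecedent and consequent occur together more often than chance would predict; a lift below 1, as in this toy data, suggests the opposite.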
Clustering
[Figure: clustering example with axes income, education, and age]
K-Means Clustering
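The slides do not spell out the algorithm, so the following is only a rough sketch of the standard k-means loop (Lloyd's algorithm): assign each point to its nearest centroid, then move each centroid to the mean of its points, and repeat until nothing changes. The (age, income) points and the starting centroids are hypothetical.

```python
import math


def kmeans(points, centroids, max_iter=100):
    """Alternate assignment and update steps until the centroids stop moving."""
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(max_iter):
        # Assignment step: index of the nearest centroid for each point.
        labels = [
            min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Update step: each centroid becomes the mean of its assigned points.
        new_centroids = []
        for j, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append(
                    tuple(sum(dim) / len(members) for dim in zip(*members))
                )
            else:
                new_centroids.append(old)  # an empty cluster keeps its centroid
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels, centroids


# Hypothetical customers as (age, income in thousands).
points = [(25, 30), (27, 32), (24, 28), (52, 90), (55, 95), (50, 88)]
# Assumption: start from two of the data points instead of a random choice.
labels, centroids = kmeans(points, centroids=[points[0], points[3]])
print(labels, centroids)
```

In practice the attributes are usually rescaled first, because an attribute with a large numeric range (such as income) would otherwise dominate the distance.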
Classification: Definition
• Given a collection of records (the training set)
• Each record contains a set of attributes; one of the attributes is
the class.
• Find a model for the class attribute as a function of the values
of the other attributes.
• Goal: previously unseen records should be assigned a class
as accurately as possible.
• A test set is used to determine the accuracy of the model.
Usually, the given data set is divided into training and test sets,
with the training set used to build the model and the test set used to
validate it.
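The workflow above can be sketched as follows. This is a minimal illustration, not a recommended method: the records are hypothetical (age and income as attributes, "yes"/"no" as the class), and a 1-nearest-neighbour rule stands in for the model only because it fits in a few lines.

```python
import math
import random

# Hypothetical records: ((age, income in thousands), class label).
records = [
    ((22, 20), "no"), ((25, 24), "no"), ((23, 22), "no"), ((27, 26), "no"),
    ((45, 80), "yes"), ((50, 85), "yes"), ((48, 90), "yes"), ((52, 95), "yes"),
]


def train_test_split(records, test_fraction=0.25, seed=0):
    """Shuffle a copy of the records and hold out a fraction as the test set."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def predict(train, attributes):
    """1-nearest-neighbour: return the class of the closest training record."""
    _, label = min(train, key=lambda rec: math.dist(rec[0], attributes))
    return label


def accuracy(train, test):
    """Fraction of test records whose predicted class matches the true class."""
    correct = sum(predict(train, attrs) == label for attrs, label in test)
    return correct / len(test)


train_set, test_set = train_test_split(records)
test_accuracy = accuracy(train_set, test_set)
print(len(train_set), len(test_set), test_accuracy)
```

The test records play no part in building the model, so the accuracy measured on them estimates how well previously unseen records would be classified.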
Decision Trees
Example:
• A survey was conducted to see which customers were
interested in a new car model
• We want to select customers for an advertising campaign
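To make the example concrete, here is a small sketch of how a one-level decision tree (a "stump") could be learned from survey answers and then used to pick campaign targets. The survey rows, the attribute names, and the prospects are all hypothetical, and Gini impurity is assumed as the split criterion; a full decision tree would repeat the same split search recursively on each branch.

```python
# Hypothetical survey results (illustrative data only).
survey = [
    {"age": 25, "income": 30, "interested": False},
    {"age": 40, "income": 35, "interested": False},
    {"age": 55, "income": 40, "interested": False},
    {"age": 30, "income": 70, "interested": True},
    {"age": 45, "income": 80, "interested": True},
    {"age": 60, "income": 90, "interested": True},
]


def gini(labels):
    """Gini impurity of a list of boolean labels (0.0 means a pure group)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def best_split(rows, attributes, target):
    """Return (weighted impurity, attribute, threshold) of the best binary split."""
    best = None
    for attr in attributes:
        for threshold in sorted({r[attr] for r in rows}):
            left = [r[target] for r in rows if r[attr] <= threshold]
            right = [r[target] for r in rows if r[attr] > threshold]
            if not left or not right:
                continue  # a split must put records on both sides
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if best is None or score < best[0]:
                best = (score, attr, threshold)
    return best


score, attr, threshold = best_split(survey, ["age", "income"], "interested")

# Each leaf predicts the majority answer among the survey rows that reach it.
left = [r["interested"] for r in survey if r[attr] <= threshold]
right = [r["interested"] for r in survey if r[attr] > threshold]
left_label = sum(left) * 2 >= len(left)
right_label = sum(right) * 2 >= len(right)


def predict(customer):
    """Follow the single split and return the leaf's majority label."""
    return right_label if customer[attr] > threshold else left_label


# Hypothetical prospects to screen for the advertising campaign.
prospects = [
    {"name": "A", "age": 35, "income": 75},
    {"name": "B", "age": 50, "income": 32},
]
targets = [c["name"] for c in prospects if predict(c)]
print(attr, threshold, targets)
```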