Unit 1 - Data Scientist Tool Box

Data science

Uploaded by

BhaskarHosur

UNIT – 1: DATA TO ACTIONABLE KNOWLEDGE BY TOOLS OF DATA SCIENCE
- BHASKAR H S
WHY IS DATA IMPORTANT?
 Data helps you make better decisions: data is synonymous with
knowledge and when you use data to back your decisions, you avoid
assumptions, mistakes, and bias, helping your decisions be better overall.
 Data helps you avoid problems later on: if you’re constantly collecting
data, you’ll be able to monitor how things are working and solve issues on
the fly while they’re still minor instead of waiting for them to become
major.
 Data helps back you up: if you want to propose a change or adjustment
in your company, you’ll have to explain why. And there’s no better way to
prove a point than with numbers and data that clearly back your ideas up.
 Data helps you achieve your goals: the best-designed strategies are those with data behind them to properly evaluate their success; by using data, you'll be able to clearly see what's working, what needs tweaking, and what isn't working.
 Data allows you to be strategic: with clear answers as to what's working, where your money is going, and what clients like, you'll be able to be more strategic with your planning and decision-making, saving time and resources across the board.
WHAT IS RAW DATA?
 Understanding the process of turning raw data into actionable
insights is only possible if you fully comprehend the
concept of raw data and what it is.
 Raw data is the data you've collected before you start cleaning, analyzing, or organizing it. It refers to the entire set of data, regardless of whether it was collected from various sources, and can take practically any form: databases, spreadsheets, images, videos, survey results, and more.
 Although it may seem like any data could prove to be useful in
some capacity, raw data isn’t a random compilation of
information.
 On the contrary, skilled data professionals know how to collect
raw data that will be useful later on.
WHY ARE ACTIONABLE
INSIGHTS SO IMPORTANT?
 Actionable insights allow you to better understand your clients: at the
end of the day, you’re trying to sell a product or service to a customer and as
competition grows practically daily, unhappy customers will simply turn to the
next option. With actionable insights as to why clients are leaving your company
and looking to fill their needs elsewhere, you’ll be able to address those specific
problems and hopefully convince more clients to remain loyal in the future.
 Actionable insights help you stay ahead of the competition: we mentioned competition is popping up left, right, and center, and it's true: that's why you need to be on top of what your competitors are doing and make sure you're offering comparable services or better alternatives to help foster customer loyalty.
 Actionable insights help you grow: if you don’t know where your problem
areas are or what actions you’re taking that are working well, how will you
improve your business strategy? There are so many different moving parts in a
company that it can be almost impossible to truly know what is working well and
what needs to be adjusted. With actionable insights, you’ll have that answer
clearly defined.
DATA TO ACTIONABLE
INSIGHTS
THE MAIN STAGES OF
TURNING DATA INTO
ACTIONABLE INSIGHTS
 Data collection to determine what data is required for analysis and to set up data collection processes.
 Data cleaning and processing to remove errors, missing values, and invalid entries from the data.
 Data storage in systems that keep the cleaned data accessible and secure.
 Data visualization using tools like graphs or charts to make the data clearer.
 Data analysis using statistical methods, machine learning, research methods, etc.
 Decision-making: using the analytical results to make decisions and recommendations.
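As an illustration, the stages above can be sketched in a few lines of Python; the survey data, field names, and decision threshold here are purely hypothetical:

```python
import statistics

# Hypothetical raw survey data: (customer_id, satisfaction_score).
# Missing or out-of-range scores simulate errors from collection.
raw_records = [
    ("c1", 8), ("c2", None), ("c3", 10), ("c4", -1), ("c5", 7), ("c6", 9),
]

def clean(records):
    """Data cleaning: drop records with missing or out-of-range scores (0-10)."""
    return [(cid, s) for cid, s in records if s is not None and 0 <= s <= 10]

def analyze(records):
    """Data analysis: summarize the scores with simple statistics."""
    scores = [s for _, s in records]
    return {"n": len(scores), "mean": statistics.mean(scores), "max": max(scores)}

def decide(summary):
    """Decision-making: turn the summary into a recommendation."""
    return "keep strategy" if summary["mean"] >= 7 else "adjust strategy"

summary = analyze(clean(raw_records))
print(summary)          # {'n': 4, 'mean': 8.5, 'max': 10}
print(decide(summary))  # keep strategy
```

In a real pipeline each stage would be far richer, but the shape, collect, clean, analyze, decide, stays the same.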
TURNING DATA INTO
ACTIONABLE INSIGHTS
 Step 1: Define Your Goals
 Step 2: Define What You Need to Know
 Step 3: Create a Data Map
 Step 4: Store and Process the Data
 Step 5: Translating, Visualising and Reporting
 Step 6: Turn Insight into Action
DATA COLLECTION
 When beginning your data collection process, make sure you follow these
three steps:
 Clearly define your ideal outcome: what are you looking to achieve
with your data analysis? Knowing what your goals are will help you
organize your data collection process properly and ensure your data set is
actually useful.
 Choose your data: If you’re looking for business statistics, you’ll probably
need to turn to financial reports or market research for valuable data. On
the other hand, if you’re looking to improve the overall client experience,
customer surveys may be your best bet.
 Collect your data: with your goals and methods determined, get started
collecting your data. You’ll probably end up with loads of data, but you’ll be
able to sift through the important points later on in the data analysis
process.
TURNING RAW DATA INTO CLEAN DATA
 To make your data set useful, you’ll need to follow a few steps to make sure there are
no errors or issues that will make your outcomes incorrect.
 Prepare the data: during this initial step, you’ll check for any errors or invalid values
and ensure that all data is in the same format (if you’ve collected data from various
sources, it may take a bit to unify all values).
 Translate the data: here, data translation means making sure the data is readable by the machine that will process it. If your data was collected online, you'll have an easier time than if it was collected manually, but double-check that your file format is correct before diving in.
 Process the data: the data is run through processing routines, often statistical methods or machine learning algorithms, that are tuned to make the most of the data. Here, patterns, trends, relationships, and problem areas are highlighted.
 Visualize the data: you’ll be able to organize and display your clean,
understandable data set in a variety of formats. Think about what you’re trying to
portray and make sure you pick the visualization method that is right for your exact
situation.
 Store the data: you need to respect local and international privacy regulations,
properly storing and securing the data you used in your analysis. Ensure your
company’s storage policies are in line with industry standards and also explore cloud
storage options.
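A minimal sketch of the preparation and translation steps above, assuming hypothetical records from two sources with inconsistent date and amount formats:

```python
from datetime import datetime

# Hypothetical records collected from two sources: dates in different
# formats and amounts as strings. The goal is a single, unified schema.
raw = [
    {"date": "2023-01-15", "amount": "120.50"},   # from a web form
    {"date": "15/01/2023", "amount": "80"},       # from a manual spreadsheet
    {"date": "2023-02-01", "amount": "n/a"},      # invalid value
]

def clean_record(rec):
    """Return a unified record, or None if the record is invalid."""
    for fmt in ("%Y-%m-%d", "%d/%m/%Y"):          # try each known date format
        try:
            date = datetime.strptime(rec["date"], fmt).date()
            break
        except ValueError:
            continue
    else:
        return None                               # unrecognized date, drop it
    try:
        amount = float(rec["amount"])
    except ValueError:
        return None                               # invalid amount, drop it
    return {"date": date.isoformat(), "amount": amount}

cleaned = [r for r in (clean_record(rec) for rec in raw) if r is not None]
print(cleaned)
# [{'date': '2023-01-15', 'amount': 120.5}, {'date': '2023-01-15', 'amount': 80.0}]
```

The invalid third record is dropped and the remaining two end up in one consistent format, ready for processing and visualization.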
WHAT IS DATA
ANALYTICS?
 Data analytics, also known as data analysis, is a
crucial component of modern business
operations. It involves examining datasets to
uncover useful information that can be used to make
informed decisions.
 This process is used across industries to optimize
performance, improve decision-making, and gain a
competitive edge.
DATA ANALYTICS
 Data Analytics is a systematic approach that transforms raw
data into valuable insights.
 This process encompasses a suite of technologies and tools
that facilitate data collection, cleaning, transformation, and
modelling, ultimately yielding actionable information.
 This information serves as a robust support system for
decision-making. Data analysis plays a pivotal role in
business growth and performance optimization.
 It aids in enhancing decision-making processes, bolstering risk
management strategies, and enriching customer experiences.
By presenting statistical summaries, data analytics provides a
concise overview of quantitative data.
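For instance, the statistical summaries mentioned above can be produced with Python's standard `statistics` module; the monthly sales figures are made up for illustration:

```python
import statistics

# Hypothetical monthly sales figures for a half year.
monthly_sales = [12_000, 13_500, 11_800, 14_200, 15_100, 13_900]

# A concise statistical overview of the quantitative data.
summary = {
    "mean": statistics.mean(monthly_sales),
    "median": statistics.median(monthly_sales),
    "stdev": round(statistics.stdev(monthly_sales), 1),
}
print(summary)
```

Even a summary this small already supports decisions: the mean and median show the typical month, while the standard deviation shows how volatile sales are.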
POPULAR TOOLS FOR DATA ANALYTICS
 Data Analytics is an important aspect of many organizations nowadays. Real-time data analytics is essential for the success of a major organization and helps drive decision-making.
 There is a myriad of data analytics tools that help us extract important information from the given data. We can use some of these free and open-source tools even without any coding knowledge. These tools are used for deriving useful insights from the given data without too much effort. For example, you could use them to determine the better of two cricket players based on various statistics and yardsticks. They have helped strengthen the decision-making process by providing useful information that can lead to better conclusions.
SOME OF THE MOST
POPULAR TOOLS ARE:
 SAS
 Microsoft Excel
 R
 Python
 Tableau
 RapidMiner
 KNIME
 Git
 GitHub
SAS
 SAS is a software suite and programming language developed by the SAS Institute for performing advanced analytics, multivariate analyses, business intelligence, data management and predictive analytics.
 It is proprietary software written in C and its software suite contains more than 200 components.
 Its programming language is considered high level, making it easier to learn. However, SAS was developed for very specific uses, and powerful tools are not added every day to the already extensive collection, making it less scalable for certain applications. It does, however, boast that it can analyze data from various sources and can also write the results directly into an Excel spreadsheet.
MICROSOFT EXCEL
 It is an important spreadsheet application that can be useful for recording expenses, charting data, performing easy manipulation and lookup, and generating pivot tables to provide summarized reports of large datasets that contain significant findings.
 It is written in C#, C++ and the .NET Framework, and its stable version was released in 2016. It includes a macro programming language, Visual Basic for Applications (VBA), for developing applications.
 It has various built-in functions to satisfy the various statistical, financial
and engineering needs. It is the industry standard for spreadsheet
applications.
 It is also used by companies to perform real-time manipulation of data
collected from external sources such as stock market feeds and perform
the updates in real-time to maintain a consistent view of data.
 It is relatively less suited to performing complex analyses of data when compared to tools such as R or Python, but it remains a common tool among financial analysts and sales managers for solving business problems.
R
 It is one of the leading programming languages for performing complex
statistical computations and graphics.
 It is a free and open-source language that can be run on various UNIX
platforms, Windows and MacOS.
 It also has a command-line interface which is easy to use. It is very useful for building statistical software and for performing complex analyses.
 It has more than 11,000 packages, and we can browse the packages category-wise. These packages can also be combined with Big Data, the catalyst which has transformed various organizations' views on unstructured data.
 R also provides the tools required to install packages as per user requirements, which makes setting up convenient.
PYTHON
 It is a powerful high-level programming language that is used for general
purpose programming.
 Python supports both structured and functional programming methods. Its extensive collection of libraries makes it very useful in data analysis.
 Knowledge of TensorFlow, Theano, Keras, Matplotlib and Scikit-learn can get you a lot closer to your dream of becoming a machine learning engineer. Everything in Python is an object, and this attribute makes it highly popular among developers.
 It is easier to learn than R and can be integrated with platforms such as SQL Server, a MongoDB database or JSON (JavaScript Object Notation). It is very useful for big data analysis, can be used to extract data from the web, and handles text data very well.
 Some of the companies that use Python for data analytics include Instagram, Facebook, Spotify and Amazon.
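As a small illustration of the text-handling strengths mentioned above, using only the standard library (the feedback text is invented):

```python
from collections import Counter

# Hypothetical raw customer feedback collected as free text.
feedback = (
    "great product, great support. "
    "delivery was slow but the product itself is great"
)

# Normalize the text and count the most frequent words.
words = [w.strip(".,").lower() for w in feedback.split()]
top = Counter(words).most_common(2)
print(top)  # [('great', 3), ('product', 2)]
```

A real text analysis would filter out stop words and use proper tokenization, but even this sketch turns unstructured text into a countable signal.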
TABLEAU
 Tableau Public is free software developed by the public company “Tableau
Software” that allows users to connect to any spreadsheet or file and
create interactive data visualizations.
 It can also be used to create maps and dashboards with real-time updates for easy presentation on the web.
 The results can be shared through social media sites or directly with the
client making it very convenient to use. The resultant files can also be
downloaded in different formats.
 This software can connect to any type of data source, be it a data
warehouse or an Excel application or some sort of web-based data.
 Approximately 446 companies use this software for operational purposes
and some of the companies that are currently using this software include
SoFi, The Sentinel and Visa.
RAPIDMINER
 RapidMiner is an extremely versatile data science platform
developed by “RapidMiner Inc”. The software emphasizes
lightning fast data science capabilities and provides an
integrated environment for preparation of data and application
of machine learning, deep learning, text mining and predictive
analytical techniques.
 It can also work with many data source types including Access, SQL, Excel, Teradata, Sybase, Oracle, MySQL and dBase. Here we can control the data sets and formats for predictive analysis.
 Approximately 774 companies use RapidMiner and most of these are US-based. Some of the esteemed companies on that list include the Boston Consulting Group and Domino's Pizza Inc.
KNIME
 KNIME, the Konstanz Information Miner, is a free and open-source data analytics software.
 It is also used as a reporting and integration platform. It integrates various components for machine learning and data mining through modular data pipelining.
 It is written in Java and developed by KNIME.com AG. It can be
operated in various operating systems such as Linux, OS X and
Windows.
 More than 500 companies are currently using this software for
operational purposes and some of them include Aptus Data
Labs and Continental AG.
GIT FOR DATA SCIENCE
 Git is a version control system designed to track changes in source code over time.
 When many people work on the same project without a version control system, it's total chaos.
 Resolving the eventual conflicts becomes impossible as no one has kept track of their changes, and it becomes very hard to merge them into a single central truth. Git and higher-level services built on top of it (like GitHub) offer tools to overcome this problem.
 Usually, there is a single central repository (called "origin" or
"remote") which the individual users will clone to their local
machine (called "local" or "clone"). Once the users have saved
meaningful work (called "commits"), they will send it back
("push" and "merge") to the central repository.
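The clone, commit and push cycle described above looks roughly like this on the command line; to keep the sketch self-contained, a local bare repository stands in for the central "origin" (names and the file are placeholders):

```shell
cd "$(mktemp -d)"                      # scratch directory for the demo
git init -q --bare origin.git          # stand-in for the central repository ("origin")
git clone -q origin.git project        # clone: create a local copy
cd project
git config user.name "Alice"
git config user.email "alice@example.com"
echo "meaningful work" > analysis.txt
git add analysis.txt                   # stage the change
git commit -q -m "Add analysis notes"  # commit: save a snapshot locally
git push -q origin HEAD                # push: send the commit back to origin
```

In practice the clone URL would point at a real remote such as a GitHub repository, and teammates would pull the pushed commits from there.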
WHAT IS VERSION CONTROL?
 Version control, also known as source control, is the practice of
tracking and managing changes to software code.
 Version control systems are software tools that help software
teams manage changes to source code over time. As
development environments have accelerated, version control
systems help software teams work faster and smarter. They are
especially useful for DevOps teams since they help them to
reduce development time and increase successful deployments.
 Version control software keeps track of every modification to the
code in a special kind of database. If a mistake is made,
developers can turn back the clock and compare earlier
versions of the code to help fix the mistake while minimizing
disruption to all team members.
WHAT IS THE DIFFERENCE
BETWEEN GIT & GITHUB?
 Git is the underlying technology and its command-line client (CLI) for tracking and merging changes in source code.
 GitHub is a web platform built on top of Git technology to make collaboration easier.
 It also offers additional features like user management, pull requests, and automation. Other alternatives include, for example, GitLab and Bitbucket.
TERMINOLOGY
 Repository - "Database" of all the branches and commits of a single project.
 Branch - An alternative state or line of development for a repository.
 Merge - Combining two (or more) branches into a single branch, a single truth.
 Clone - Creating a local copy of the remote repository.
 Origin - Common alias for the remote repository which the local clone was created from.
 Main / Master - Common name for the root branch, which is the central source of truth.
 Stage - Choosing which files will be part of the new commit.
 Commit - A saved snapshot of staged changes made to the file(s) in the repository.
 HEAD - Shorthand for the commit your local repository is currently on.
 Push - Sending your changes to the remote repository for everyone to see.
 Pull - Getting everybody else's changes into your local repository.
 Pull Request - A mechanism to review and approve your changes before merging to main/master.
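Several of these terms map directly onto Git commands. A typical branch-and-merge sequence might look like this (the repository, file, and branch name `feature-x` are placeholders; pull and pull requests need a remote or a platform like GitHub, so they are not shown):

```shell
cd "$(mktemp -d)" && git init -q repo && cd repo   # scratch repository for the demo
git config user.name "Alice"
git config user.email "alice@example.com"
git checkout -q -b main                 # main: the root branch, the source of truth
echo "v1" > report.txt
git add report.txt                      # stage the file
git commit -q -m "Initial report"       # commit: save a snapshot
git checkout -q -b feature-x            # branch: an alternative line of development
echo "v2" > report.txt
git add report.txt
git commit -q -m "Improve report"
git checkout -q main
git merge -q feature-x                  # merge: fold the branch back into main
```

After the merge, `main` contains the work done on `feature-x`; on a platform like GitHub the merge would usually be gated behind a pull request.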
MARKDOWN
 Markdown is a text-to-HTML conversion tool for web
writers.
 Markdown allows you to write using an easy-to-read,
easy-to-write plain text format, then convert it to
structurally valid XHTML (or HTML).
 Thus, “Markdown” is two things:
 (1) a plain text formatting syntax; and
 (2) a software tool, written in Perl, that converts the
plain text formatting to HTML
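A short example of the formatting syntax; a typical converter would turn the heading into an `<h1>` element, the emphasis into `<em>`, the link into `<a>`, and the list into a `<ul>`:

```markdown
# Heading

A paragraph with *emphasis* and a [link](https://example.com).

- item one
- item two
```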
