Introduction - IT Skills

Why Reporting & Analyzing Data?

• The amount of data stored in databases is growing exponentially, and databases are now measured in gigabytes (GB) and terabytes (TB).
• However, raw data alone does not provide useful information.
• In today’s highly competitive business environment, companies need to turn
these terabytes of raw data into some useful information.
• The general methods of analysis/reporting can be broadly classified into two
categories: non-parametric analysis & parametric analysis
• Example: Managers will generally be more interested in actual data and non-parametric analysis results, while engineers will be more concerned with parametric analysis.
What is Business Intelligence?
• BI technologies provide historical, current and predictive views of business
operations.
• Common functions of business intelligence technologies include reporting, online
analytical processing, analytics, data mining, process mining, business
performance management, text mining, predictive analytics and prescriptive
analytics.
• BI technologies can handle large amounts of structured and sometimes
unstructured data to help businesses identify and develop new strategic
business opportunities.
• Identifying new opportunities and implementing an effective strategy based on insights can provide businesses with a competitive market advantage and long-term stability.
Business Intelligence (Cont..)
Business Intelligence (Cont..)
• Business intelligence (BI) comprises the strategies and technologies used by
enterprises for the analysis of business information.
• BI tools access and analyze data sets and present analytical findings in reports,
summaries, dashboards, graphs, charts and maps to provide users with
detailed intelligence about the state of the business.
• Typical BI infrastructure components are as follows:
• A software solution for gathering, cleansing, integrating, analyzing and sharing data.
• It produces analyses and provides reliable information to help make
effective and high-quality business decisions.
Business Intelligence (Cont..)
▪ The most common kinds of business intelligence systems are:
• MIS - Management Information Systems

• CRM - Customer Relationship Management

• EIS - Executive Information Systems

• DSS - Decision Support Systems

• GIS - Geographic Information Systems

• OLAP - Online Analytical Processing


Core competencies of a data scientist
A data scientist requires a vast range of skills to perform the required
tasks.
Most of the time, data scientists work in a team to provide the best results,
○ for example, someone who is good at gathering data might team up with an analyst and someone gifted in
presenting information.
○ It would be hard to find a single person with all the required skills.
Below are the areas in which a data scientist could find opportunities:
○ Data Capture:
■ Managing data sources (e.g. databases, Excel, PDF, text, etc.)
■ Converting the unstructured data to structured data.
○ Analysis :
■ Knowledge of basic statistical tools.
■ Use of specialized math tricks and algorithms.
○ Presentations :
■ Provide graphical presentations of the pattern.
■ Represent the results of the data analysis to the end users.
Creating the Data Science Pipeline
The data science pipeline requires the data scientist to follow particular steps in the preparation, analysis and
presentation of the data.

The general steps in the pipeline are:


○ Preparing the data
■ The data we access from various sources may not come directly in a structured format.
■ We need to transform the data into a structured format.
■ Transformation may require changing data types, the order in which data appears, and even the creation
of missing data (a minimal pandas sketch follows this list).
○ Performing data analysis
■ Results of the data analysis should be provable and consistent.
■ Sometimes a single approach may not provide the desired output, and we need to use multiple algorithms
to get the result.
■ The use of trial and error is part of the data science art.
○ Learning from data
■ As we iterate through various statistical analysis methods and apply algorithms to detect patterns, we
begin learning from the data.
■ The data might not tell the story that you originally thought it would.
○ Visualizing
○ Obtaining insights
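
For illustration, here is a minimal pandas sketch of the preparation step described above; the column names and values are hypothetical, not from any real dataset:

import pandas as pd

# Hypothetical raw records: dates as text, sales as strings with a gap.
raw = pd.DataFrame({
    "date": ["2023-01-03", "2023-01-01", "2023-01-02"],
    "sales": ["250", "100", None],
})

# Change data types: parse the dates and make the sales numeric.
raw["date"] = pd.to_datetime(raw["date"])
raw["sales"] = pd.to_numeric(raw["sales"])

# Create missing data: fill the gap with the column mean.
raw["sales"] = raw["sales"].fillna(raw["sales"].mean())

# Change the order in which data appears: sort chronologically.
prepared = raw.sort_values("date").reset_index(drop=True)
print(prepared)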
Why Python?
Python is the vision of a single person, Guido van Rossum, who started the language in
December 1989 as a replacement for the ABC language.
However, Python far exceeds ABC in its ability to create applications of all types and, in
contrast to ABC, boasts four programming styles (programming paradigms), all shown on the same task in the sketch after this list:
○ Functional :
■ Treats every statement as a mathematical equation and avoids any form of state or mutable data
■ The main advantage of this approach is having no side effects to consider.
■ This coding style lends itself better than the others to parallel processing because there is no state to consider.
■ Many developers prefer this coding style for recursion and for lambda calculus.
○ Imperative :
■ Performs computations as a direct change to program state.
■ This style is especially useful when manipulating data structures and produces elegant but simple code.
○ Object-oriented :
■ Relies on data fields that are treated as objects and manipulated only through prescribed methods.
■ Python doesn’t fully support this coding form because it can’t implement features such as data hiding.
■ This is a useful coding style for complex applications because it supports encapsulation and polymorphism.
○ Procedural :
■ Treats tasks as step-by-step iterations where common tasks are placed in functions that are called as needed.
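
As a small illustration (not part of the original slides), the same task, summing the squares of a list, written in each of the four styles:

from functools import reduce

numbers = [1, 2, 3, 4]

# Functional: a reduction with no mutable state and no side effects.
functional_total = reduce(lambda acc, n: acc + n * n, numbers, 0)

# Imperative: compute by directly changing program state (the accumulator).
imperative_total = 0
for n in numbers:
    imperative_total += n * n

# Procedural: place the common task in a function and call it as needed.
def sum_of_squares(values):
    total = 0
    for v in values:
        total += v * v
    return total

procedural_total = sum_of_squares(numbers)

# Object-oriented: data fields treated as an object, manipulated via methods.
class SquareSummer:
    def __init__(self, values):
        # Python has no true data hiding; the underscore is only a convention.
        self._values = values

    def total(self):
        return sum(v * v for v in self._values)

oo_total = SquareSummer(numbers).total()

print(functional_total, imperative_total, procedural_total, oo_total)  # 30 30 30 30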
Understanding Python's Role in Data Science
Python is easy to use when it comes to quantitative and analytical
computing.
In data science, Python is widely used and is a favorite tool, being a flexible and
open-source language.
Its massive libraries are used for data manipulation and are very easy to learn, even
for a beginner data analyst.
Apart from being platform independent, it also integrates easily with any existing
infrastructure, which can be used to solve the most complex problems.
Python is preferred over other data science tools because of following features,
○ Powerful and Easy to use
○ Open Source
○ Choice of Libraries
○ Flexibility
○ Visualization and Graphics
○ Well supported
Considering Speed of Execution
Analysis takes considerable processing power.
Datasets can be so large that they bog down even an incredibly powerful system.
The following factors control the speed of execution for a data science application (a small timing sketch follows this list):
○ Dataset Size
○ Loading Technique
○ Coding Style
○ Machine capabilities
○ Analysis Algorithm
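
For example, a minimal timing sketch (results depend entirely on the machine) showing how coding style alone affects speed, comparing a plain Python loop with a vectorized NumPy call:

import timeit
import numpy as np

data = list(range(1_000_000))
array = np.arange(1_000_000)

# Coding style 1: an interpreted Python loop, one element at a time.
loop_seconds = timeit.timeit(lambda: sum(x * x for x in data), number=10)

# Coding style 2: one vectorized NumPy call that runs in compiled code.
vector_seconds = timeit.timeit(lambda: np.dot(array, array), number=10)

print(f"loop: {loop_seconds:.3f}s, vectorized: {vector_seconds:.3f}s")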
Using the Python Ecosystem for Data Analytics
We need to load certain libraries in order to perform specific data science tasks in
Python.
Following is a list of the libraries used in data analytics:
1. Performing fundamental scientific computing using NumPy
2. Performing data analysis using pandas
3. Plotting the data using matplotlib
4. Accessing scientific tools using SciPy
5. Implementing machine learning using Scikit-learn
6. Going for deep learning with Keras and TensorFlow
7. Creating graphs with NetworkX
8. Parsing HTML documents using Beautiful Soup
1) NumPy
NumPy is used to perform fundamental scientific computing.
NumPy library provides the means for performing n-dimensional array manipulation,
which is critical for data science work.
NumPy provides functions that include support for linear algebra, Fourier
transformation, random-number generation and much more.
Explore listing of functions at https://numpy.org/doc/stable/reference/routines.html
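
A minimal sketch of these capabilities; the matrices and signal below are illustrative only:

import numpy as np

# n-dimensional array manipulation: a 2 x 3 array of random numbers.
rng = np.random.default_rng(seed=42)
matrix = rng.normal(size=(2, 3))

# Linear algebra: solve the system A x = b.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([9.0, 8.0])
x = np.linalg.solve(A, b)              # -> array([2., 3.])

# Fourier transformation of a short sine signal.
signal = np.sin(np.linspace(0, 2 * np.pi, 8))
spectrum = np.fft.fft(signal)

print(matrix.shape, x, spectrum.shape)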
2) pandas
pandas is a fast, powerful, flexible and easy to use open source data analysis and
manipulation tool, built on top of the Python programming language.
It offers data structures and operations for manipulating numerical tables and time
series.
The library is optimized to perform data science tasks quickly and efficiently.
The basic principle behind pandas is to provide data analysis and modelling support
for Python that is similar to other languages such as R.
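
A minimal sketch of pandas working on a numerical table indexed as a time series; the data is made up for illustration:

import pandas as pd

# A small numerical table indexed as a daily time series.
dates = pd.date_range("2023-01-01", periods=4, freq="D")
df = pd.DataFrame({"visits": [120, 150, 90, 200]}, index=dates)

print(df.describe())            # summary statistics for the table
print(df[df["visits"] > 100])   # filter rows where visits exceed 100
print(df.resample("2D").sum())  # time-series resampling into 2-day buckets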
3) matplotlib
The matplotlib library gives a MATLAB-like interface for creating data presentations
of the analysis.
The library was initially limited to 2-D output, but it still provides the means to express
analyses graphically.
Without this library, we could not create output that people outside the data science
community can easily understand.
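
A minimal sketch of the MATLAB-like interface; simple commands build the figure step by step:

import numpy as np
import matplotlib.pyplot as plt

# Plot two labeled curves on one set of axes.
x = np.linspace(0, 10, 100)
plt.plot(x, np.sin(x), label="sin(x)")
plt.plot(x, np.cos(x), label="cos(x)")
plt.xlabel("x")
plt.ylabel("value")
plt.title("A simple 2-D presentation")
plt.legend()
plt.show()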
4) SciPy
The SciPy stack contains a host of other libraries that we can also download
separately.
These libraries provide support for mathematics, science and engineering.
When we obtain SciPy, we get a set of libraries designed to work together to create
applications of various sorts (a small sketch follows this list). These libraries include:
○ NumPy
○ pandas
○ matplotlib
○ Jupyter
○ SymPy
○ etc.
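
A minimal sketch using the SciPy library itself, covering numerical integration and a basic statistical test; the sample values are illustrative:

import numpy as np
from scipy import integrate, stats

# Numerical integration: area under sin(x) from 0 to pi (exactly 2).
area, abs_error = integrate.quad(np.sin, 0, np.pi)

# Statistics: a two-sample t-test on made-up measurements.
sample_a = np.array([5.1, 4.9, 5.3, 5.0])
sample_b = np.array([4.2, 4.4, 4.1, 4.3])
t_stat, p_value = stats.ttest_ind(sample_a, sample_b)

print(round(area, 6), round(p_value, 4))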
5) Scikit-learn
The Scikit-learn library is one of many Scikit libraries that build on the capabilities
provided by NumPy and SciPy to allow Python developers to perform domain
specific tasks.
The Scikit-learn library focuses on data mining and data analysis; it provides access to
the following sorts of functionality (a classification sketch follows this list):
○ Classification
○ Regression
○ Clustering
○ Dimensionality reduction
○ Model selection
○ Pre-processing
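
A minimal classification sketch using scikit-learn's bundled iris dataset; it touches pre-processing, model fitting and a simple holdout evaluation:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Split the data, then standardize features using only the training set.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

scaler = StandardScaler().fit(X_train)       # pre-processing step
model = LogisticRegression(max_iter=200)     # the classifier
model.fit(scaler.transform(X_train), y_train)

predictions = model.predict(scaler.transform(X_test))
print(accuracy_score(y_test, predictions))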
6) Keras and TensorFlow
Keras is an application programming interface (API) that is used to train deep
learning models.
An API often specifies a model for doing something, but it doesn’t provide an
implementation.
TensorFlow is one implementation of the Keras API; there are other
implementations, such as (a minimal Keras sketch follows this list):
○ Microsoft’s Cognitive Toolkit (CNTK)
○ Theano
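
A minimal Keras sketch defining and compiling a small model; it assumes the tensorflow package is installed, and the layer sizes and the 10-feature binary problem are arbitrary choices for illustration:

from tensorflow import keras

# Define a tiny fully connected model for a 10-feature binary problem.
model = keras.Sequential([
    keras.Input(shape=(10,)),
    keras.layers.Dense(32, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),
])

# Compiling attaches the optimizer and loss used during training.
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
model.summary()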
7) NetworkX
NetworkX is a Python package for the creation, manipulation, and study of the
structure, dynamics, and functions of complex networks (For example GPS setup to
discover routes through city streets).
NetworkX also provides the means to output the resulting analysis in a form that
humans understand.
The main advantage of using NetworkX is that nodes can be anything (including images)
and edges can hold arbitrary data.
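
A minimal sketch of the street-routing idea: nodes are intersections, edges hold a hypothetical distance attribute, and NetworkX discovers the shortest route:

import networkx as nx

# Nodes are intersections; edges hold arbitrary data (here, a distance).
G = nx.Graph()
G.add_edge("A", "B", distance=4)
G.add_edge("B", "C", distance=2)
G.add_edge("A", "C", distance=7)

# Discover the shortest route, weighting edges by the stored distance.
route = nx.shortest_path(G, "A", "C", weight="distance")
print(route)  # ['A', 'B', 'C']: going via B (4 + 2) beats the direct edge (7)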
8) Beautiful Soup
Beautiful Soup is a Python package for parsing HTML and XML documents.
It creates a parse tree for parsed pages that can be used to extract data from HTML,
which is useful for web scraping.
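
A minimal web-scraping sketch; the HTML snippet below stands in for a downloaded page:

from bs4 import BeautifulSoup

# This snippet stands in for a page fetched from the web.
html = """
<html><body>
  <h1>Products</h1>
  <ul>
    <li class="item">Widget</li>
    <li class="item">Gadget</li>
  </ul>
</body></html>
"""

# Build the parse tree, then extract data from it.
soup = BeautifulSoup(html, "html.parser")
names = [li.get_text() for li in soup.find_all("li", class_="item")]
print(names)  # ['Widget', 'Gadget']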
