2. Data Science Tools
Dr.M.Dhurgadevi
Associate Professor
Sri Krishna College of Technology
Coimbatore
Data science tools
Data science tools are used to explore raw,
complex data (structured or unstructured)
and to process, extract, and analyze it,
drawing out valuable insights by applying
techniques from statistics, computer
science, predictive modeling and analysis,
and deep learning.
Statistical analysis techniques
Probability and Statistics
Distribution
Regression analysis
Descriptive statistics
Inferential statistics
Non-Parametric statistics
Hypothesis testing
Linear Regression
Logistic Regression
Neural Networks
K-Means clustering
Decision Trees
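Two of the techniques listed above, descriptive statistics and linear regression, can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the function names and the sample numbers are assumptions made up for the example, not part of any tool named in these slides.

```python
from statistics import mean, median, stdev


def describe(values):
    """Return basic descriptive statistics for a list of numbers."""
    return {"mean": mean(values), "median": median(values), "stdev": stdev(values)}


def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


# Hypothetical data: hours studied versus test score.
hours = [1, 2, 3, 4, 5]
scores = [52, 54, 56, 58, 60]

summary = describe(scores)
slope, intercept = fit_line(hours, scores)
print(summary)
print(f"score = {slope:.1f} * hours + {intercept:.1f}")
```

In practice, libraries built for statistics and machine learning would usually replace these hand-written functions; the sketch only shows what the techniques compute.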
1. Data Collection Tools
Semantria
Semantria is a cloud-based tool that extracts information by
analyzing text and the sentiment it expresses. It is a high-end
NLP (natural language processing) tool that can detect the
sentiment attached to specific elements of a text based on the
language used (sounds like magic? No, it is science!).
Trackur
Trackur is another tool that collects data, especially from
social media platforms, by tracking feedback on brands and
products. It also performs sentiment analysis. It is used for
monitoring and can be of great value to marketing
companies.
Today, many other applications offer similar text and sentiment
analysis and content management, e.g., OpenText and Opinion Crawl.
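The idea behind sentiment analysis can be illustrated with a toy word-list scorer. This is only a sketch under simple assumptions: the word lists and function names are invented for the example, and commercial tools such as Semantria or Trackur use far more sophisticated language models than this.

```python
import re

# Hypothetical, very small sentiment lexicons.
POSITIVE = {"good", "great", "love", "excellent", "happy"}
NEGATIVE = {"bad", "poor", "hate", "terrible", "slow"}


def sentiment_score(text):
    """Count positive words minus negative words in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum((w in POSITIVE) - (w in NEGATIVE) for w in words)


def label(text):
    """Map the numeric score to a coarse sentiment label."""
    score = sentiment_score(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


for review in [
    "I love this phone, the camera is great",
    "Terrible battery and slow delivery",
    "The box arrived on Monday",
]:
    print(label(review), "-", review)
```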
2. Data Storage Tools
These tools are used to store huge amounts of data, typically spread across
shared computers, and to interact with it. They provide a platform that unites
servers so that data can be accessed easily.
Apache Hadoop
It is a software framework for storing and processing very large volumes of
data. It provides a layered structure that distributes data storage across
clusters of computers so that big data can be processed easily.
Apache Cassandra
This tool is free and open source. It uses CQL (Cassandra Query Language),
which resembles SQL, to communicate with the database. It provides high
availability of data stored across multiple servers.
MongoDB
It is a document-oriented database that is free to use. It is available on
multiple platforms such as Windows, Solaris, and Linux. It is easy to learn
and reliable.
Similar data storage platforms are CouchDB, Apache Ignite, and Oracle
NoSQL Database.
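To show what "document-oriented" means, here is a toy in-memory collection written in plain Python. It is a sketch only: the class and its methods are hypothetical and do not reproduce the API of MongoDB, CouchDB, or any other product named above. The point is that each record is a JSON-like document and documents in one collection need not share the same fields.

```python
import json


class DocumentCollection:
    """A toy in-memory collection of JSON-like documents."""

    def __init__(self):
        self._docs = []

    def insert(self, doc):
        # Round-trip through JSON to store an independent copy and to
        # reject values that could not be saved as a JSON document.
        self._docs.append(json.loads(json.dumps(doc)))

    def find(self, **criteria):
        """Return every document whose fields match all the criteria."""
        return [
            d for d in self._docs
            if all(d.get(field) == value for field, value in criteria.items())
        ]


students = DocumentCollection()
students.insert({"name": "Asha", "dept": "CSE", "skills": ["R", "Python"]})
students.insert({"name": "Bala", "dept": "ECE"})  # no "skills" field
students.insert({"name": "Chitra", "dept": "CSE", "year": 3})

print(students.find(dept="CSE"))
```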
3. Data Extraction Tools
Data extraction tools are also known as web scraping tools. They
automatically extract information and data from websites. The
following tools can be used for data extraction.
Octoparse
It is a web scraping tool available in both free and paid versions. It
outputs data as structured spreadsheets, which are readable and
easy to use in further operations. It can extract phone numbers,
IP addresses, and email IDs along with other data from
websites.
Content Grabber
It is also a web scraping tool but comes with advanced features such
as debugging and error handling. It can extract data from almost
every website and provide structured data as output in the user's
preferred format.
Similar tools are Mozenda, Pentaho, and import.io.
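The core of web scraping, pulling structured data out of page markup, can be sketched with Python's standard library. This is an illustration only, not how Octoparse or Content Grabber work internally. The HTML snippet, the class name, and the simplified email pattern are assumptions made for the example, and the page is an inline string so nothing is downloaded.

```python
import csv
import io
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the visible text fragments of an HTML document."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def extract_emails(html):
    """Return the unique email addresses found in the page text, sorted."""
    parser = TextExtractor()
    parser.feed(html)
    text = " ".join(parser.chunks)
    # Simplified pattern; real-world email matching is more involved.
    return sorted(set(re.findall(r"[\w.+-]+@[\w-]+\.[\w.-]+", text)))


def to_csv(emails):
    """Write the addresses as a one-column CSV table."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["email"])
    for address in emails:
        writer.writerow([address])
    return buffer.getvalue()


# Hypothetical page content.
PAGE = """
<ul>
  <li>Sales: sales@example.com</li>
  <li>Support: support@example.org</li>
  <li>Sales again: sales@example.com</li>
</ul>
"""

print(to_csv(extract_emails(PAGE)))
```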
4. Data Cleaning / Refining Tools
Integrated with databases, data cleaning tools save time by
searching, sorting, and filtering the data that analysts will use.
The refined data is relevant and easier to use.
(Blei and Smyth, 2017)
DataCleaner
DataCleaner works with Hadoop data stores and is a very
powerful data indexing tool. It improves the quality of data by
removing duplicates and merging them into a single record. It can
also find missing patterns and specific data groups.
OpenRefine
This refining tool deals with tangled data. It cleans the data
before transforming it into another form. It provides fast and
easy access to data.
Similar data cleaning tools are MapReduce, RapidMiner, and
Talend.
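Two of the cleaning steps described above, merging duplicates into one record and finding missing values, can be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the choice of email as the matching key, and the sample records are all invented for the example and do not reflect how DataCleaner or OpenRefine are implemented.

```python
def normalize(record, key):
    """Strip stray whitespace and lowercase the field used for matching."""
    cleaned = {f: v.strip() if isinstance(v, str) else v for f, v in record.items()}
    cleaned[key] = cleaned[key].lower()
    return cleaned


def deduplicate(records, key):
    """Merge records that share the same key into a single record."""
    merged = {}
    for record in (normalize(r, key) for r in records):
        existing = merged.setdefault(record[key], {})
        for field, value in record.items():
            # Keep the first non-empty value seen for each field.
            if existing.get(field) in (None, ""):
                existing[field] = value
    return list(merged.values())


def missing_fields(record):
    """List the fields that are still empty after merging."""
    return [f for f, v in record.items() if v in (None, "")]


# Hypothetical raw records with a duplicate and some gaps.
raw = [
    {"email": "A@x.com ", "name": "Asha", "city": ""},
    {"email": "a@x.com", "name": "", "city": "Coimbatore"},
    {"email": "b@x.com", "name": "Bala", "city": None},
]

for row in deduplicate(raw, key="email"):
    print(row, "missing:", missing_fields(row))
```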
5. Data Analysis Tools
Data analysis tools not only analyze the data but also perform certain operations on
the data. These tools inspect the data and study data modeling to draw useful
information out of the data, which is conclusive and helps in decision-making for a
certain problem or query.
R
R is a widely used programming language for building software that performs
statistical computing and graphics. It runs on various platforms, including
Windows, macOS, and Linux. It is widely used by data analysts, statisticians,
and researchers.
Apache Spark
Apache Spark is a powerful analytics engine that provides real-time analysis and
processes data in mini-batches, micro-batches, and streams. It is productive
because it provides highly interactive workflows.
Python
Python is a powerful, high-level programming language that has been around for
quite a while. It was originally used mainly for application development, but it now
has a rich set of tools aimed at data science. Its output can be saved in CSV
format and opened as a spreadsheet.
Similar data analysis tools are Apache Storm, SAS, Flink, Hive, etc.
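As a small illustration of analysis in Python with CSV output, the sketch below averages a value per group and writes the result as CSV text. It uses only the standard library; the function names and the sales figures are hypothetical, and a real project would more likely rely on a dedicated data analysis library.

```python
import csv
import io
from collections import defaultdict
from statistics import mean


def average_by_group(rows, group_field, value_field):
    """Compute the mean of value_field for each distinct group_field."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_field]].append(float(row[value_field]))
    return {group: mean(values) for group, values in sorted(groups.items())}


def to_csv(averages, group_field, value_field):
    """Render the per-group averages as CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow([group_field, f"mean_{value_field}"])
    for group, average in averages.items():
        writer.writerow([group, f"{average:.2f}"])
    return buffer.getvalue()


# Hypothetical sales records, as they might be read from a CSV file.
sales = [
    {"region": "north", "amount": "120"},
    {"region": "south", "amount": "80"},
    {"region": "north", "amount": "100"},
]

averages = average_by_group(sales, "region", "amount")
print(to_csv(averages, "region", "amount"))
```

The resulting text can be saved to a `.csv` file and opened in any spreadsheet program.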
6. Data Visualization Tools
Data visualization tools present data graphically for clear insight. Many visualization tools combine the
functions discussed in the previous sections and support data extraction and analysis along with
visualization.
Python
Python, as mentioned above, is a powerful, general-purpose programming language that also provides
data visualization. It offers a wide range of graphics libraries that support the graphical representation of a
wide variety of data.
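The idea of turning numbers into a picture can be shown even without a graphics library. The sketch below draws a text-only bar chart; it is a stand-in for illustration, and the function name and the counts are hypothetical. A real chart would normally be produced with a plotting library such as Matplotlib.

```python
def text_bar_chart(data, width=20):
    """Render a dict of label -> value as horizontal bars made of '#'."""
    largest = max(data.values())
    label_width = max(len(label) for label in data)
    lines = []
    for label, value in data.items():
        # Scale each bar so the largest value fills the full width.
        bar = "#" * round(width * value / largest)
        lines.append(f"{label:<{label_width}} | {bar} {value}")
    return "\n".join(lines)


# Hypothetical counts, e.g. number of students using each tool.
usage = {"R": 10, "Python": 20, "Spark": 5}
print(text_bar_chart(usage))
```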
Tableau
With a very large user base, Tableau has been referred to as the grandmaster of all visualization software
by Forbes. It is commercial software (not open source) that can be integrated with databases, is easy to use,
and provides interactive data visualization in the form of bars, charts, and maps.
Orange
Orange is an open-source data visualization tool that supports data extraction, data analysis, and machine
learning. It does not require programming; instead, it has an interactive and user-friendly graphical user
interface that displays the data in the form of bar charts, networks, heat maps, scatter plots, and
trees.
Google Fusion Tables
It was a web service from Google that non-programmers could easily use for collecting data; Google retired
the service in December 2019. You could upload your data as CSV files and save them too. It looked much like
an Excel spreadsheet and allowed editing, with changes reflected in the visualizations in real time. It displayed
data in the form of pie charts, bars, timelines, line plots, and scatter plots. It allowed you to link the data tables
to your websites. You could also create a map based on your data, modify its coloring, and share it.
Similar popular data visualization apps and tools are Datawrapper, Qlik, and Gephi, which also support CSV
files as data input. Of these, Gephi is open source, while Qlik is a commercial product.
Data Scientist-Key tools