Data Science-New (Unit-I)
Data Science, Big Data, Data, and the Data Science process.
Introduction
Data Scientists have the ability to find patterns and insights in oceans of data, much like an
astronomer looking out into deep space with telescopes to find new planets, black holes, and
galaxies amid billions of stars. Data Science, like science in general, is used to answer questions
about the world by combining different fields, e.g. Mathematics, Computer Science, Philosophy,
etc., along with distinct methodologies and novel technology that augment and enhance our ability
to answer those questions well.
What is Data?
Data is the raw ingredient of the data science process, and understanding what data is helps us
work more efficiently and appreciate what data science is all about.
1. According to the Cambridge English Dictionary: Information, especially facts or
numbers, collected to be examined and considered and used to help decision-making.
2. According to Wikipedia: A set of values of qualitative or quantitative variables.
Based on the definition by Wikipedia, data can be broken down into the terms set, variables,
qualitative and quantitative.
Set
The population from which data is drawn
Variables
Input variable (X, predictor, independent variable)
Output variable (Y, response, dependent variable)
Quantitative
information about quantity (can be counted and measured)
age, height, weight, number of cases, etc.
Qualitative
descriptive variables (can be observed but not measured)
color, blood type, infected or not, address, etc.
Example
Taking the COVID-19 pandemic as an example, let's say we want to visualize the number of
confirmed cases in the US with a simple scatter plot:
Set - the confirmed cases in the United States
Independent variable, X - time (days)
Dependent variable, Y - the number of confirmed cases
Both X and Y are quantitative variables
The result of the plot can also be used to depict the relationship between X and Y, either a
positive or negative correlation. With the use of statistical learning techniques, algorithms such as
linear regression can be used to build models for predictions and inference purposes.
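As a rough sketch of this idea in Python (the file name us_confirmed.csv and the column names day and confirmed are assumptions for illustration, not part of the original example), the scatter plot and a simple linear-regression fit could look like this:

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Load the set: daily confirmed cases in the US (hypothetical file and columns)
df = pd.read_csv("us_confirmed.csv")
X = df[["day"]]                      # independent variable, X: time (days)
y = df["confirmed"]                  # dependent variable, Y: number of confirmed cases

# Fit a simple linear regression model for prediction/inference
model = LinearRegression().fit(X, y)

# Scatter plot of the observations plus the fitted line
plt.scatter(df["day"], df["confirmed"], label="observed")
plt.plot(df["day"], model.predict(X), color="red", label="linear fit")
plt.xlabel("Time (days)")
plt.ylabel("Confirmed cases")
plt.legend()
plt.show()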
Sources of Data
Data comes from many places, especially in this era where smartphone usage has increased
dramatically due to social media and the rise of streaming services such as Netflix and Spotify.
Data can be categorized as internal or external: internal data is information generated within a
business, such as finance, while external data is information from customers, usage analytics, and
so on. Good data is also often hard to find; in most cases we will have to mine it from the internet
to perform analyses, and a lot of cleaning is required for it to be useful.
As Einstein puts it: “If I had an hour to solve a problem and my life depended on the solution,
I would spend the first 55 minutes determining the proper question to ask… for once I know the
proper question, I could solve the problem in less than five minutes.”
To summarize, the data science process consists of:
1. Formulating the question
2. Data Collection
3. Data Cleaning
4. Data Analysis and Exploration
5. Data Modelling
6. Communicating results
What do Data Scientists do?
The work of Data Scientists has far-reaching applications in numerous areas. But to give an
idea of how Data Science contributes to society, let's take a look at Nate Silver, the founder and
editor-in-chief of FiveThirtyEight, who uses statistical analysis to tell compelling stories, mainly
about elections, politics, sports, science, economics, etc.
One of his most notable works was on the 2016 election, where he produced accurate predictions
based on statistical techniques. We can also check out the website's forecast for the 2020 United
States election.
One important lesson from that example is that data science tools and methodologies merely
increase efficacy and speed; to really make use of them requires the knowledge and ability to
choose the right factors, while eliminating the wrong ones, in order to reach the right conclusion.
And doing so requires one to ask the right questions in the first place. This is something we have to
understand if we decide to go into this field.
Data Engineer
Data engineers are becoming more important in the age of big data, and can be thought of
as a type of data architect. They are less concerned with statistics, analytics, and modelling than their
data scientist/analyst counterparts, and much more concerned with data architecture, computing
and data storage infrastructure, data flow, and so on.
The data used by data scientists and big data applications often come from multiple sources,
and must be extracted, moved, transformed, integrated, and stored (e.g., ETL/ELT) in a way that’s
optimized for analytics, business intelligence, and modelling. Data engineers are therefore
responsible for data architecture, and for setting up the required infrastructure. As such, they need
to be competent programmers with skills very similar to someone in a DevOps role, and with strong
data query writing skills as well. Another key aspect of this role is database design (RDBMS,
NoSQL, and NewSQL), data warehousing, and setting up a data lake. This means that they must be
very familiar with many of the available database technologies and management systems, including
those associated with big data (e.g., Hadoop, Redshift, Snowflake, S3, and Cassandra). Lastly, data
engineers also typically address non-functional infrastructure requirements such as scalability,
reliability, durability, availability, backups, and so on.
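As a small illustrative sketch of an ETL step in Python (the file name raw_sales.csv, the column names, and the database warehouse.db are all hypothetical), assuming pandas and sqlite3 are available:

import sqlite3
import pandas as pd

# Extract: read raw data exported from a source system (hypothetical file)
raw = pd.read_csv("raw_sales.csv")

# Transform: clean and aggregate so the data is optimized for analytics
raw["order_date"] = pd.to_datetime(raw["order_date"])
daily = raw.groupby("order_date", as_index=False)["amount"].sum()

# Load: store the result where analysts and BI tools can query it
with sqlite3.connect("warehouse.db") as conn:
    daily.to_sql("daily_sales", conn, if_exists="replace", index=False)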
Introduction to Data Analysis Tools
Data analysis tools such as R Programming, Tableau Public, Python, SAS, Apache Spark,
Excel, RapidMiner, KNIME, QlikView, Splunk, etc. are used to collect, interpret and present data
for a wide range of applications and industries so that these data can be used for the prediction and
sustainable growth of the business. These tools have made data analysis easier for users, and their
variety has created a huge opening in the market, with strong demand for data analytics engineers.
6. Excel
Excel is a Microsoft software program that is part of the Microsoft Office productivity suite.
Excel is a core and common analytical tool used in almost every industry. Excel is essential when
analytics on a customer's internal data is required. It simplifies the complicated job of summarizing
data by using pivot tables to filter information according to customer requirements. Excel also has
advanced business analytics options that assist with modelling, such as automatic relationship
detection, DAX measures, and time grouping. In general, Excel is used to calculate cells, build pivot
tables, and chart data. For example, we can create a monthly budget in Excel, track business
expenses, or sort and organize large amounts of data in an Excel table.
7. RapidMiner
RapidMiner is a powerful integrated data science platform, developed by the company of the
same name, that carries out predictive and other sophisticated analytics, such as data mining, text
analytics, machine learning, and visual analytics, without any programming. RapidMiner can work
with data from many sources, including Access, Teradata, IBM SPSS, Oracle, MySQL, Sybase,
Excel, IBM DB2, Ingres, dBase, and others. The tool is powerful enough to generate analytics based
on real-life data transformation settings; for example, for predictive analysis we can manage formats
and data sets.
8. KNIME
KNIME was developed in January 2004 by a team of software engineers from the University
of Konstanz. It is an open-source workflow platform for building and executing data-processing
pipelines. KNIME uses nodes to build graphs that map the flow of data from input to output. With
its modular pipeline idea, KNIME is a leading open-source reporting and integrated analytics tool
for evaluating and modelling data through visual programming, integrating different data mining
elements and machine learning. Every node carries out a single task in the workflow. In the
following instance, a user reads some data using a File Reader node. The first 1000 rows are
subsequently filtered using a Row Filter node. Then, summary statistics are calculated using a
Statistics node, and the results are finally written to the user's hard drive by a CSV Writer node.
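For comparison, the same four-step workflow (read a file, keep the first 1000 rows, compute summary statistics, write a CSV) could be sketched in Python with pandas; the file names input.csv and summary.csv here are only placeholders:

import pandas as pd

data = pd.read_csv("input.csv")      # File Reader node: read some data
first_rows = data.head(1000)         # Row Filter node: keep the first 1000 rows
summary = first_rows.describe()      # Statistics node: compute summary statistics
summary.to_csv("summary.csv")        # CSV Writer node: save the results to disk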
9. QlikView
QlikView has many distinctive characteristics, such as patented technology and in-memory
processing, which can quickly deliver results to end customers and store the data in the document
itself. Data associations are maintained automatically in QlikView, and data can be compressed to
almost 10% of its original size. Relationships in the data are visualized using colors: one color for
related data and another for unrelated data. As a self-service BI tool, QlikView is usually easy for
most business users to adopt without needing unique data analysis or programming abilities. It is
often used in marketing, staffing, and sales departments, as well as in management dashboards to
monitor general company transactions at the highest management level. Most organizations provide
business users with training before giving them access to the software, although no unique abilities
are needed.
10. Splunk
Its first version, well appreciated by its users, was launched in 2004. It gradually went viral
among businesses, which began to purchase company licenses. Splunk is a software technology
used to monitor, search, analyze, and visualize machine-generated data in real time. It can track and
read various log files and save the data on indexers as events. With these tools, we can display
information on different types of dashboards. Splunk retrieves all text-based log data and offers an
easy way to search through it; a user can retrieve all kinds of information, compute all kinds of
interesting statistics, and present them in various formats.
11. IBM SPSS Modeler
IBM SPSS Modeler is a predictive big data analytics platform. It provides predictive
models and delivers them to individuals, groups, systems, and the enterprise. It contains a variety of
sophisticated analytics functions and algorithms that help find insights more quickly and fix issues
by analyzing structured and unstructured data. SPSS Modeler doesn't just explore our data; it is most
potent when used to uncover strong patterns in our ongoing business processes and then
capitalize on them by deploying business models in order to better predict choices and achieve
optimal results.
Second, for RStudio, the installer can be found at the following link:
https://www.rstudio.com/products/rstudio/download/
for Windows, Mac, and Linux (Ubuntu). Scroll down to "Installers for Supported Platforms", open
the chosen platform's link, run the installer after downloading, and follow the instructions. The
installer will eventually ask us where R is installed. Generally it defaults to the correct path for R
on the system, though we may have to find where we have installed R and type the path into the
RStudio installer manually.
File locations:
There are three key places versions of a file may reside while version controlling with Git:
the working directory, the staging area, or the Git directory.
Working directory: It is a local folder for a project's files. This means that any folder created
anywhere on a system is a working directory.
Note:
Files in the modified state reside in the working directory.
The working directory is different from the .git directory. That is, we create a working
directory while Git creates a .git directory.
Staging area: Technically called the "index" in Git parlance, it is a file, usually located in the .git
directory, that stores information about the files next in line to be committed into the .git directory.
Note:
Files in the staged state reside in the staging area.
Git directory: It is the folder (also called “repository”) that Git creates inside the working directory
we have instructed it to track. Also, the .git folder is where Git stores the object databases and
metadata of the file(s) we have instructed it to monitor.
Note:
The .git directory is the life of Git — it is the item copied when we clone a repository
from another computer (or from an online platform like GitHub).
Files in the committed state reside in the Git directory.
1. Modify files in the working directory. Note that any file we alter becomes a file in the modified
state.
2. Selectively stage the files we want to commit to the .git directory.
Note that any file we stage (add) into the staging area becomes a file in the staged state.
Also, be aware that staged files are not yet in the .git database.
Staging means information about the staged file gets included in a file (called "index") in the .git
repository.
3. Commit the file(s) we have staged into the .git directory. That is, permanently store a snapshot of
the staged file(s) into the .git database. Note that any file version we commit to the .git directory
becomes a file in the committed state (see the command sketch below).
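For example, assuming a file named report.txt (a hypothetical name used only for illustration), the modify-stage-commit cycle looks like this on the command line:

git status                        # shows report.txt as modified (working directory)
git add report.txt                # stages the file; its details go into the index (staging area)
git status                        # now shows report.txt as staged, ready to be committed
git commit -m "Update report"     # stores a snapshot of the staged file in the .git directory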
GitHub Demystified
GitHub is a web-based platform where users can host Git repositories. It helps us to facilitate
easy sharing and collaboration on projects with anyone at any time. It also encourages broader
participation in open-source projects by providing a secure way to edit files in another user's
repository.
To host (or share) a Git repository on GitHub, follow the steps below:
Step 1: Signup for a GitHub account
The first step to begin hosting on GitHub is to create a personal account. Visit the official
registration page to sign up.
Step 2: Create a remote repository in GitHub
After signing up for an account, create a home (a repository) in GitHub for the Git repository we
want to share.
Step 3: Connect the project’s Git directory to the remote repository
When a remote repository for the project is created, link the project’s .git directory — located locally
on the system — with the remote repository on GitHub.
To connect to the remote repository, go inside the root directory of the project we want to share,
using the local terminal, and run:
git remote add origin https://github.com/username/reponame.git
Note:
Replace username in the code above with the GitHub username.
Likewise, replace reponame with the name of the remote repository we want to connect
to.
The command above implies that git should add the specified URL to the local project as a
remote reference with which the local .git directory can interact.
The origin option in the command above is the default name (a short name) Git gives to the
server hosting the remote repository.
That is, instead of the server's URL, Git uses the short name origin.
It is not compulsory to stick with the server’s default name. If we prefer another name rather
than origin, simply substitute the origin name in the git remote add command above with
any name we prefer.
Always remember that a server's short name (for example, origin) is nothing special! It only
exists locally to help us easily reference the server's URL. So feel free to change it to a
short name we can easily reference.
To rename any existing remote URL, use the git remote rename command like so:
git remote rename theCurrentURLName theNewURLName
Whenever we clone (download) any remote repo, Git automatically names that repo’s URL
origin. However, we can specify a different name with the git clone -o thePreferredName
command.
To see the exact URL stored for nicknames like origin, run the git remote -v command.
Step 4: Confirm the connection
Once we’ve connected the Git directory to the remote repository, check whether the connection was
successful by running git remote -v on the command line.
Afterward, check the output to confirm that the displayed URL is the same as the remote URL we
intend to connect to.
Step 5: Push a local Git repo to the remote repo
After successfully connecting our local directory to the remote repository, we can then begin
to push (upload) the local project upstream. Whenever we are ready to share the project elsewhere,
on any remote repo, simply instruct Git to push all the commits, branches, and files in the local .git
directory to the remote repository. The code syntax used to upload (push) a local Git directory to a
remote repository is git push -u remoteName branchName. That is, to push the local .git directory,
and assuming the remote URL’s short name is “origin”, run:
git push -u origin master
Note:
The command above implies that git should push the local master branch to the remote
master branch located at the URL named origin.
Technically, we can substitute the origin option with the remote repository’s URL.
Remember, the origin option is only a nickname of the URL we’ve registered into the local
.git directory.
The -u flag (upstream/tracking reference flag) automatically links the .git directory's local
branch with the remote branch. This allows us to use git pull without any arguments.
Step 6: Confirm the upload
Lastly, visit the GitHub repository page to confirm that Git has successfully pushed the local Git
directory to the remote repository.
Note:
We may need to refresh the remote repository's page for the changes to reflect.
GitHub also has a free optional facility to convert the remote repository into a functional
website. Let's see how below.
5 Git workflows and branching strategies we can use to improve our development process
Different Git workflows, along with their benefits and their cons, are given below.
1. Basic Git Workflow
The most basic git workflow is the one where there is only one branch — the master branch.
Developers commit directly into it and use it to deploy to the staging and production environment.
In the basic Git workflow, all commits get added directly to the master branch. This
workflow isn't usually recommended unless we're working on a side project and looking to
get started quickly. Since there is only one branch, there really is no process here, which makes
it effortless to get started with Git. However, some cons to keep in mind when using this
workflow are:
1. Collaborating on code will lead to multiple conflicts.
2. Chances of shipping buggy software to production are higher.
3. Maintaining clean code is harder.
4. Gitflow Workflow
The Gitflow workflow is very similar to the previous workflow we discussed, combined with
two other branches: the release branch and the hot-fix branch.
The hot-fix branch:
The hot-fix branch is the only branch that is created from the master branch and directly
merged to the master branch instead of the develop branch. It is used only when we have to quickly
patch a production issue. An advantage of this branch is that it allows us to quickly deploy a fix for
a production issue without disrupting others' workflow or having to wait for the next release cycle.
Once the fix is merged into the master branch and deployed, it should be merged into both
develop and the current release branch. This is done to ensure that anyone who forks off develop to
create a new feature branch has the latest code.
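As a hedged sketch of this hot-fix flow using standard Git commands (the branch name hotfix/login-crash is hypothetical), the sequence might look like:

git checkout master
git checkout -b hotfix/login-crash     # create the hot-fix branch from master
# ...commit the fix on the hot-fix branch, then deploy...
git checkout master
git merge hotfix/login-crash           # merge the fix into master
git checkout develop
git merge hotfix/login-crash           # also merge it into develop (and the current release branch)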
The release branch:
The release branch is forked off of develop branch after the develop branch has all the
features planned for the release merged into it successfully. No code related to new features is added
into the release branch. Only code that relates to the release is added to the release branch. For
example, documentation, bug fixes, and other tasks related to this release are added to this branch.
Once this branch is merged with master and deployed to production, it’s also merged back
into the develop branch, so that when a new feature is forked off of develop, it has the latest code.
Language support
Tools to convert Markdown to PDF/DOC/HTML:
pandoc - a tool to convert Markdown content to PDF, Word, and HTML documents
GitHub Flavored Markdown - the Markdown dialect GitHub uses to format text and syntax-highlight code blocks
What is markdown used for?
Markdown is text content written in a plain-text format by content writers and converted into
different formats such as HTML and PDF. The input is plain, easily readable text with a special
syntax, and an application converts it to another format; the output is HTML, PDF, or Word.
Markdown files can be created and written with any popular text editor. It can be used in many ways:
Writing content for static site generators, which generate HTML content
Formatting messages to other users in Slack with this special syntax
Including this syntax for HTML elements in Jira content
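For example, assuming a Markdown file named notes.md (a hypothetical name), pandoc can perform the conversions mentioned above; note that PDF output additionally requires a LaTeX engine to be installed:

pandoc notes.md -o notes.html
pandoc notes.md -o notes.docx
pandoc notes.md -o notes.pdf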