
A Brief Introduction to Data Science

Data Science, Big Data, Data, and the Data Science process.
Introduction
Data Scientists have the ability to find patterns and insights in oceans of data, much as an astronomer peers into deep space with a telescope to find new planets, galaxies, and black holes amid billions of stars. Data Science, like science in general, is used to answer questions about the world by combining different fields, i.e. mathematics, computer science, philosophy, etc., along with distinct methodologies and novel technology that augment and enhance our ability to answer those questions well.

What is Data Science?


According to a special report in The Economist:
A data scientist is broadly defined as someone who combines the skills of software programmer, statistician, and storyteller/artist to extract the nuggets of gold hidden under mountains of data. In other words, data science is an amalgamation of skills: statistics, mathematics, computer science, data wrangling, cleaning, visualization, analysis, etc.
It is the mastery of all these skills that makes a data scientist valuable in this era of big data. More importantly, what makes Data Science crucial is its ability to turn numbers and figures into insights that support fact-based, faster decision making, which is crucial in this fast-paced world. Like a machine that turns simple carbon into precious diamonds, companies and businesses are now able to utilize data, which would be fruitless by itself, in many ways.
Aside from businesses that employ data scientists for profit and expansion, data science (prediction and inference) also has myriad uses that contribute to social good and tell stories with beautiful visualizations of data: predicting diseases, natural disasters, or crime, visualizing the latest COVID-19 figures, the slow rise of global temperature due to climate change, or the gradual decline of poverty and hunger at a global scale.

Data Science Venn Diagram


According to the well-known Data Science Venn Diagram by Drew Conway, Data Science can be equated to:
Hacking Skills + Math and Statistics Knowledge + Substantive Expertise
Substantive expertise = asking the right question and formulating the problem well
Hacking skills = extracting and mining data, cleaning and formatting data, visualizing and analyzing data, writing concise machine learning and deep learning code, and making use of tools and packages
Math and statistics = the underlying knowledge for performing analysis, using statistical learning techniques to transform information into something useful (prediction and inference), and making decisions

Why is Data Science a thing now?


Data Science actually has a long history that goes back to the 1960s. The reason it is so popular now is (1) the exponential growth of data (big data), and (2) the rise of inexpensive computing power, which has roughly tracked Moore's Law, due to innovation and advancements in technology.
Together, these two factors created the ideal environment to enrich data as well as the tools used to analyse it, upgrading computer infrastructure (memory, processors, software, etc.) along the way. What we are able to do today with cloud infrastructure and supercomputers would have been fiction back then, and we are still making breakthroughs and breaking barriers each day.
Data Science is closely related to Big Data, since the more data we have, the more accurate the conclusions that can be drawn during analysis.

What is Big Data?


According to Oxford Languages, Big data is defined as: Extremely large data sets that may
be analysed computationally to reveal patterns, trends, and associations, especially relating to
human behaviour and interactions. In other words, big data is simply defined as a large collection
of data.
3 Vs of Big Data
There are three characteristics of big data, known as the 3 Vs of Big Data, that can help us understand the term.
1. Volume
 The amount of generated and stored data. Its size determines the value and potential insights and categorizes a data set as big data.
2. Variety
 The type and nature of the data, structured or unstructured, qualitative or quantitative, e.g. images, text, audio, video, etc.
3. Velocity
 The speed at which data is generated and processed. Big data is often generated in real time, such as the number of YouTube videos watched every day globally or the number of COVID-19 cases worldwide.

What is Data?
Data is the ingredient for data science processes, and understanding what data is helps us be more efficient and appreciate what data science is all about.
1. According to the Cambridge English Dictionary: Information, especially facts or
numbers, collected to be examined and considered and used to help decision-making.
2. According to Wikipedia: A set of values of qualitative or quantitative variables.
Based on the definition by Wikipedia, data can be broken down into the terms set, variables,
qualitative and quantitative.

Set
The population from which data is drawn
Variables
 Input variable (X, predictor, independent variable)
 Output variable (Y, response, dependent variable)
Quantitative
 information about quantity (can be counted and measured)
 age, height, weight, number of cases, etc.
Qualitative
 descriptive variables (can be observed but not measured)
 color, blood type, infected or not, address, etc.

Example
Taking the COVID-19 pandemic as an example, let's say we want to visualize the number of confirmed cases in the US with a simple scatter plot:
 set - the confirmed cases of the United States
 Independent variable, X - time (days)
 Dependent variable, Y - the number of confirmed cases
 Both X and Y are quantitative variables
The resulting plot can also be used to depict the relationship between X and Y, either a positive or a negative correlation. With the use of statistical learning techniques, algorithms such as linear regression can be used to build models for prediction and inference purposes.
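As a rough illustration, the sketch below (in Python, using pandas, matplotlib, and scikit-learn) plots such data and fits a simple linear regression. The file name us_confirmed.csv and its columns day and confirmed are hypothetical placeholders, not real data.

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

df = pd.read_csv("us_confirmed.csv")     # hypothetical file: one row per day
X = df[["day"]]                          # independent variable: time (days)
y = df["confirmed"]                      # dependent variable: confirmed cases

plt.scatter(X, y, s=10)                  # simple scatter plot of Y against X
plt.xlabel("Time (days)")
plt.ylabel("Confirmed cases")
plt.show()

model = LinearRegression().fit(X, y)     # fit a straight line for prediction/inference
print("slope:", model.coef_[0], "intercept:", model.intercept_)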

Data is messy and not perfect


When we work on data science, data is almost always messy and unstructured, and it takes skill, patience, and time to clean and structure it so that it is ready to use. Take image data, for example: if we were to build a facial recognition model that detects a face, the input images could be dark, grainy, or blurry, and such messy image data can be difficult to deal with. Another aspect is missing data: data mined or collected from the real world often contains missing information, and several techniques are used to deal with it.
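As a minimal sketch of this kind of cleaning, the Python snippet below (using pandas on a small made-up table) shows two common ways of handling missing values: dropping incomplete rows and filling the gaps with a summary value.

import numpy as np
import pandas as pd

# Made-up data with missing entries (NaN / None)
df = pd.DataFrame({"age": [25, np.nan, 40, 31],
                   "blood_type": ["A", "O", None, "B"]})

dropped = df.dropna()                                          # option 1: drop rows with missing values
filled = df.copy()
filled["age"] = filled["age"].fillna(filled["age"].mean())     # option 2: impute the mean
filled["blood_type"] = filled["blood_type"].fillna("unknown")  # or a placeholder category
print(dropped)
print(filled)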

Sources of Data
Data comes from many places, especially in this time where smartphone usage has
dramatically increased due to social media, and the rise of streaming services such as Netflix and
Spotify. Data can be categorized into internal or external, where internal is information generated
within a business, such as finance, and external is information from the customer, usage analytics,
etc. Good data is also often hard to find; in most cases, we will have to mine it from the internet to perform analyses, and lots of cleaning is required for it to be useful.

Data is of Secondary importance


The most important rule that data scientists should adhere to is to always ask questions first before seeking out data. Just as the scientific method starts off with a hypothesis, data science starts off with questions that are crucial to solving the problem at hand.

As Einstein puts it: “If I had an hour to solve a problem and my life depended on the solution,
I would spend the first 55 minutes determining the proper question to ask… for once I know the
proper question, I could solve the problem in less than five minutes.”

The Data Science Process


Similar to the scientific method, data science has a process that turns data into insights. The process can be briefly summarized as follows (a compressed code sketch appears after the list below).
Data Science starts off with (1) generating questions, which helps us understand the problem well; once the questions are well formulated, it is time to (2) gather the data from relevant sources using data science techniques. After data is collected, it is time to (3) clean the data, which formats the data and prepares it for the next step, (4) analysis and exploration, where statistical methods are used to discover hidden patterns and relationships. After this comes (5) modeling, where machine learning models are constructed for prediction and inference. Lastly, (6) the results are communicated to inform others and used for decision making.

To summarize:
1. Formulating the question
2. Data Collection
3. Data Cleaning
4. Data Analysis and Exploration
5. Data Modelling
6. Communicating results
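As a very compressed sketch (not a complete project), the Python outline below maps these six steps onto typical pandas/scikit-learn calls; the file name cases.csv and its columns day and cases are assumptions made purely for illustration.

# 1. Question: do case counts grow with time?
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.read_csv("cases.csv")                              # 2. Data collection (hypothetical file)
df = df.dropna()                                           # 3. Data cleaning
print(df.describe())                                       # 4. Analysis and exploration
model = LinearRegression().fit(df[["day"]], df["cases"])   # 5. Modelling
print("Estimated daily growth:", model.coef_[0])           # 6. Communicating results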
What do Data Scientists do?
Data Scientists have far-reaching applications in multitudinous areas. But to give an idea of how Data Science contributes to society, let's take a look at Nate Silver, the founder and editor in chief of FiveThirtyEight, who uses statistical analysis to convey compelling stories, mainly about elections, politics, sports, science, and economics.
One of his most notable works was on the 2016 election, which produced accurate predictions based on statistical techniques. We can also check out the website's forecast of the 2020 United States election.
One important lesson from that example is that data science tools and methodologies merely increase efficacy and speed; to really make use of them requires the knowledge and ability to choose the right factors, while eliminating the wrong ones, to produce the right conclusion. And to do so requires one to ask the right questions in the first place. This is something we have to understand if we are deciding to go into this field.

Data Science Goals and Deliverables


In order to understand the importance of a data scientist's core pillars (skills), one must first understand the typical goals and deliverables associated with data science initiatives, and also the data science process itself. Let's first discuss some common data science goals and deliverables.
Here is a short list of common data science deliverables:
 Prediction (predict a value based on inputs)
 Classification (e.g., spam or not spam)
 Recommendations (e.g., Amazon and Netflix recommendations)
 Pattern detection and grouping (e.g., classification without known classes)
 Anomaly detection (e.g., fraud detection)
 Recognition (image, text, audio, video, facial, …)
 Actionable insights (via dashboards, reports, visualizations, …)
 Automated processes and decision-making (e.g., credit card approval)
 Scoring and ranking (e.g., FICO score)
 Segmentation (e.g., demographic-based marketing)
 Optimization (e.g., risk management)
 Forecasts (e.g., sales and revenue)

The Data Science Process


Data scientists usually follow a process when creating models using machine learning and related techniques, as given below. The GABDO Process Model consists of five iterative phases (goals, acquire, build, deliver, optimize), hence the acronym GABDO. Each phase is iterative because any phase can loop back to one or more earlier phases. Feel free to check out the source book to learn more about the process and its details.
The GABDO Process Model (Data Scientist Pillars, Skills, and Education In-Depth)
It is true that many off-the-shelf products can be used relatively easily, and one can probably obtain pretty decent results depending on the problem being solved, but there are many aspects of data science where experience and chops are critically important.
Some of these include having the ability to:
 Customize the approach and solution to the specific problem at hand in order to maximize
results, including the ability to write new algorithms and/or significantly modify the existing
ones, as needed
 Access and query many different databases and data sources (RDBMS, NoSQL, NewSQL),
as well as integrate the data into an analytics-driven data source (e.g., OLAP, warehouse,
data lake, …)
 Find and choose the optimal data sources and data features (variables), including creating
new ones as needed (feature engineering)
 Understand all statistical, programming, and library/package options available, and select
the best
 Ensure data has high integrity (good data), quality (the right data), and is in optimal form
and condition to guarantee accurate, reliable, and statistically significant results
 Avoid the issues associated with garbage in equals garbage out
 Select and implement the best tooling, algorithms, frameworks, languages, and technologies
to maximize results and scale as needed
 Choose the correct performance metrics and apply the appropriate techniques in order to
maximize performance
 Discover ways to leverage the data to achieve business goals without guidance and/or
deliverables being dictated from the top down, i.e., the data scientist as the idea person
 Work cross-functionally, effectively, and in collaboration with all company departments and
groups
 Distinguish good from bad results, and thus mitigate the potential risks and financial losses
that can come from erroneous conclusions and subsequent decisions
 Understand product (or service) customers and/or users, and create ideas and solutions with
them in mind

The “Science” in Data Science


The term science is usually synonymous with the scientific method, and some of us may have noticed that the process outlined above is very similar to the process characterized by the expression "scientific method". The scientific method itself is best visualized as an ongoing, iterative cycle rather than a one-way sequence of steps.
Data Scientists vs. Data Analysts vs. Data Engineers
As mentioned, often the data scientist role is confused with other similar roles. The two main
ones are data analysts and data engineers, both quite different from each other, and from data science
as well.
Data Analyst
Data analysts share many of the same skills and responsibilities as a data scientist, and
sometimes have a similar educational background as well. Some of these shared skills include the
ability to:
 Access and query (e.g., SQL) different data sources
 Process and clean data
 Summarize data
 Understand and use some statistics and mathematical techniques
 Prepare data visualizations and reports
Some of the key differences, however, are that data analysts typically are not computer
programmers, nor responsible for statistical modelling, machine learning, and many of the other
steps outlined in the data science process above.
The tools used are usually different as well. Data analysts often use tools for analysis and business
intelligence like Microsoft Excel (visualization, pivot tables, …), Tableau, SAS, SAP, and Qlik.
Analysts sometimes perform data mining and modeling tasks, but tend to use visual platforms such
as IBM SPSS Modeler, Rapid Miner, SAS, and KNIME. Data scientists, on the other hand, perform
these same tasks usually with tools such as R and Python, combined with relevant libraries for the
language(s) being used. Lastly, data analysts tend to differ significantly in their interactions with
top business managers and executives. Data analysts are often given questions and goals from the
top down, perform the analysis, and then report their findings.
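For instance, a typical access-query-summarize task for an analyst might look like the hedged Python sketch below, which uses the standard sqlite3 module together with pandas; the database file and table are hypothetical.

import sqlite3
import pandas as pd

conn = sqlite3.connect("sales.db")                                        # hypothetical database file
df = pd.read_sql_query("SELECT region, revenue FROM sales", conn)         # query the data source
summary = df.groupby("region")["revenue"].agg(["count", "mean", "sum"])   # summarize the data
print(summary)                                                            # basis for a report or chart
conn.close()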

Data Engineer
Data engineers are becoming more important in the age of big data, and can be thought of
as a type of data architect. They are less concerned with statistics, analytics, and modelling than their data scientist/analyst counterparts, and are much more concerned with data architecture, computing and data storage infrastructure, data flow, and so on.
The data used by data scientists and big data applications often come from multiple sources,
and must be extracted, moved, transformed, integrated, and stored (e.g., ETL/ELT) in a way that’s
optimized for analytics, business intelligence, and modelling. Data engineers are therefore
responsible for data architecture, and for setting up the required infrastructure. As such, they need
to be competent programmers with skills very similar to someone in a DevOps role, and with strong
data query writing skills as well. Another key aspect of this role is database design (RDBMS,
NoSQL, and NewSQL), data warehousing, and setting up a data lake. This means that they must be
very familiar with many of the available database technologies and management systems, including
those associated with big data (e.g., Hadoop, Redshift, Snowflake, S3, and Cassandra). Lastly, data
engineers also typically address non-functional infrastructure requirements such as scalability,
reliability, durability, availability, backups, and so on.
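A toy extract-transform-load (ETL) step, of the kind a data engineer automates at far larger scale, might look like the Python sketch below; the file, table, and column names are assumptions for illustration only.

import sqlite3
import pandas as pd

raw = pd.read_csv("raw_events.csv")                   # extract from a source system
raw["timestamp"] = pd.to_datetime(raw["timestamp"])   # transform: fix data types
clean = raw.drop_duplicates().dropna()                # transform: deduplicate, drop nulls

conn = sqlite3.connect("warehouse.db")                # load into an analytics store
clean.to_sql("events", conn, if_exists="replace", index=False)
conn.close()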
Introduction to Data Analysis Tools
Data analysis tools such as R, Tableau Public, Python, SAS, Apache Spark, Excel, RapidMiner, KNIME, QlikView, and Splunk are used to collect, interpret, and present data for a wide range of applications and industries, so that the data can be used for prediction and the sustainable growth of a business. These tools have made data analysis easier for users, and their variety has created a huge opening in the market and a strong demand for data analytics engineers.

Top Data Analysis Tool


1. R Programming
R is a GNU project, written mainly in C and Fortran, with many modules written in R itself. It is a free language and software environment for statistical computing and graphics. R is a leading analytical tool in the industry, commonly used in data modeling and statistics. We can manipulate and present our information readily in various ways, and R has in numerous ways exceeded SAS in terms of data capacity, performance, and results. R compiles and runs on many platforms, including macOS, Windows, and Linux. It has 11,556 packages that can be browsed by category, and it also offers tools to install packages automatically, which can be combined to handle large information sets according to the user's needs.
2. Tableau Public
Tableau Public is free software that connects to any data source, including corporate data warehouses, web-based data, or Microsoft Excel, and generates visualizations, dashboards, maps, and so on, which can be presented on the web in real time. These can be shared with a customer or via social media, and the underlying files can be downloaded in various formats. We need very good data sources to see the full power of Tableau. Tableau's big data capabilities make it possible to analyse and visualize information better than any other data visualization software on the market.
3. Python
Python is an object-oriented, user-friendly, open-source language that is easy to read, write, and maintain, and free to use. Guido van Rossum created it in the late 1980s (it was first released in 1991), and it supports both functional and structured programming techniques. Python is easy to pick up because it is quite comparable to JavaScript, Ruby, and PHP. Python also has very good libraries for machine learning, e.g. Keras, TensorFlow, Theano, and scikit-learn. An important feature of Python is that it can work with data from many platforms and formats, such as MongoDB, JSON, and SQL Server, and it also handles text data very well. Python is quite simple, with a uniquely readable syntax, so developers find Python code much easier to read and interpret than many other languages.
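As a small, hedged illustration of the points above (readable syntax, JSON handling, and text processing), here is a snippet that uses only Python's standard library:

import json

record = json.loads('{"name": "Ada", "cases": [3, 5, 8]}')    # parse JSON into Python objects
total = sum(record["cases"])                                  # work with the parsed data directly
message = f"{record['name']} reported {total} cases"          # simple, readable text handling
print(message.upper())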
4. SAS
SAS stands for Statistical Analysis System. Its development began in 1966, and it was extended by the SAS Institute through the 1980s and 1990s. It is a programming environment and language for data management and an analytics leader. SAS is readily available and easy to manage, and information from all sources can be analyzed with it. In 2011, SAS launched a wide range of customer intelligence products and many SAS modules for web, social media, and marketing analytics, commonly applied to profiling clients and identifying future opportunities. It can also predict, manage, and optimize customer behavior. It uses in-memory and distributed processing to quickly analyze enormous databases, and it also helps build predictive models.
5. Apache Spark
Apache Spark was created in 2009 at the University of California, Berkeley's AMPLab. Spark is a fast, large-scale data processing engine that runs applications up to 100 times faster in memory and 10 times faster on disk than Hadoop MapReduce. Spark was designed with data science in mind, and its concepts make data science work easier; it is also well known for powering data pipelines and machine learning models. Spark includes a library, MLlib, that supplies a number of machine learning algorithms for recurring data science methods such as regression, classification, clustering, and collaborative filtering. The Apache Software Foundation released Spark to speed up the Hadoop computing process.
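A minimal PySpark/MLlib sketch, assuming pyspark is installed locally, might look like the following (the toy numbers are made up):

from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

spark = SparkSession.builder.appName("mllib-demo").getOrCreate()
df = spark.createDataFrame([(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)], ["x", "y"])   # toy data
data = VectorAssembler(inputCols=["x"], outputCol="features").transform(df)    # MLlib expects a feature vector
model = LinearRegression(featuresCol="features", labelCol="y").fit(data)       # regression via MLlib
print(model.coefficients, model.intercept)
spark.stop()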

6. Excel
Excel is a Microsoft program that is part of the Microsoft Office productivity suite. It is a core and common analytical tool, used in almost every industry, and is essential when analytics on a customer's internal data is required. It simplifies the complicated job of summarizing data by using pivot tables to filter the information according to customer requirements. Excel also has advanced business analytics options that assist with modeling, such as automatic relationship detection, DAX measures, and time grouping. In general, Excel is used for cell-based calculations, pivot tables, and charting. For example, we can create a monthly budget in Excel, track business expenses, or sort and organize large amounts of data with an Excel table.
7. RapidMiner
RapidMiner is a powerful integrated data science platform, developed by the company of the same name, which carries out predictive and other sophisticated analytics, such as data mining, text analytics, machine learning, and visual analysis, without any programming. RapidMiner can work with data from many sources, including Access, Teradata, IBM SPSS, Oracle, MySQL, Sybase, Excel, IBM DB2, Ingres, dBase, etc. The tool is powerful enough to generate analytics based on real-life data transformation settings; for example, for predictive analysis we can control the formats and data sets used.
8. KNIME
KNIME was developed in January 2004 by a team of software engineers at the University of Konstanz. It is an open-source workflow platform for building and executing data processing pipelines. KNIME uses nodes to build graphs that map the flow of data from input to output. With its modular pipeline concept, KNIME is a leading open-source reporting and integrated analytics tool for evaluating and modeling data through visual programming, integrating different data mining and machine learning components. Every node carries out a single step of the workflow. In the following example, a user reads in some data using a File Reader node; the first 1000 rows are then filtered using a Row Filter node; summary statistics are computed with a Statistics node; and the results are written to the user's hard drive by a CSV Writer node.
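The same four-node workflow described above (read, keep the first 1000 rows, summarize, write CSV) could be expressed in a few lines of Python for comparison; the input file name here is a placeholder.

import pandas as pd

df = pd.read_csv("input.csv")                    # File Reader node
first_rows = df.head(1000)                       # Row Filter node: first 1000 rows
print(first_rows.describe())                     # Statistics node: summary statistics
first_rows.to_csv("output.csv", index=False)     # CSV Writer node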
9. QlikView
QlikView has many distinctive characteristics, such as patented in-memory processing, which delivers results to end users quickly and stores the data in the document itself. Data associations are maintained automatically in QlikView, and data can be compressed to almost 10% of its original volume. Relationships between data are visualized using color: one color for associated data and another for unrelated data. As a self-service BI tool, QlikView is usually easy for most business users to adopt without special data analysis or programming skills. It is often used in marketing, staffing, and sales departments, as well as in management dashboards to monitor general company transactions at the highest management level. Most organizations provide business users with training before giving them software access, although no special skills are needed.
10. Splunk
Splunk's first version, much appreciated by its users, was launched in 2004. It gradually went viral among businesses, which began to purchase enterprise licenses. Splunk is a software technology used to monitor, search, analyze, and visualize machine-generated data in real time. It can track and read various log files and save the information on indexers as events. With these tools we can display information on different types of dashboards. Splunk retrieves all text-based log data and offers an easy way to search through it; a user can retrieve all kinds of information, compute all kinds of interesting statistics, and present them in various formats.
11. IBM SPSS Modeler
IBM SPSS Modeler is a predictive big data analytics platform. It provides predictive models and delivers them to individuals, groups, systems, and the enterprise. It contains a variety of sophisticated analytics and algorithms and helps users find answers and fix issues more quickly by analyzing structured and unstructured data. SPSS Modeler does not just explore our data; it is most powerful when used to uncover strong patterns in our ongoing business processes and then capitalize on them by deploying models, in order to better predict choices and achieve optimal results.

Introducing R and Rstudio


Introduction
While R is essential for many statisticians and data scientists due to its computational power and great flexibility, it may not be obvious how R is useful for scientists looking to do data processing and simple analyses. However, R has many advantages over other popular data analysis software suites such as SAS or SPSS, even for investigators who do not require R's advanced capabilities.

1. R is free to download and use


 SAS and SPSS are very costly to use and/or require access through an employer license
 R is easy to download and install on Windows, Mac, and Linux
 R takes up less space to install than other such software
2. R is open-source
 Users can expand the functionality of R through add-ons called packages
 Capabilities of R are continually growing, as it doesn't require large-scale releases to expand functionality
3. Data processing in R is very easy
 Can import datasets from most other programs, including Excel, SAS, and SPSS
 Creating subsets of our data, creating new variables, and selecting specific observations and variables is very easy
 Great flexibility in these tools
4. Data visualization tools in R are very extensive
 Very flexible tools for creating custom graphs and tables
5. Advanced functionality often used in practice by scientists is available in R
 Very robust mixed modelling, principal component, factor analysis, structural equation
modelling, etc.
6. Will improve one’s understanding of statistics
 Key part of understanding statistics is through real data analysis
 Due to the breadth of its data analysis tools and the way R is operated, understanding R
will foster an improved understanding of statistics
7. It is very easy to share the output from R
 Can easily select to display only the output we are interested in
 Easy to save and load figures, datasets, etc. created in R
 Due to R using programming through scripts (to be discussed later), our analyses are completely and easily reproducible and shareable
 Can save the results and code in a report-style form with notes using R Markdown (to
be discussed later)
8. R provides reproducibility for our analyses
 Use of scripts means every step of our analysis is documented and can be easily shared

R and RStudio: What is the difference?


Initially, many people are confused about the difference between R and RStudio. RStudio is actually an add-on to R: it takes the R software and adds a very user-friendly graphical interface on top of it. Thus,
when one uses RStudio, they are still using the full version of R while also getting the benefit of
greater functionality and usability due to an improved user interface. As a result, when using R, one
should always use RStudio; working with R itself is very cumbersome. Since RStudio is an add-on
to R, we must first download and install R as well as RStudio, two steps which are done separately.
On our computer, we will see R and RStudio as separate installed programs. When using R for data analysis, we will always open and work in RStudio; we must leave R installed on the computer for RStudio to work, even though we will likely never open R itself.

Installing R and RStudio


To work efficiently, we must download and install both R and RStudio. First, the installer for R can be found by opening the following link:
https://cran.r-project.org/mirrors.html
and then selecting the mirror link closest to our location. After opening this mirror link, we will see "Download and Install R" with links for Windows, Mac, and Linux installers. Always select the newest version posted, then run the installer and follow the instructions.

Second, the installer for RStudio can be found at the following link:
https://www.rstudio.com/products/rstudio/download/
for Windows, Mac, and Linux (Ubuntu). Scroll down to "Installers for Supported Platforms", open the chosen platform's link, run the installer after downloading, and follow the instructions. The installer will eventually ask where R is installed. Generally it defaults to the correct path on the system, though we may have to find where we installed R and type the path into the RStudio installer manually.

Global Information Tracker (GiT)


Git is a Distributed Version Control System (DVCS) used to save different versions of a file (or set of files) so that any version is retrievable. It also makes it easy to record and compare different file versions, so that details such as what changed, who changed it, or who introduced an issue are reviewable at any time.
The term "distributed" means that whenever we instruct Git to share a project's directory, Git does not share only the latest file version. Instead, it distributes every version it has recorded for that project. This is in sharp contrast to other version control systems, which share only whatever single version a user has explicitly checked out from the central/local database.

What is a Version Control System?


A Version Control System (VCS) refers to the method used to save a file's versions for future
reference. Intuitively, many people already version control their projects by renaming different
versions of the same file in various ways like blogScript.js, blogScript_v2.js, blogScript_v3.js,
blogScript_final.js, blogScript_definite_final.js, and so on. But this approach is error-prone and
ineffective for team projects. Also, tracking what changed, who changed it, and why it was changed
is a tedious endeavor with this traditional approach. This illuminates the importance of a reliable
and collaborative version control system like Git. However, to get the best of Git, it is essential to
understand how Git handles the files.

Files states in Git


In Git, there are three primary states (conditions) in which a file can be: modified state,
staged state, or committed state.
Modified state: A file in the modified state is a revised — but uncommitted (unrecorded) — file.
In other words, files in the modified state are files we have modified but have not explicitly
instructed Git to monitor.
Staged state: Files in the staged state are modified files that have been selected — in their current
state (version) — and are being prepared to be saved (committed) into the .git repository during the
next commit snapshot. Once a file gets staged, it implies that we have explicitly authorized Git to
monitor that file’s version.
Committed state: Files in the committed state are files successfully stored into the .git repository.
Thus, a committed file is a file in which we have recorded its staged version into the Git directory
(folder).

File locations:
There are three key places versions of a file may reside while version controlling with Git:
the working directory, the staging area, or the Git directory.

Working directory: It is a local folder for a project's files. This means that any folder created
anywhere on a system is a working directory.
Note:
 Files in the modified state reside in the working directory.
 The working directory is different from the .git directory. That is, we create a working
directory while Git creates a .git directory.
Staging area: Technically called the "index" in Git parlance, it is a file, usually located in the .git directory, that stores information about the files next in line to be committed into the .git directory.
Note:
 Files in the staged state reside in the staging area.
Git directory: It is the folder (also called “repository”) that Git creates inside the working directory
we have instructed it to track. Also, the .git folder is where Git stores the object databases and
metadata of the file(s) we have instructed it to monitor.
Note:
 The .git directory is the life of Git — it is the item copied when we clone a repository
from another computer (or from an online platform like GitHub).
 Files in the committed state reside in the Git directory.

The basic Git workflow


Working with the Git Version Control System looks something like this:

1. Modify files in the working directory. Note that any file we alter becomes a file in the modified
state.

2. Selectively stage the files we want to commit to the .git directory.
Note that any file we stage (add) into the staging area becomes a file in the staged state.
Also, be aware that staged files are not yet in the .git database.
Staging means information about the staged file gets included in a file (called "index") in the .git
repository.
3. Commit the file(s) we have staged into the .git directory. That is, permanently store a snapshot of
the staged file(s) into the .git database. Note that any file version we commit to the .git directory
becomes a file in the committed state.
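In terms of commands, the three steps above usually look something like this (the file name and message are just examples):

git status                        # see which files are modified or untracked
git add index.html                # stage the file (modified -> staged)
git commit -m "Update homepage"   # record the staged snapshot (staged -> committed)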

GitHub Demystified
GitHub is a web-based platform where users can host Git repositories. It helps us to facilitate
easy sharing and collaboration on projects with anyone at any time. It also encourages broader
participation in open-source projects by providing a secure way to edit files in another user's
repository.
To host (or share) a Git repository on GitHub, follow the steps below:
Step 1: Signup for a GitHub account
The first step to begin hosting on GitHub is to create a personal account. Visit the official
registration page to sign up.
Step 2: Create a remote repository in GitHub
After signing up for an account, create a home (a repository) in GitHub for the Git repository we
want to share.
Step 3: Connect the project’s Git directory to the remote repository
When a remote repository for the project is created, link the project’s .git directory — located locally
on the system — with the remote repository on GitHub.

To connect to the remote repository, go inside the root directory of the project we want to share using the local terminal, and run:
git remote add origin https://github.com/ourusername/ourreponame.git
Note:
 Replace the username in the code above with our GitHub username. Likewise, replace the reponame with the name of the remote repository we want to connect to.
 The command above implies that Git should add the specified URL to the local project as a remote reference with which the local .git directory can interact.
 The origin option in the command above is the default name (a short name) Git gives to the server hosting the remote repository. That is, instead of the server's URL, Git uses the short name origin.
 It is not compulsory to stick with the server's default name. If we prefer another name rather than origin, simply substitute the origin name in the git remote add command above with any name we prefer.
 Always remember that a server's short name (for example, origin) is nothing special! It only exists locally to help us easily reference the server's URL. So feel free to change it to any short name we can easily reference.
 To rename any existing remote URL, use the git remote rename command like so:
git remote rename theCurrentURLName theNewURLName
 Whenever we clone (download) any remote repo, Git automatically names that repo's URL origin. However, we can specify a different name with the git clone -o thePreferredName command.
 To see the exact URL stored for a nickname like origin, run the git remote -v command.
Step 4: Confirm the connection
Once we’ve connected the Git directory to the remote repository, check whether the connection was
successful by running git remote -v on the command line.
Afterward, check the output to confirm that the displayed URL is the same as the remote URL we
intend to connect to.
Step 5: Push a local Git repo to the remote repo
After successfully connecting our local directory to the remote repository, we can then begin to push (upload) the local project upstream. Whenever we are ready to share the project elsewhere,
on any remote repo, simply instruct Git to push all the commits, branches, and files in the local .git
directory to the remote repository. The code syntax used to upload (push) a local Git directory to a
remote repository is git push -u remoteName branchName. That is, to push the local .git directory,
and assuming the remote URL’s short name is “origin”, run:
git push -u origin master
Note:
 The command above implies that git should push the local master branch to the remote
master branch located at the URL named origin.
 Technically, we can substitute the origin option with the remote repository’s URL.
Remember, the origin option is only a nickname of the URL we’ve registered into the local
.git directory.
 The -u flag (upstream/tracking reference flag) automatically links the .git directory's local
branch with the remote branch. This allows us to use git pull without any arguments.
Step 6: Confirm the upload
Lastly, visit the GitHub repository page to confirm that Git has successfully pushed the local Git directory to the remote repository.
Note:
 We may need to refresh the remote repository's page for the changes to reflect.
 GitHub also has a free optional facility (GitHub Pages) that can turn the remote repository into a functional website.
5 Git workflows and branching strategies we can use to improve our development process

Different Git workflows, along with their benefits and drawbacks, are given below.
1. Basic Git Workflow
The most basic git workflow is the one where there is only one branch — the master branch.
Developers commit directly into it and use it to deploy to the staging and production environment.

In the basic Git workflow, all commits are added directly to the master branch. This workflow isn't usually recommended unless we're working on a side project and we're looking to get started quickly. Since there is only one branch, there really is no process here. This makes
it effortless to get started with Git. However, some cons we need to keep in mind when using this
workflow are:
1. Collaborating on code will lead to multiple conflicts.
2. The chances of shipping buggy software to production are higher.
3. Maintaining clean code is harder.

2. Git Feature Branch Workflow


The Git Feature Branch workflow becomes a must-have when we have more than one developer working on the same codebase. Imagine one developer who is working on a new feature
and another developer working on a second feature. Now, if both the developers work from the
same branch and add commits to them, it would make the codebase a huge mess with plenty of
conflicts.

Git workflow with feature branches:


To avoid this, the two developers can create two separate branches from the master branch
and work on their features individually. When they’re done with their feature, they can then merge
their respective branch to the master branch, and deploy without having to wait for the second
feature to be completed.
The advantage of using this workflow is that the Git feature branch workflow allows us to collaborate on code without having to worry about code conflicts.
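A typical feature-branch cycle, using an example branch name, looks like this:

git checkout -b feature/login master   # create a feature branch from master
# ...edit files, then git add and git commit as usual...
git checkout master
git merge feature/login                # merge the finished feature back into master
git branch -d feature/login            # delete the merged branch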

3. Git Feature Workflow with Develop Branch


This workflow is one of the more popular workflows among developer teams. It’s similar
to the Git Feature Branch workflow with a develop branch that is added in parallel to the master
branch. In this workflow, the master branch always reflects a production-ready state. Whenever the
team wants to deploy to production they deploy it from the master branch. The develop branch
reflects the state with the latest development changes for the next release. Developers create
branches from the develop branch and work on new features. Once the feature is ready, it is tested,
merged with develop branch, tested with the develop branch’s code in case there was a prior merge,
and then merged with master.

Git workflow with feature and develop branches:


The advantage of this workflow is that it allows teams to consistently merge new features, test them in staging, and deploy to production. While maintaining code is easier, it can get a little tiresome for some teams since it can feel like going through a tedious process.

4. Gitflow Workflow
The gitflow workflow is very similar to the previous workflow we discussed combined with
two other branches — the release branch and the hot-fix branch.
The hot-fix branch:
The hot-fix branch is the only branch that is created from the master branch and directly
merged to the master branch instead of the develop branch. It is used only when we have to quickly
patch a production issue. An advantage of this branch is that it allows us to quickly deploy a fix for a production issue without disrupting others' workflow and without having to wait for the next release cycle.
Once the fix is merged into the master branch and deployed, it should be merged into both
develop and the current release branch. This is done to ensure that anyone who forks off develop to
create a new feature branch has the latest code.
The release branch:
The release branch is forked off of develop branch after the develop branch has all the
features planned for the release merged into it successfully. No code related to new features is added
into the release branch. Only code that relates to the release is added to the release branch. For
example, documentation, bug fixes, and other tasks related to this release are added to this branch.
Once this branch is merged with master and deployed to production, it’s also merged back
into the develop branch, so that when a new feature is forked off of develop, it has the latest code.

Gitflow workflow with hotfix and release branches:


This workflow was first published and popularized by Vincent Driessen, and since then it has been widely used by organizations that have a scheduled release cycle. Since git-flow is a wrapper around Git, we can install git-flow in the current repository. It's a straightforward process, and it doesn't change anything in the repository other than creating branches for us.
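Without the git-flow wrapper, a hotfix in this model can be reproduced with plain Git commands, for example:

git checkout -b hotfix/1.0.1 master    # branch the fix directly off master
# ...commit the patch...
git checkout master
git merge hotfix/1.0.1                 # deploy the fix from master
git checkout develop
git merge hotfix/1.0.1                 # keep develop (and any open release branch) up to date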

5. Git Fork Workflow


The Fork workflow is popular among teams that work on open-source software. The flow usually looks like this (example commands follow the list):
1. The developer forks the open-source software’s official repository. A copy of this repository
is created in their account.
2. The developer then clones the repository from their account to their local system.
3. A remote path for the official repository is added to the repository that is cloned to the local
system.
4. The developer creates a new feature branch in their local system, makes changes, and commits them.
5. These changes along with the branch are pushed to the developer’s copy of the repository
on their account.
6. A pull request from the branch is opened to the official repository.
7. The official repository’s manager checks the changes and approves the changes to get
merged into the official repository.
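The remote setup for this flow typically looks like the following, with placeholder user and project names:

git clone https://github.com/ourname/project.git                    # step 2: clone our fork
git remote add upstream https://github.com/official/project.git     # step 3: add the official repo as "upstream"
git checkout -b feature/docs                                         # step 4: new feature branch
# ...make changes and commit them...
git push origin feature/docs                                         # step 5: push to our fork
# steps 6-7: open a pull request from feature/docs to the official repository on GitHub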

The general workflow is as follows:


 We clone the Git repository as a working copy.
 We modify the working copy by adding or editing files.
 If necessary, we also update the working copy by taking in other developers' changes.
 We review the changes before committing.
 We commit the changes. If everything is fine, we push the changes to the repository.
 After committing, if we realize something is wrong, we correct the last commit and push the changes to the repository.
MARKDOWN
Markdown is an open-source markup language created by John Gruber. It is used to write readable, plain-text content with a special formatting syntax that can then be converted to HTML. It is simple and fun to learn, and it helps users write plain text and convert it to multiple formats like HTML, PDF, etc.
Advantages
 It is very easy to read and write plain text, which can be converted to rich HTML documents.
 Easy to learn, allowing both technical and non-technical people to write content effectively.
 Easy to test the content locally, and easy to add, update, and delete content.
 Support for popular visual editors.
 Extended syntax providing custom elements such as audio, video, etc.
 Easy to share the content between different devices.
 Markdown is a standard for writing content; GitHub, GitLab, and Reddit use it heavily.
Features
Feature - Description
Headings - Section headers
Lists - Display a list of elements
Tables - Tabular data
Comments - Comments are ignored by parsers
Links - Link to other documents
Images - Insert pictures
Blockquotes - Define quotations
Emphasis - Emphasize content (bold, italic)

Extended Markdown features


Feature - Description
MarkdownSharp - Extended Markdown syntax used by Stack Overflow
GitHub Flavored Markdown - GitHub's Markdown dialect, used to format and syntax-highlight code blocks with language support

Tools to convert to PDF/DOC/HTML
Feature - Description
pandoc - Tool to convert Markdown content to PDF, Word, and HTML documents
What is markdown used for?
Markdown content is plain text written by content writers and converted into different formats such as HTML or PDF. The input is plain, easily readable text with special syntax, and an application converts it to a different format; the output is HTML, PDF, or Word. Markdown can be created and written with any popular text editor, and it can be used in many ways:
 Writing content for static site generators, which generate HTML content
 Communicating with users in Slack using this special syntax
 Including similar syntax in Jira content for HTML-like elements

How do I open markdown files?


Markdown files can be opened in a simple text editor or in editors and IDEs such as Visual Studio, Atom, Sublime Text, Notepad++, and IntelliJ IDEA on Windows, UNIX, and macOS. Basic editors provide only the capability to read and write without formatting and validation. IDEs provide plugins with the following features:
 Automatic or manual code formatting
 Syntax highlighting
 Inline rendering of the content in an HTML view
 Export to PDF and Word documents (in some plugins)

Markdown - Block Elements


This section covers the Markdown format with examples, including the syntax for paragraphs and line breaks. The main block elements in Markdown are:
 paragraphs
 line breaks
Paragraphs and line breaks in markdown
A paragraph is a group of lines separated by a blank line. A line break within a paragraph can be created by ending a line with a backslash (\) or with two trailing spaces.
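A small example of these block elements, shown as raw Markdown source (the backslash forces a line break inside a paragraph):

This is the first paragraph. Adjacent source lines
are joined together into one paragraph.

This is a second paragraph, because a blank line separates it.
This line ends with a backslash, so a line break is forced here.\
This text therefore starts on a new line in the rendered output.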
