#Data Science #Skills #Data Science Pipeline
What is Data Science?
Wikipedia, the mother of all encyclopedias, defines Data Science as a field focused on
extracting knowledge and insights from data by using scientific methods. What it
doesn’t tell you, however, is that we humans are born data scientists. How? Let’s see.
You’re observing the world around you no matter what you’re doing. At every waking
moment, you’re taking in details from your surroundings and feeding them to your brain. You
then process these observations into data and use that data to understand the things around
you by finding meaning in them and predicting what is likely to happen next.
When you’re an hour late leaving for work, you call in to say you’ll be working from
home. Your past observations of traffic and stoppages on the way lead you to conclude that
you’re likely to lose more time stuck in traffic than you’d gain by being in the
office. When you come into your room and see chocolate wrappers lying around, a casual
analysis will tell you that someone’s been eating your chocolates in your absence.
In either of these cases, if you do the calculations and predictions in your mind,
without noting anything down, you’re a normal human being. On the other hand, suppose you
record these data points (in a machine-readable format, of course) and then devise
an algorithm (or procedure) and write a computer program to run it. If the output of
this “hypothetical” system is that “the traffic is going to suck”, or “your roommates ate your
chocolates”, then bingo! You’re a data scientist.
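To make the analogy concrete, here is a minimal Python sketch of the traffic example. The observations, the column names, the 20-minute similarity window, and the 90-minute limit are all made-up assumptions for illustration, not real data or a recommended method.

```python
import csv
import io
from statistics import mean

# Hypothetical past observations, recorded in a machine-readable format (CSV).
# minutes_late: how late you left home; commute_minutes: how long the trip took.
OBSERVATIONS = """minutes_late,commute_minutes
0,35
10,40
30,55
60,95
75,110
"""

def load_observations(text):
    """Parse the CSV text into a list of (minutes_late, commute_minutes) pairs."""
    reader = csv.DictReader(io.StringIO(text))
    return [(int(r["minutes_late"]), int(r["commute_minutes"])) for r in reader]

def expected_commute(observations, minutes_late, window=20):
    """Average the commute times of past trips that started about as late."""
    similar = [c for late, c in observations if abs(late - minutes_late) <= window]
    # Fall back to the overall average when no similar trip was recorded.
    return mean(similar) if similar else mean(c for _, c in observations)

def should_work_from_home(observations, minutes_late, limit=90):
    """Decide to stay home when the expected commute exceeds `limit` minutes."""
    return expected_commute(observations, minutes_late) > limit

observations = load_observations(OBSERVATIONS)
print(should_work_from_home(observations, minutes_late=60))
```

The "algorithm" here is deliberately crude: it only averages similar past trips. The point is that once the observations are written down, the gut feeling becomes something a program can compute.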
It’s just as simple (in theory) as the above analogy makes it sound. At the end of the day,
you have data, procedures, algorithms, and tools. You just need to extract knowledge from
the data. To do that efficiently, there’s a workflow, or pipeline, you should follow. Let’s see
what a typical Data Science Pipeline includes.
Data Science Pipeline
The data science pipeline describes the flow of the entire process, from obtaining the desired
data to making accurate calculations and predictions. Let’s have a look at the elements of this
pipeline:
Obtain Your Data
This is, by default, the first thing you need to do to practice Data Science: get the data! Just
a little heads-up: there are some things you must take into consideration while obtaining
your data. You must first identify all of your datasets (they can come from the internet or from
internal/external databases). You should then extract the data into a usable format (CSV,
XML, JSON, etc.).
Skills Required:
Database Management: Either SQL or NoSQL, depending on your needs and requirements.
Querying these databases
Retrieving unstructured data in the form of videos, audios, texts, documents, etc.
Distributed storage and processing: Hadoop, Apache Spark, or Apache Flink.
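As a rough illustration of querying a database and extracting the result into a usable format, the sketch below uses only Python's standard library. The in-memory SQLite database, the `trips` table, and its columns are hypothetical stand-ins for whatever source you actually have.

```python
import csv
import io
import json
import sqlite3

# Build a throwaway in-memory database so the example is self-contained.
# The `trips` table and its columns are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE trips (day TEXT, minutes_late INTEGER, commute_minutes INTEGER)"
)
conn.executemany(
    "INSERT INTO trips VALUES (?, ?, ?)",
    [("Mon", 0, 35), ("Tue", 60, 95), ("Wed", 10, 40)],
)

# Query the database and keep the column names alongside the rows.
cursor = conn.execute(
    "SELECT day, minutes_late, commute_minutes FROM trips ORDER BY minutes_late"
)
columns = [d[0] for d in cursor.description]
rows = [dict(zip(columns, row)) for row in cursor.fetchall()]
conn.close()

# Export the same records as JSON ...
as_json = json.dumps(rows, indent=2)

# ... and as CSV.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=columns)
writer.writeheader()
writer.writerows(rows)
as_csv = buffer.getvalue()

print(as_csv)
```

With a real project the connection would point at your own database, and for larger data you would likely reach for a dedicated tool rather than plain lists of dictionaries.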
Scrubbing / Cleaning Your Data
Cleaning of the data should be given utmost importance because the final output of your
system is only as good as the data you put into it. Cleaning refers to removing anomalies,
filling in empty/missing values, seeing if the data is consistent, and other things of this
nature.
Skills Required:
Scripting language: Python, R, SAS
Data wrangling tools: Python Pandas, R
Distributed processing: Hadoop MapReduce, Spark
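The cleaning steps described above (making values consistent, removing anomalies, filling in missing values) can be sketched in a few lines of plain Python. The records, the 300-minute plausibility limit, and the choice of the median as the fill value are assumptions for illustration; in practice a library such as Pandas is the usual tool for this.

```python
from statistics import median

# Hypothetical raw records: one missing value, inconsistent labels, one anomaly.
raw = [
    {"day": "Mon", "commute_minutes": 35},
    {"day": "tue ", "commute_minutes": None},   # missing value
    {"day": "WED", "commute_minutes": 40},
    {"day": "Thu", "commute_minutes": 4000},    # almost certainly a typo
    {"day": "Fri", "commute_minutes": 55},
]

def clean(records, max_minutes=300):
    """Return cleaned copies of the records; the input list is left untouched."""
    # 1. Make labels consistent: trim whitespace and normalise the casing.
    cleaned = [
        {"day": r["day"].strip().capitalize(), "commute_minutes": r["commute_minutes"]}
        for r in records
    ]
    # 2. Remove anomalies: here, anything above an assumed plausibility limit.
    cleaned = [
        r for r in cleaned
        if r["commute_minutes"] is None or r["commute_minutes"] <= max_minutes
    ]
    # 3. Fill missing values with the median of the values that are present.
    known = [r["commute_minutes"] for r in cleaned if r["commute_minutes"] is not None]
    fill = median(known)
    for r in cleaned:
        if r["commute_minutes"] is None:
            r["commute_minutes"] = fill
    return cleaned

cleaned = clean(raw)
print(cleaned)
```

The order matters: the anomaly is dropped before the median is computed, so the obviously wrong value cannot distort the number used to fill the gap.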
Exploring (Exploratory Data Analysis)
Now that the data is clean, you can begin to understand what patterns it holds.
Different types of visualisations and statistical modeling come into use in this phase.
Basically, this phase aims to uncover the hidden meaning in your data.
There’s a lot that goes on in the field of Exploratory Data Analysis. If you feel it’s
something you’d enjoy, don’t forget to read our article on the topic.
To perform better in this phase, you need to have your “spidey senses” tingling. Go crazy
and spot weird patterns or trends – always be on the lookout for something out of the box.
However, while doing that, don’t forget the problem you’re aiming to solve. Don’t go too
much out of the box. Exploratory data analysis is an art, and an artist should always keep
the audience in mind.
Skills Required
Python libraries: NumPy, Matplotlib, Pandas, SciPy
R libraries: ggplot2, dplyr
Inferential statistics
Data Visualisation
Experimental design
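A first pass at exploration often starts with summary statistics and a look at how two columns move together. The sketch below does both with the standard library only; the two columns are invented example data, and Pearson correlation is just one of many measures you might pick.

```python
from math import sqrt
from statistics import mean, median, stdev

# Hypothetical cleaned data: how late you left and how long the commute took.
minutes_late = [0, 10, 30, 60, 75]
commute_minutes = [35, 40, 55, 95, 110]

def summarize(values):
    """Basic descriptive statistics for one numeric column."""
    return {
        "min": min(values),
        "max": max(values),
        "mean": mean(values),
        "median": median(values),
        "stdev": stdev(values),  # sample standard deviation
    }

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long columns."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

print(summarize(commute_minutes))
print(round(pearson(minutes_late, commute_minutes), 3))
```

A correlation close to 1 in data like this would hint that leaving later goes hand in hand with longer commutes, which is exactly the kind of pattern worth carrying into the modeling phase. It does not, on its own, say anything about cause.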
Modeling (Machine Learning)
This is the fun part. Models are simply general rules in a statistical sense. A machine
learning model is simply a tool in your toolkit. You have access to so many algorithms with
different use cases and objectives that a little research will lead you to one that
fits your business needs.
After cleaning the data and finding out the essential features (in the EDA phase), using a
statistical model as a predictive tool will enhance your overall decision making. Instead of
looking back to see “what happened?”, predictive analytics aims to answer “what next?” and
“how should we go about it?”.
Skills Required
Machine Learning: Supervised/Unsupervised/Reinforcement learning algorithms
Evaluation methods
Machine Learning Libraries: Python (scikit-learn) / R (caret)
Linear algebra & Multivariate Calculus
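To show what "using a statistical model as a predictive tool" can look like at its simplest, here is a sketch of a supervised model with an evaluation step: a straight line fitted by ordinary least squares, scored by mean absolute error on held-out data. The data points and the hold-out scheme are assumptions made for the example; a library such as scikit-learn would normally handle both fitting and evaluation.

```python
from statistics import mean

# Hypothetical (minutes_late, commute_minutes) observations.
data = [(0, 35), (5, 38), (10, 40), (20, 47), (30, 55),
        (45, 74), (60, 95), (75, 110)]

# Hold out every fourth observation for evaluation; train on the rest.
test = data[3::4]
train = [row for i, row in enumerate(data) if i % 4 != 3]

def fit_line(rows):
    """Ordinary least squares for y = slope * x + intercept."""
    xs, ys = zip(*rows)
    mx, my = mean(xs), mean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in rows)
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

def predict(model, x):
    """Apply the fitted line to a new input."""
    slope, intercept = model
    return slope * x + intercept

def mean_absolute_error(model, rows):
    """Average absolute gap between predictions and held-out truths."""
    return mean(abs(predict(model, x) - y) for x, y in rows)

model = fit_line(train)
print("predicted commute when 40 minutes late:", round(predict(model, 40), 1))
print("MAE on held-out data:", round(mean_absolute_error(model, test), 1))
```

Evaluating on observations the model never saw during fitting is what separates "what next?" from "what happened?": a model that only describes its own training data has not yet shown it can predict anything.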
Interpreting (Data Storytelling)
This is one of the more challenging tasks in the pipeline. Here, you aim to explain your
findings through communication. At the end of the day, it’s all about connecting with your
audience, and that is what makes storytelling key.
Your findings are hardly useful if you are not able to convey their significance to the non-tech
bunch at your office, or even your boss, for that matter. A good way to get things under
control is to rehearse a lot. Try framing a story around your findings and telling it to a
layman (preferably a kid). If they understand it, so will your boss. And if they don’t, well, you
know the saying often attributed to Einstein:
“If you can’t explain it to a six-year-old, you don’t understand it yourself.”
This phase aims to derive true business insights. Your main challenge here is to visualize
your findings and display them in a beautiful and understandable way.
Skills Required
Knowledge of your business domain
Data Visualisation tools: Tableau, [Link], Matplotlib, ggplot2, Seaborn, etc.
Thanks and regards
Pratap Malladi