Data Analyst Interview Questions
1. Introduce Yourself.
Ans- Python is a high-level, object-oriented, interpreted programming language. Its commonly
used data types include Integer, String, List, Tuple, Dictionary, Set, and Boolean.
Ans- A tuple is an immutable sequence type in Python, while a list is a mutable sequence
type (a dynamic array). Because a tuple's size and contents are fixed, Python can allocate
and iterate over it slightly more efficiently, so tuples are generally a bit faster than
lists; tuples can also be hashed and used as dictionary keys, which lists cannot.
There are two main goals of time series analysis: (a) identifying the nature of the
phenomenon represented by the sequence of observations, and (b) forecasting
(predicting future values of the time series variable). Both of these goals require that the
pattern of observed time series data is identified and more or less formally described.
Once the pattern is established, we can interpret and integrate it with other data (i.e., use
it in our theory of the investigated phenomenon, e.g., seasonal commodity prices).
Regardless of the depth of our understanding and the validity of our interpretation
(theory) of the phenomenon, we can extrapolate the identified pattern to predict future
events.
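The two goals above can be sketched in a few lines of Python: identify a pattern (here, a simple linear trend fitted by least squares, on made-up data) and extrapolate it to forecast future values.

```python
# Fit a linear trend y = intercept + slope * t to a series, then extrapolate.
def linear_trend(series):
    n = len(series)
    xs = range(n)
    x_mean = sum(xs) / n
    y_mean = sum(series) / n
    num = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, series))
    den = sum((x - x_mean) ** 2 for x in xs)
    slope = num / den
    return slope, y_mean - slope * x_mean

def forecast(series, steps):
    # Extend the fitted trend `steps` periods past the end of the series.
    slope, intercept = linear_trend(series)
    n = len(series)
    return [intercept + slope * (n + i) for i in range(steps)]

sales = [10, 12, 14, 16, 18]   # perfectly linear, for illustration
print(forecast(sales, 2))       # [20.0, 22.0]
```

Real series usually also carry seasonal and irregular components, which is why more formal decomposition methods are used in practice.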
Suppose you are the owner of an online grocery store selling 250 products online. A
conventional methodology has been applied to determine the price of each product, and it
is very simple: price each product at par with its market price.
You plan to leverage analytics to determine pricing to maximize the revenue earned.
Out of 100,000 people who visit your website, only 5,000 end up purchasing your products.
For all those who made a purchase, you have obtained their buying patterns, including
their average units purchased.
If you calculate the profit earned per customer who visits your portal:
Total Profit Earned : $165
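As a sketch of that per-visitor calculation: the conversion figures come from the text, but the per-order numbers below are hypothetical, since the text does not show how the $165 total was derived.

```python
# Conversion figures from the case study.
visitors = 100_000
buyers = 5_000
conversion_rate = buyers / visitors  # 0.05, i.e. 5%

# Hypothetical per-order figures (assumptions, not from the text).
avg_units_per_order = 3
profit_per_unit = 1.10  # dollars

# Expected profit attributable to each person who lands on the site.
profit_per_visitor = conversion_rate * avg_units_per_order * profit_per_unit
print(round(profit_per_visitor, 3))
```

Multiplying the per-visitor figure back up by total traffic gives the aggregate profit the case study reasons about.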
We will try to solve this case study both by a business approach and an analytical
approach.
To solve the problem without applying any technique, let’s analyze the frequency
distributions.
Profit Margin: Here is the distribution of profit margin for each of the 250 products.
Up to this point, we have three decision attributes for each product. The pricing
extremes are -10% and +20%. We apply -10% to products that can deliver a big acquisition
increment, a high volume increment, or both. For the rest, we increase the profit margin
by raising the price by +20%.
Obviously, this is a super simplistic way to solve the problem, but it puts us on track
to capture the low-hanging fruit.
Because incremental acquisition has a mean of 0.4%, we know that if we decrease the price
of products with a high incremental acquisition rate, we will see a significant increase
in overall sales, though with less impact on profit because of the lower margins.
For medium incremental acquisitions, we need to decide based on profit margins and
incremental volumes. All the cells shaded in green are the ones tagged for the -10%
price change, and the rest get the +20% price increase. The decision here is purely
intuitive, based on a basic understanding of the revenue drivers whose distributions we
saw above.
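The intuitive rule above can be sketched as a simple tagging function. The 0.4% figure is the mean incremental acquisition rate mentioned in the text; the volume threshold is a made-up value for illustration.

```python
ACQ_THRESHOLD = 0.004   # mean incremental acquisition rate from the text (0.4%)
VOL_THRESHOLD = 0.05    # hypothetical volume-increment threshold

def price_change(acq_increment, vol_increment):
    # Big acquisition or volume upside -> cut price 10%; otherwise raise 20%.
    if acq_increment > ACQ_THRESHOLD or vol_increment > VOL_THRESHOLD:
        return -0.10
    return 0.20

print(price_change(0.010, 0.00))   # high acquisition upside -> -0.1
print(price_change(0.001, 0.01))   # neither upside -> 0.2
```

In practice the thresholds would come from the observed distributions rather than being hand-picked.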
9. Describe your capstone project or in-house projects. Walk through the steps taken to
build the model and how you improved it.
10. Practice case studies that aim to improve the sales of a product in the market or to
improve the profits earned.
12. Which classification algorithm are you most comfortable with? Please explain the flow
of the algorithm with the help of a scenario or example.
13. What are the difficulties in working with regression models, and how can they be corrected?
Ans-
Multicollinearity occurs when independent variables in a regression model are
correlated. This correlation is a problem because independent variables should
be independent. If the degree of correlation between variables is high enough, it can cause
problems when you fit the model and interpret the results.
There are two basic kinds of multicollinearity: structural multicollinearity, which you
introduce yourself when you create new terms from other variables (for example, polynomial
or interaction terms), and data multicollinearity, which is present in the data itself.
The need to reduce multicollinearity depends on its severity and your primary goal for
your regression model. Keep the following three points in mind:
The severity of the problems increases with the degree of the multicollinearity. Therefore,
if you have only moderate multicollinearity, you may not need to resolve it.
Multicollinearity affects only the specific independent variables that are correlated.
Therefore, if multicollinearity is not present for the independent variables that you are
particularly interested in, you may not need to resolve it. Suppose your model contains
the experimental variables of interest and some control variables. If high multicollinearity
exists for the control variables but not the experimental variables, then you can interpret
the experimental variables without problems.
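A common way to detect multicollinearity is the variance inflation factor (VIF): regress each predictor on all the others and compute 1/(1 - R²). A minimal NumPy sketch on synthetic data follows; a VIF above roughly 5-10 is the usual rule of thumb for a problematic variable.

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X (n_samples x n_features)."""
    n, k = X.shape
    vifs = []
    for j in range(k):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])  # intercept + other predictors
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1 - (resid @ resid) / ((y - y.mean()) ** 2).sum()
        vifs.append(1.0 / max(1 - r2, 1e-12))
    return vifs

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.1, size=200)   # nearly a copy of x1
x3 = rng.normal(size=200)                   # independent predictor
X = np.column_stack([x1, x2, x3])
print([round(v, 1) for v in vif(X)])        # x1 and x2 inflate; x3 stays near 1
```

Remedies then include dropping or combining the correlated variables, centering terms for structural multicollinearity, or using regularized regression such as ridge.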
14. What are exploding gradients?
Ans- The gradient is the direction and magnitude calculated during training of a neural
network; it is used to update the network weights in the right direction and by the right
amount. "Exploding gradients are a problem where large error gradients accumulate and
result in very large updates to neural network model weights during training." At an
extreme, the values of the weights can become so large that they overflow and result in
NaN values. This makes your model unstable and unable to learn from your training data.
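A standard mitigation is gradient clipping: rescale the gradients whenever their combined norm exceeds a threshold. Here is a minimal NumPy sketch (the threshold is an arbitrary illustration; frameworks such as TensorFlow and PyTorch ship built-in equivalents like `clip_by_global_norm` and `clip_grad_norm_`).

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    # Combined L2 norm across every gradient array in the list.
    total = np.sqrt(sum(float(np.sum(g * g)) for g in grads))
    # Scale everything down only when the norm exceeds the threshold,
    # preserving the direction of the update.
    scale = max_norm / total if total > max_norm else 1.0
    return [g * scale for g in grads]

grads = [np.array([3.0, 4.0]), np.array([0.0])]
print(clip_by_global_norm(grads, 1.0))   # global norm 5.0 rescaled to 1.0
```

Other common remedies are smaller learning rates, careful weight initialization, and gated architectures such as LSTMs for recurrent networks.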
2. Load Balancer: A load balancer distributes the workload (requests) across multiple
servers or instances in a cluster. Its aim is to minimize response time and maximize
throughput by avoiding overload on any single resource. In the above image, the load
balancer is a public-facing entity that distributes all the requests from clients to
multiple Ubuntu servers in the cluster.
3. Nginx: Nginx is an open-source web server but can also be used as a load balancer.
Nginx has a reputation for its high performance and small memory footprint. It can thrive
under heavy load by spawning worker processes, each of which can handle thousands of
connections. In the image, nginx is local to one server or instance to handle all the requests
from the public facing load balancer. An alternative to nginx is Apache HTTP Server.
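As a sketch, an nginx configuration that balances requests across two local workers might look like the following; the upstream name and ports are assumptions for illustration.

```nginx
# Hypothetical upstream of two local API workers (e.g. gunicorn processes).
upstream api_backend {
    server 127.0.0.1:8000;
    server 127.0.0.1:8001;
}

server {
    listen 80;

    location / {
        # Forward every incoming request to one of the workers above
        # (round-robin by default).
        proxy_pass http://api_backend;
    }
}
```

nginx also supports other balancing strategies, such as `least_conn`, if round-robin does not fit the workload.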
Architecture Setup
By now you should be familiar with the components mentioned in the previous section.
In the following section, let's understand the setup from an API perspective, since this
forms the base for a web application as well.
Note: This architecture setup will be based on Python.
Development Setup
Train the model: The first step is to train the model for the use case using Keras,
TensorFlow, or PyTorch. Make sure you do this in a virtual environment, as it isolates
multiple Python environments and packs all the necessary dependencies into a separate
folder.
Build the API: Once the model is ready to go behind an API, you can use Flask or Django
to build it, based on the requirement. Ideally, you should build RESTful APIs, since they
help separate the client from the server; improve visibility, reliability, and
scalability; and are platform agnostic. Perform a thorough test to ensure the model
responds with the correct predictions from the API.
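A minimal Flask sketch of such a prediction endpoint is below. The model is a stand-in function; in a real API you would load the trained Keras/TensorFlow/PyTorch model at startup and call its predict method instead.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

def predict(features):
    # Hypothetical stand-in for a real trained model's inference call.
    return sum(features)

@app.route("/predict", methods=["POST"])
def predict_route():
    # Expects a JSON body such as {"features": [1, 2, 3]}.
    payload = request.get_json()
    return jsonify({"prediction": predict(payload["features"])})

if __name__ == "__main__":
    app.run(host="127.0.0.1", port=8000)
```

Loading the model once at module level, rather than per request, keeps inference latency low when this runs under gunicorn.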
Web Server: Now is the time to test the web server for the API that you have built.
Gunicorn is a good choice if you have built the APIs using Flask.
Load Balancer: You can configure nginx to distribute all the test requests across the
gunicorn workers, where each worker has its own API with the DL model. Refer to this
resource to understand the setup between nginx and gunicorn.
Load / Performance Testing: Take a stab at Apache JMeter, an open-source application
designed for load testing and performance measurement. This will also help in
understanding how nginx distributes the load. An alternative is Locust.
Production Setup
Cloud Platform: After you have chosen the cloud service, set up a machine or instance
from a standard Ubuntu image (preferably the latest LTS version). The choice of CPU
machine really depends on the DL model and the use case. Once the machine is running, set
up nginx and a Python virtual environment, install all the dependencies, and copy the
API. Finally, try running the API with the model (it might take a while to load all the
model(s), depending on the number of workers defined for gunicorn).
Custom API image: Ensure the API is working smoothly, then snapshot the instance to
create a custom image that contains the API and model(s). The snapshot will preserve all
the settings of the application. Reference: AWS, Google, and Azure.
Load Balancer: Create a load balancer from the cloud service; it can be either public or
private, based on the requirement. Reference: AWS, Google, and Azure.
Alternate platforms
There are other systems that provide a structured way to deploy and serve models in
production; a few such systems are as follows:
TensorFlow Serving: An open-source software library for serving machine learning models.
Its primary objective is the inference aspect of machine learning: taking models after
training and managing their lifetimes. It has out-of-the-box support for TensorFlow
models.