Assignment 1
Q1. Present an example where data mining is crucial to the success of a business. What
data mining functions does this business need? Can they be performed alternatively by
data query processing or simple statistical analysis?
Ans. The following example shows how data mining can be crucial to the success of a
business:
A department store can use data mining to assist with its targeted marketing mail
campaign.
Using data mining functions such as association analysis, the store can mine strong
association rules to determine which products bought by one group of customers are
likely to lead to the purchase of certain other products. With this information, the store
can then mail marketing materials only to those customers who exhibit a high likelihood
of purchasing additional products.
Data query processing is used for data or information retrieval and does not have the
means for finding association rules. Similarly, simple statistical analysis cannot handle
the large volumes of data involved, such as the customer records of a department store.
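As a rough illustration of what a query language cannot do directly, the sketch below computes the support and confidence of one candidate association rule on a tiny, made-up basket table (the product names, the data, and any thresholds are placeholders, not figures from any real store):

```python
# Minimal sketch of checking one association rule on made-up basket data.
import pandas as pd

# One-hot encoded transactions: each row is one customer basket (invented data).
baskets = pd.DataFrame(
    [[1, 1, 0, 1],
     [1, 1, 1, 0],
     [0, 1, 1, 0],
     [1, 1, 1, 1]],
    columns=["bread", "milk", "diapers", "beer"],
).astype(bool)

n = len(baskets)

# Support and confidence for the candidate rule {bread} -> {milk}.
support_bread = baskets["bread"].sum() / n
support_both = (baskets["bread"] & baskets["milk"]).sum() / n
confidence = support_both / support_bread

# A rule is "strong" if it clears minimum support and confidence thresholds.
print(f"support({{bread, milk}}) = {support_both:.2f}")
print(f"confidence(bread -> milk) = {confidence:.2f}")
```

In practice a library implementation of an algorithm such as Apriori or FP-growth would enumerate all frequent itemsets and rules rather than checking a single hand-picked rule as done here.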
Q2. Outline the major research challenges of data mining in Finance and Marketing.
Ans. Data mining is not an easy task: the algorithms involved can become very complex,
and the data is not always available in one place, so it has to be integrated from various
heterogeneous data sources. In areas such as finance and marketing, these factors give
rise to the following major research challenges:
Efficiency and scalability of data mining algorithms − To effectively extract information
from the huge amounts of data stored in databases, data mining algorithms must be
efficient and scalable.
Parallel, distributed, and incremental mining algorithms − Factors such as the huge size
of databases, the wide distribution of data, and the complexity of data mining methods
motivate the development of parallel and distributed data mining algorithms. These
algorithms divide the data into partitions, process the partitions in parallel, and then
merge the partial results. Incremental algorithms update existing mining results as the
database changes, without mining all the data again from scratch (a minimal sketch of
the partition-and-merge idea is given after this list).
Handling of relational and complex types of data − The database may contain complex
data objects, multimedia objects, spatial data, temporal data, etc. It is not feasible for a
single system to mine all these kinds of data.
Mining information from heterogeneous databases and global information systems −
The data may reside in different data sources on a LAN or WAN, and these sources may
be structured, semi-structured, or unstructured. Mining knowledge across them
therefore adds further challenges to data mining.
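As referenced above, here is a minimal sketch of the partition-and-merge idea behind parallel mining, assuming a toy setting in which the local "mining" step is simply item counting (the data and number of partitions are invented for illustration):

```python
# Partition-and-merge sketch: count item frequencies in parallel (illustrative only).
from collections import Counter
from multiprocessing import Pool

def mine_partition(partition):
    """Stand-in for a local mining step: count items in one data partition."""
    return Counter(partition)

def parallel_mine(transactions, n_partitions=4):
    # Split the data into roughly equal partitions.
    partitions = [transactions[i::n_partitions] for i in range(n_partitions)]
    # Process each partition in a separate worker process.
    with Pool(n_partitions) as pool:
        local_results = pool.map(mine_partition, partitions)
    # Merge the partial results from all partitions.
    merged = Counter()
    for result in local_results:
        merged.update(result)
    return merged

if __name__ == "__main__":
    data = ["milk", "bread", "milk", "beer", "bread", "milk"]
    print(parallel_mine(data, n_partitions=2))
```

Real parallel and distributed miners split far larger datasets across machines rather than local processes, but the split, process, and merge structure is the same.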
Q3. How can Machine Learning be enhanced to improve prediction and modelling?
Explain.
Ans. My key observation was that we humans are very biased in the kinds of information
(i.e. what we see, hear, feel, taste and smell) we pay attention to, while disregarding the
rest. That is why we don't get to see the whole picture and hence remain stuck in a
confusing state of mind, sometimes for many years, because we tend to neglect almost
all information outside the spectrum of our very subjective and limited range of sensory
perception. Since we humans tend not to be systematic in selecting data sources and
dimensions of information (i.e. feature selection), often without even being aware of it,
we need the help of less biased artificial intelligence.
The fact that we cannot perceive this information does not make it any less important or
relevant to our lives. That is what I realized when trying to gain a better understanding of
the regulation of the aging process. My problem was, and actually still is, that I could not
find the kind of datasets I would need to test new hypotheses. This implies that nobody
before me seems to have felt that collecting transcriptome, proteome, metabolome and
epigenetic data every 5 minutes throughout the entire lifespan of yeast would be worth
the effort. We are entirely capable of generating the data needed to advance our
understanding of aging and many other complex, still obscure phenomena, but the
wet-lab scientists who design our biological and medical studies do not seem to be
aware that they are missing something very important.
A good example is the magnetic field. We humans tend to ignore it because we cannot
feel it, but it nevertheless affects our lives. It can be used to treat depression. It can make
your thumbs move involuntarily. Some birds use it to navigate the globe on their
seasonal migration.
I am worried that there are other fundamental phenomena similar to the magnetic field
of which none of us is yet aware, because so far we have not tried to look for similarly
imperceptible information-carrying dimensions. For example, spiders, ants and bats are
blind; however, visible light affects their lives regardless of whether or not they have a
concept of vision, since they have never experienced it. There could be other
information-carrying dimensions that, like light for the spiders, ants and bats, are
imperatively hidden objects (IHOs), even though they affect our lives so profoundly that
we cannot understand aging and many other complex phenomena without considering
such information as well. That is why I recommend using artificial intelligence to reduce
our observational bias.
Often, scientific progress has been made by accident: a mistake changed the otherwise
constant experimental environment in such a way that an unexpected result or
observation followed, and that is what finally helped us make progress. That is why I
propose intentionally varying external experimental conditions, methods, measurements,
study designs, etc., so that new features which affect the outcome are discovered much
sooner.
Q4. What are proper machine learning algorithms to extract relationships among
variables?
Ans. Proper machine learning algorithms for extracting relationships among variables are:
1. Linear Regression
This is probably the simplest algorithm in machine learning. Regression algorithms are
used when you want to predict a continuous value, as opposed to classification, where
the output is categorical. So whenever you are asked to predict a future value of a
process that is currently running, you can go with a regression algorithm. Linear
regression is, however, unstable when features are redundant, i.e. when there is
multicollinearity.
Some examples where linear regression can be used are:
Estimating the time to travel from one location to another
Predicting sales of a particular product next month
Measuring the impact of blood alcohol content on coordination
Predicting monthly gift card sales to improve yearly revenue projections
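A minimal sketch of fitting a linear regression with scikit-learn, using synthetic data as a stand-in for any of the examples above (the sample size and noise level are arbitrary assumptions):

```python
# Linear regression sketch on synthetic data (illustrative, not a real sales dataset).
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic continuous target driven by a few features.
X, y = make_regression(n_samples=200, n_features=3, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fit the model and inspect the learned relationship between features and target.
model = LinearRegression().fit(X_train, y_train)
print("coefficients:", model.coef_)
print("R^2 on test data:", model.score(X_test, y_test))
```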
2. Logistic Regression
Logistic regression performs binary classification, so the label outputs are binary. It takes
a linear combination of the features and applies a non-linear function (the sigmoid) to it,
so it can be seen as a very small instance of a neural network.
Logistic regression provides many ways to regularize your model, and you don't have to
worry as much about your features being correlated as you do with Naive Bayes. You
also get a nice probabilistic interpretation, and you can easily update your model to take
in new data, unlike decision trees or SVMs. Use it if you want a probabilistic framework or
if you expect to receive more training data in the future that you want to be able to
incorporate quickly. Logistic regression can also help you understand the contributing
factors behind a prediction; it is not just a black-box method.
Logistic regression can be used in cases such as:
Predicting the Customer Churn
Credit Scoring & Fraud Detection
Measuring the effectiveness of marketing campaigns
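A minimal sketch of a regularized logistic regression for a binary label such as churn vs. no churn, using synthetic data as a stand-in for a real customer table:

```python
# Logistic regression sketch on synthetic binary data (e.g. churn vs. no churn).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# C controls the strength of L2 regularization (smaller C = stronger penalty).
model = LogisticRegression(C=1.0, max_iter=1000).fit(X_train, y_train)

print("accuracy:", model.score(X_test, y_test))
print("class probabilities for one customer:", model.predict_proba(X_test[:1]))
print("feature weights (contributing factors):", model.coef_)
```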
3. Decision trees
Single trees are rarely used on their own, but combined with many others they form very
effective algorithms such as Random Forest or Gradient Tree Boosting.
Decision trees easily handle feature interactions and they're non-parametric, so you don't
have to worry about outliers or whether the data is linearly separable. One disadvantage
is that they don't support online learning, so you have to rebuild your tree when new
examples come in. Another disadvantage is that they easily overfit, but that's where
ensemble methods like random forests (or boosted trees) come in. Decision trees can also
take a lot of memory (the more features you have, the deeper and larger your decision
tree is likely to be).
Trees are excellent tools for helping you to choose between several courses of action,
for example:
Investment decisions
Customer churn
Banks loan defaulters
Build vs Buy decisions
Sales lead qualifications
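A minimal sketch comparing a single decision tree with a random forest ensemble on synthetic data (the dataset and the number of trees are arbitrary choices for illustration):

```python
# Decision tree vs. random forest sketch on synthetic data (illustrative only).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A single tree tends to overfit; the forest averages many trees to reduce that.
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

print("single tree accuracy:", tree.score(X_test, y_test))
print("random forest accuracy:", forest.score(X_test, y_test))
```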
4. K-means
Sometimes you don’t know any labels and your goal is to assign labels according to the
features of objects. This is called clusterization task. Clustering algorithms can be used for
example, when there is a large group of users and you want to divide them into particular
groups based on some common attributes.
If there are questions like how is this organized or grouping something or concentrating on
particular groups etc. in your problem statement then you should go with Clustering.
The biggest disadvantage is that K-Means needs to know in advance how many clusters
there will be in your data, so this may require a lot of trials to “guess” the best K number of
clusters to define.
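A minimal sketch of K-means on synthetic user data, including a simple loop over candidate values of K (the inertia-based "elbow" check shown here is one common heuristic for guessing K, not the only one):

```python
# K-means sketch on synthetic data, trying several values of K (illustrative only).
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic user features with 3 underlying groups.
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# Inertia (within-cluster sum of squares) usually flattens out near a good K.
for k in range(2, 6):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(f"K={k}: inertia={km.inertia_:.1f}")

# Final model with the chosen K; labels_ assigns each user to a group.
final = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("cluster assignments for first 10 users:", final.labels_[:10])
```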
5. Neural networks
Neural networks learn the weights of the connections between neurons. The weights are
adjusted one training data point after another. Once all the weights are trained, the
neural network can be used to predict the class of a new input data point, or a quantity
in the case of regression. With neural networks, extremely complex models can be
trained, and they can be used as a kind of black box without performing complicated
feature engineering before training the model. Combined with the "deep" approach,
even more complex models can be learned, opening up new possibilities. For example,
object recognition has recently been greatly improved using deep neural networks.
Applied to unsupervised learning tasks such as feature extraction, deep learning can also
extract features from raw images or speech with much less human intervention.
On the other hand, neural networks are very hard to interpret, and their parameterization
is extremely complex. They are also very resource and memory intensive.
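A minimal sketch of a small feed-forward neural network classifier using scikit-learn's MLPClassifier (the layer sizes, iteration count, and synthetic dataset are arbitrary choices for illustration, not a tuned configuration):

```python
# Small feed-forward neural network sketch on synthetic data (illustrative only).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Neural networks are sensitive to feature scale, so standardize first.
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# Two hidden layers; the connection weights are adjusted iteratively during fit().
model = MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=500, random_state=0)
model.fit(X_train, y_train)

print("test accuracy:", model.score(X_test, y_test))
```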