
Unsupervised Learning

Prepared By
Archana
AP/PE
IIT(ISM), Dhanbad
Clustering in Machine Learning
• In the real world, not every dataset we work with has a target variable. Such data cannot be analyzed using supervised learning algorithms.
• Instead, we need unsupervised algorithms. One of the most popular types of analysis under unsupervised learning is cluster analysis.
• When the goal is to group similar data points in a dataset, we use cluster analysis.
What is Clustering?
• The task of grouping data points based on their similarity to each other is called Clustering or Cluster Analysis. This method falls under the branch of Unsupervised Learning, which aims at gaining insights from unlabelled data points; that is, unlike supervised learning, we do not have a target variable.
• Clustering aims at forming groups of homogeneous data points from a heterogeneous dataset. It evaluates similarity based on a metric such as Euclidean distance, cosine similarity, or Manhattan distance, and then groups the points with the highest similarity scores together.
• For example, in the graph given below, we can clearly see that 3 circular clusters form on the basis of distance.
• It is not necessary that the clusters formed are circular in shape. The shape of clusters can be arbitrary, and there are many algorithms that work well at detecting arbitrarily shaped clusters.
• For example, in the graph given below, we can see that the clusters formed are not circular in shape.
Types of Clustering
• Broadly speaking, there are 2 types of clustering that can be performed to group similar data points:
• Hard Clustering: In this type of clustering, each data point either belongs to a cluster completely or not at all. For example, say there are 4 data points to be grouped into 2 clusters; each data point will belong either to cluster 1 or to cluster 2.
• Soft Clustering: In this type of clustering, instead of assigning each data point to exactly one cluster, a probability or likelihood of the point belonging to each cluster is evaluated.
• For example, say there are 4 data points to be grouped into 2 clusters. We evaluate the probability of each data point belonging to both clusters, and this probability is calculated for all data points.
Types of Clustering Algorithms
• At the surface level, clustering helps in the analysis of unstructured data. Graphing, the shortest distance, and the density of the data points are a few of the elements that influence cluster formation.
• Clustering is the process of determining how related the objects are based on a metric called the similarity measure.
• Similarity measures are easier to construct for smaller sets of features; it gets harder to define a good similarity measure as the number of features increases.
• Depending on the type of clustering algorithm being utilized in data mining, several techniques are employed to group the data from the datasets.
Various types of clustering algorithms are:
• Centroid-based Clustering (Partitioning methods)
• Density-based Clustering (Model-based methods)
• Connectivity-based Clustering (Hierarchical clustering)
• Distribution-based Clustering
1. Centroid-based Clustering (Partitioning methods)
• Partitioning methods are the simplest clustering algorithms. They group data points on the basis of their closeness.
• Generally, the similarity measure chosen for these algorithms is Euclidean distance, Manhattan distance, or Minkowski distance.
• Euclidean distance, for example, is a simple straight-line measurement
between points and is commonly used in many applications.
• Manhattan distance, however, follows a grid-like path, much like how you'd
navigate city streets.
• Squared Euclidean distance makes calculations easier by squaring the
values, while cosine distance is handy when working with text data
because it measures the angle between data vectors.
• Picking the right distance measure really depends on what kind of problem
you’re solving and the nature of your data.
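As a quick, hedged illustration of these distance measures, the Python snippet below computes Euclidean, Manhattan, Minkowski, and cosine distances between two made-up feature vectors using SciPy:

```python
import numpy as np
from scipy.spatial import distance

# Two hypothetical feature vectors (e.g., two data points with 3 features each)
a = np.array([3.0, 9.0, 5.0])
b = np.array([8.0, 1.0, 12.0])

# Euclidean distance: straight-line distance between the points
print("Euclidean:", distance.euclidean(a, b))

# Manhattan (city-block) distance: sum of absolute coordinate differences
print("Manhattan:", distance.cityblock(a, b))

# Minkowski distance with order q=3 (q=1 gives Manhattan, q=2 gives Euclidean)
print("Minkowski (q=3):", distance.minkowski(a, b, p=3))

# Cosine distance: 1 minus the cosine of the angle between the vectors
print("Cosine:", distance.cosine(a, b))
```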
• The datasets are separated into a predetermined number of clusters, and each cluster is represented by a vector of values (its centroid).
• Each input data point is compared to these centroid vectors and joins the cluster whose centroid it is closest to.
• The primary drawback of these algorithms is the requirement that we establish the number of clusters, “k,” either intuitively or scientifically (for example, using the Elbow Method) before the clustering algorithm starts allocating the data points.
• Despite this, it is still the most popular type of clustering. K-means and K-medoids clustering are examples of this type of clustering.
2. Density-based Clustering (Model-based methods)
• Density-based clustering, a model-based method, finds groups based on the density of
data points.
• Contrary to centroid-based clustering, which requires that the number of clusters be
predefined and is sensitive to initialization, density-based clustering determines the
number of clusters automatically and is less susceptible to beginning positions.
• They are great at handling clusters of different sizes and forms, making them ideally
suited for datasets with irregularly shaped or overlapping clusters.
• These methods manage both dense and sparse data regions by focusing on local density
and can distinguish clusters with a variety of morphologies.
• In contrast, centroid-based grouping, like k-means, has trouble finding arbitrary shaped
clusters.
• Due to their requirement of a preset number of clusters and extreme sensitivity to the initial positioning of centroids, the outcomes of centroid-based methods can vary.
• Furthermore, the tendency of centroid-based approaches to produce spherical or convex clusters restricts their capacity to handle complicated or irregularly shaped clusters.
• In conclusion, density-based clustering overcomes the drawbacks of centroid-based techniques by determining the number of clusters autonomously, being resilient to initialization, and successfully capturing clusters of various sizes and forms. The most popular density-based clustering algorithm is DBSCAN, illustrated below.
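A minimal, hedged sketch of density-based clustering with scikit-learn's DBSCAN follows; the eps and min_samples values are illustrative and would normally be tuned to the dataset:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaving half-moons: an arbitrary-shaped pattern that k-means struggles with
X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)

# eps is the neighborhood radius; min_samples is the minimum points for a dense region
db = DBSCAN(eps=0.3, min_samples=5).fit(X)

# Cluster labels per point; -1 marks noise points that belong to no cluster
labels = db.labels_
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("Clusters found:", n_clusters, "| Noise points:", np.sum(labels == -1))
```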
3. Connectivity-based Clustering (Hierarchical clustering)
• A method for assembling related data points into hierarchical clusters is called
hierarchical clustering.
• Each data point is initially taken into account as a separate cluster, which is subsequently
combined with the clusters that are the most similar to form one large cluster that
contains all of the data points.
• Think about how you may arrange a collection of items based on how similar they are.
• Each object begins as its own cluster at the base of the tree when using hierarchical
clustering, which creates a dendrogram, a tree-like structure.
• The closest pairings of clusters are then combined into larger clusters after the algorithm
examines how similar the objects are to one another.
• When every object is in one cluster at the top of the tree, the merging process has
finished. Exploring various granularity levels is one of the fun things about hierarchical
clustering.
• To obtain a given number of clusters, you can select to cut the dendrogram at a particular
height. The more similar two objects are within a cluster, the closer they are. It’s
comparable to classifying items according to their family trees, where the nearest
relatives are clustered together and the wider branches signify more general
connections.
• There are 2 approaches for Hierarchical clustering:
• Divisive Clustering: It follows a top-down approach; here we consider all data points to be part of one big cluster, and then this cluster is divided into smaller groups.
• Agglomerative Clustering: It follows a bottom-up approach; here we consider each data point to be an individual cluster, and then these clusters are merged together until one big cluster contains all the data points. A brief example follows.
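The hedged sketch below applies scikit-learn's AgglomerativeClustering (the bottom-up approach) and also builds a dendrogram with SciPy so the hierarchy can be cut at a chosen height; the synthetic blob data is for illustration only:

```python
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

# Synthetic data with three loose groups
X, _ = make_blobs(n_samples=60, centers=3, random_state=7)

# Agglomerative (bottom-up) clustering with Ward linkage, cut into 3 clusters
agg = AgglomerativeClustering(n_clusters=3, linkage="ward")
labels = agg.fit_predict(X)
print("First ten cluster labels:", labels[:10])

# Dendrogram: each leaf starts as its own cluster; merges move up the tree
Z = linkage(X, method="ward")
dendrogram(Z)
plt.xlabel("Data points")
plt.ylabel("Merge distance")
plt.show()
```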
4. Distribution-based Clustering
• In distribution-based clustering, data points are grouped according to their propensity to fall into the same probability distribution (such as a Gaussian, binomial, or other distribution) within the data.
• The data elements are grouped using a probability-based distribution that is based on statistical distributions. Data objects that have a higher likelihood of belonging to a cluster are included in it.
• Every cluster has a central point, and the further a data point is from that central point, the less likely it is to be included in the cluster.
• A notable drawback of density- and boundary-based approaches is that some algorithms require the number of clusters to be specified a priori, and most algorithms require the form of the clusters to be defined.
• At least one tuning hyper-parameter must be selected, and while doing so should be simple, getting it wrong could have unanticipated repercussions. Distribution-based clustering has a definite advantage over proximity- and centroid-based clustering approaches in terms of flexibility, accuracy, and cluster structure.
• The key issue is that, in order to avoid overfitting, many of these clustering methods only work well with simulated or manufactured data, or when the bulk of the data points certainly belong to a preset distribution. The most popular distribution-based clustering algorithm is the Gaussian Mixture Model, sketched below.
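A brief, hedged sketch of distribution-based clustering with scikit-learn's GaussianMixture follows; the number of components and the synthetic data are placeholders:

```python
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Synthetic data drawn from three Gaussian-like blobs
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=[1.0, 2.0, 0.5], random_state=0)

# Fit a mixture of 3 Gaussians; each cluster is described by a mean and covariance
gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=0)
gmm.fit(X)

# Soft clustering: probability of each point belonging to each component
probs = gmm.predict_proba(X)
print("Membership probabilities of the first point:", probs[0].round(3))

# Hard assignment: the most likely component for each point
labels = gmm.predict(X)
print("First ten labels:", labels[:10])
```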
Advantages of K-means
1.Simple and easy to implement: The k-means algorithm is easy to
understand and implement, making it a popular choice for clustering
tasks.
2.Fast and efficient: K-means is computationally efficient and can
handle large datasets with high dimensionality.
3.Scalability: K-means can handle large datasets with many data points
and can be easily scaled to handle even larger datasets.
4.Flexibility: K-means can be easily adapted to different applications
and can be used with varying metrics of distance and initialization
methods.
Disadvantages of K-Means
1.Sensitivity to initial centroids: K-means is sensitive to the initial
selection of centroids and can converge to a suboptimal solution.
2.Requires specifying the number of clusters: The number of clusters k
needs to be specified before running the algorithm, which can be
challenging in some applications.
3.Sensitive to outliers: K-means is sensitive to outliers, which can have a
significant impact on the resulting clusters.
Different Evaluation Metrics for Clustering
• When it comes to evaluating how well your clustering algorithm is
working, there are a few key metrics that can help you get a clearer
picture of your results. Here’s a rundown of the most useful ones:
Silhouette Analysis
• Silhouette analysis is like a report card for your clusters. It measures
how well each data point fits into its own cluster compared to other
clusters.
• A high silhouette score means that your points are snugly fitting into
their clusters and are quite distinct from points in other clusters.
• Imagine a score close to 1 as a sign that your clusters are well-defined
and separated.
• Conversely, a score close to 0 indicates some overlap, and a negative
score suggests that the clustering might need some work.
Inertia
• Inertia is a bit like a gauge of how tightly packed your data points are within each
cluster.
• It calculates the sum of squared distances from each point to the cluster's center
(or centroid).
• Think of it as measuring how snugly the points are huddled together. Lower inertia
means that points are closer to the centroid and to each other, which generally
indicates that your clusters are well-formed.
• For most numeric data, you'll use Euclidean distance, but if your data includes
categorical features, Manhattan distance might be better.
Dunn Index
• The Dunn Index takes a broader view by considering both the distance within and
between clusters. It’s calculated as the ratio of the smallest distance between any
two clusters (inter-cluster distance) to the largest distance within a cluster (intra-
cluster distance).
• A higher Dunn Index means that clusters are not only tight and cohesive internally
but also well-separated from each other.
• In other words, you want your clusters to be as far apart as possible while being as
compact as possible.
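As a hedged illustration of these metrics, the snippet below computes the silhouette score and inertia for a k-means result on synthetic data; the Dunn Index has no built-in scikit-learn function, so it is only noted in a comment. The data and the choice of k are illustrative:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with four reasonably separated groups
X, _ = make_blobs(n_samples=500, centers=4, random_state=42)

# Fit k-means with an illustrative choice of k=4
km = KMeans(n_clusters=4, n_init=10, random_state=42).fit(X)

# Silhouette score: close to 1 means tight, well-separated clusters
print("Silhouette score:", round(silhouette_score(X, km.labels_), 3))

# Inertia: within-cluster sum of squared distances to the centroids (lower is better)
print("Inertia:", round(km.inertia_, 1))

# Note: the Dunn Index (min inter-cluster distance / max intra-cluster distance)
# is not provided by scikit-learn and would need to be computed separately.
```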
How Does K-Means Clustering Work?
• The flowchart below shows how k-means clustering works:
• The goal of the K-Means algorithm is to find clusters in the given input
data. There are a couple of ways to accomplish this.
• We can use the trial and error method by specifying the value of K
(e.g., 3,4, 5). As we progress, we keep changing the value until we get
the best clusters.
• Another method is to use the Elbow technique to determine the value
of K.
• Once we get the K's value, the system will assign that many centroids
randomly and measure the distance of each of the data points from
these centroids.
• Accordingly, it assigns those points to the corresponding centroid from
which the distance is minimum.
• So each data point will be assigned to the centroid, which is closest to
it. Thereby we have a K number of initial clusters.
• It calculates the new centroid position for the newly formed clusters.
The centroid's position moves compared to the randomly allocated
one.
• Once again, the distance of each point is measured from this new
centroid point. If required, the data points are relocated to the new
centroids, and the mean position or the new centroid is calculated
once again.
• If the centroids move, the iterations continue, indicating that the algorithm has not yet converged. Once the centroids stop moving (which means that the clustering process has converged), the final result is produced.
Visualization example to understand this better:
• We have a data set for a grocery shop, and we want to find out how many clusters the data should be spread across. To find the optimum number of clusters, we break the process down into the following steps:
• Step 1:
• The Elbow method is the best way to find the number of clusters. The elbow method consists of running K-Means clustering on the dataset for a range of K values.
• Next, we use the within-sum-of-squares as a measure to find the optimum number of clusters for a given data set. The within sum of squares (WSS) is defined as the sum of the squared distances between each member of a cluster and its centroid.
• The WSS is measured for each value of K. The value of K at the elbow point, beyond which WSS decreases only marginally, is taken as the optimum value.
• Now, we draw a curve between WSS and the number of clusters.

Here, WSS is on the y-axis and number of clusters on the x-axis.


You can see that there is only a very gradual change in the value of WSS as the K value increases beyond 2.
So, you can take the elbow point value as the optimal value of K: it should be either two, three, or at most four. Beyond that, increasing the number of clusters does not dramatically change the WSS; it stabilizes.
Step 2:
• Let's assume that these are our delivery points:
Step 3:
• Now the distance of each location from the centroid is measured, and
each data point is assigned to the centroid, which is closest to it.
• This is how the initial grouping is done:
Step 4:
• Compute the actual centroid of data points for the first group.
Step 5:
• Reposition the random centroid to the actual centroid.
Step 6:
• Compute the actual centroid of data points for the second group.
Step 7:
• Reposition the random centroid to the actual centroid.
Step 8:
• Once the cluster becomes static, the k-means algorithm is said to be
converged.
• The final cluster with centroids c1 and c2 is as shown below:
K-Means Clustering Algorithm
• Let's say we have x1, x2, x3……… x(n) as our inputs, and we want to
split this into K clusters.
• The steps to form clusters are:
• Step 1: Choose K random points as cluster centers, called centroids.
• Step 2: Assign each x(i) to the closest cluster by computing the Euclidean distance from the point to each centroid.
• Step 3: Identify new centroids by taking the average of the points assigned to each cluster.
• Step 4: Keep repeating Step 2 and Step 3 until convergence is achieved. A minimal sketch of these steps follows.
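A minimal, from-scratch NumPy sketch of the four steps above (not the scikit-learn implementation used later in this material); variable names and the convergence tolerance are illustrative:

```python
import numpy as np

def kmeans(X, k, max_iter=100, tol=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: choose K random data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign each point to the closest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its assigned points
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4: stop when the centroids no longer move appreciably
        if np.linalg.norm(new_centroids - centroids) < tol:
            break
        centroids = new_centroids
    return centroids, labels

# Illustrative usage on random 2-D data
X = np.random.default_rng(1).normal(size=(200, 2))
centroids, labels = kmeans(X, k=3)
print(centroids)
```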
• Unsupervised Machine Learning (ML) algorithms can be powerfully applied in clustering analysis.
• There are different types of clustering algorithms that are most commonly used. The idea behind
using a clustering algorithm is to cluster or partition the data.
• For instance, if dividing a producing gas well into its loaded versus unloaded condition is the
desirable outcome, a clustering algorithm can be used to make this division.
• One example of using an unsupervised clustering technique in the O&G industry is type curve
clustering.
• E&P operators use their knowledge of the area such as geologic features, production
performance, BTU content, proximity to pipeline, etc., to define each formation’s type curve area
and boundaries.
• As can be imagined, this process can be very difficult to comprehend, considering all the features
that would affect the outcome.
• Therefore, the power of data and unsupervised ML algorithms can be used to cluster like-to-like
or similar areas together.
• All of the aforementioned features can be used as the input features of the unsupervised ML
model, and the clustering algorithm will indicate to which cluster each row of data will belong.
• Another application of unsupervised algorithms in the O&G industry is lithology classification. Both supervised and unsupervised ML models can be used for lithology classification.
• If unsupervised ML models are used for this purpose, the idea is to automate the process of identifying geologic formations such as sandstone, limestone, shale, etc., using an automated technique as opposed to having a geologist manually go through the process.
K-means clustering
• K-means is a powerful yet simple unsupervised ML algorithm used to
cluster the data into various groups.
• K-means clustering can also be used for outlier detection. One of the
biggest challenges of k-means clustering is identifying the optimum
number of clusters that must be used.
• Domain expertise plays a key role in determining the number of
clusters.
• For example, if the intent is to cluster the data into loaded versus
unloaded conditions for dry gas wells, the domain expertise instructs
one to use two clusters.
• On the other hand, if clustering is to be used to define type curve regions, various numbers of clusters should be applied and the results visualized before determining the number of clusters needed to define the type curve boundaries.
• If a problem is highly complex and determining the number of clusters is simply not feasible, the "elbow" method can be used to get some understanding of the potential number of clusters that will be needed.
• In this approach, the k-means algorithm is run multiple times with various numbers of clusters (2 clusters, 3 clusters, etc.).
• Afterward, plot "number of clusters" on the x-axis versus "within cluster
sum of squared errors" on the y-axis until an elbow point is observed.
• Increasing the number of clusters beyond the elbow point will not tangibly
improve the result of the k-means algorithm.
• Fig. 4.1 illustrates the elbow point or the optimum number of clusters that
occurred with 4 clusters.
• This method provides some insight into the number of clusters to choose, but it is crucial to also test higher numbers of clusters in the event that more granularity is needed.
• On the other hand, if your particular domain expertise indicates that fewer clusters are needed, the domain expertise will supersede the elbow point methodology.
• The notion of the elbow point technique is to provide guidance when
unsure of the number of clusters that will be needed, and it is not to
provide an exact solution to a problem.
• Please note that the k-means algorithm’s goal is to select centroids that minimize "inertia." Another term used in lieu of "inertia" is "within cluster sum of squared errors," and for cluster j it is defined as:

Inertia_j = Σ ||x_i − μ_j||², where the sum runs over all instances x_i assigned to cluster j.

• x_i is referred to as the ith instance in cluster j, and μ_j is referred to as the mean of the samples, or "centroid," of cluster j.
• Inertia essentially measures how internally consistent clusters are.
• Please note that a lower inertia number is desired. 0 is optimal;
however, in a high-dimensional problem, inertia could be high.
• After applying k-means clustering, visualization is the key to making
sure the desired clustering outcome is achieved.
• Another useful approach in using k-means clustering is clustering
(labeling) or partitioning the data prior to feeding the labeled data as
the output of a supervised ML algorithm.
How does K-means clustering work?
Understanding k-means clustering is straightforward when it is broken down into a step-by-step procedure. Let's go over the steps for applying k-means clustering to any data:
1) Standardize the data since similarities between features based on
distance measures are the key in k-means clustering. Hence, having
different scales can skew the result in favor of features with larger values.
2) Determine the number of clusters, using (i) elbow method, (ii) silhouette,
or (iii) hierarchical clustering. If unsure, write a "for loop" to calculate the
sum of squared errors versus number of clusters. Afterward, determine
the number of clusters that will be used.
3) The initialization of centroids within a data set can be either initialized
randomly or selected purposefully.
4) The default initialization method for most open-source ML software
including Python’s scikit learn library is random initialization. If random
initialization does not work, carefully selecting the initial centroids could
potentially help the model.
5) Next, find the distance between each data point (instance) and the randomly selected (or carefully selected) cluster centroids. Afterward, assign each data point to the closest cluster centroid based on the distance calculations (Euclidean distance is commonly used) presented below.
For example, if two centroids have been randomly selected within a data set,
the model will calculate the distance from each data point to centroid #1 and
#2.
In this case, each data point will be clustered under either centroid #1 or #2
based on the distance from each data point to randomly initialized cluster
centroids.
There are various ways to calculate the distance. This step can be
summarized as assigning each data point to each cluster centroid based on
their distance.
The ones closest to centroid #1 will be assigned to centroid #1 and the ones
closest to centroid #2 will be assigned to centroid #2. The following distance
functions are commonly used, the most common being the euclidean
distance function:
• Assuming each instance has n features, the distance of the ith instance (x_i) to the centroid of cluster j (μ_j) can be calculated as:

d(x_i, μ_j) = ( Σ_{k=1}^{n} |x_{i,k} − μ_{j,k}|^q )^(1/q)

• q represents the order of the norm.
• The case where q is equal to 1 represents Manhattan distance, and the case where q is equal to 2 represents Euclidean distance.
• To illustrate the Euclidean distance calculation for a data set with 3 features, let’s apply the Euclidean distance to the two vectors (3, 9, 5) and (8, 1, 12):

d = √((3 − 8)² + (9 − 1)² + (5 − 12)²) = √(25 + 64 + 49) = √138 ≈ 11.75
6) Afterward, find the average value of the instances assigned to each cluster centroid (as assigned in step 5) and recalculate a new centroid for each cluster by moving the cluster centroid to the mean of the instances in that cluster.
7) Since new centroids have been created in step 6, reassign each data point to the newly generated centroids based on one of the distance functions.
8) Steps 6 and 7 are repeated until the model converges. This indicates that additional iterations will not lead to significant changes in the final centroid selection. In other words, the cluster centroids will not move any further.
• K-means is very sensitive to outlier points. Therefore, before applying
k-means clustering, make sure to investigate the outliers.
• This goes back to one of the first steps in applying any ML algorithm
which was data visualization.
• If the outliers are invalid, make sure to remove them prior to using k-
means algorithm.
• K-means requires the number of clusters to be defined. This could
also be classified as a disadvantage.
• To visualize the distribution of each parameter, use the code below
and change the column name to plot each feature.
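The original code figure is not reproduced here; the hedged sketch below shows an equivalent distribution plot with seaborn. The data frame name "df" and the column name 'Gamma Ray, API' are assumed placeholders for the geologic data set used in this example:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Plot the distribution of one parameter; change the column name to inspect other features.
# 'df' and the column name are assumed placeholders for the geologic data set.
sns.histplot(df['Gamma Ray, API'], kde=True)
plt.xlabel('Gamma Ray, API')
plt.ylabel('Count')
plt.show()
```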
• Next, let’s plot a heat map of all parameters versus one another to
find potential collinear features.
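A hedged sketch of the heat map step, again assuming the data frame is named "df":

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Pearson correlation matrix of all features versus one another
corr = df.corr()

# Annotated heat map to spot collinear features (e.g., TOC vs. bulk density)
plt.figure(figsize=(8, 6))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1)
plt.title("Pearson correlation heat map")
plt.show()
```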
• As shown in Fig. 4.5, TOC and bulk density have a Pearson correlation coefficient of −0.99.
• This makes sense, as TOC is a calculated feature derived from bulk density. Therefore, let’s use the lines of code below to remove TOC from the analysis, because TOC and bulk density would provide the same information when clustering.
• The lines of code below will permanently drop TOC from the "df" data frame.
• Please remember to use "inplace=True" when the column removal is intended to be permanent.
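A hedged reconstruction of the drop step; the exact TOC column label in the original data set is an assumption:

```python
# Permanently remove the TOC column from the df data frame.
# The column name 'TOC' is an assumed placeholder for the actual label in the data set.
df.drop(['TOC'], axis=1, inplace=True)

# Confirm that TOC is gone
print(df.columns.tolist())
```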
• Next, let’s import the StandardScaler library and standardize the data
prior to feeding the data into the k-means algorithm:
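A hedged sketch of the standardization step; the variable name "df_scaled" follows the naming used later in the text:

```python
from sklearn.preprocessing import StandardScaler

# Standardize each feature to zero mean and unit variance so that no
# feature dominates the distance calculations in k-means.
scaler = StandardScaler()
df_scaled = scaler.fit_transform(df)  # returns a NumPy array
```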
• Next, let’s import the k-means library and write a for loop for
calculating within cluster sum of squared errors.
• Afterward, let’s use the matplotlib library to plot number of clusters
on the x-axis versus within cluster sum of squared errors:
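The original code figure is not reproduced; below is a hedged reconstruction that matches the arguments described in the bullets that follow (k-means++ initialization, max_iter, random_state=1000). The max_iter value of 500 is assumed to match the later 10-cluster run, and the "n_jobs" argument discussed below existed in older scikit-learn releases but has since been removed, so it is shown commented out:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

distortions = []  # within-cluster sum of squared errors (inertia) per k

for i in range(1, 21):
    km = KMeans(
        n_clusters=i,          # number of clusters to try
        init='k-means++',      # smart initialization to speed up convergence
        max_iter=500,          # maximum iterations for a single run
        random_state=1000,     # repeatable centroid initialization
        # n_jobs=-1            # only available in older scikit-learn releases
    )
    km.fit(df_scaled)
    distortions.append(km.inertia_)

plt.plot(range(1, 21), distortions, marker='o')
plt.xlabel('Number of clusters')
plt.ylabel('Within cluster sum of squared errors')
plt.show()
```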
• n_clusters represents the number of clusters that is desirable to be chosen when
applying k-means clustering.
• In this example, since the goal is to plot various number of clusters from 1 to 20,
n_clusters was set to "i" and "i" is defined as a range between 1 and 21.
• The term "init" refers to a method for initialization that can be set to "random"
that will randomly initialize the centroids.
• A more desirable approach that was used in this example is called "k-meansþþ"
which, according to the scikit library definition, refers to selecting the initial
cluster centers in an intelligent way to speed up convergence.
• Using "k-meansþþ" initializes the centroids to be far from one another which
could potentially lead to better results than random initialization (Clustering,
n.d.).
• "max_iter" refers to the maximum number of iterations for a single run.
"random_state" was set to 1000 to have a repeatable approach in determining
the random number generation for centroid initialization.
• In other words, the outcome of k-means will be the same if run again, since the
same seed number is being used to generate the random number. The default
value for "n_jobs" is 1 which means 1 processor will be used when k-means
clustering is performed and run.
• If -1 is chosen, all available processors will be used.
• Please note that choosing "n_jobs=1" could result in CPU hogging by
Python and other tasks will be less responsive as a result; therefore,
determine the number of processors that your computer can handle
and choose the n_jobs accordingly.
• In the for loop above, an instance of the k-means clustering class is created with the defined arguments and assigned to the variable "km".
• Afterward, the method "km.fit" was called with the argument "df_scaled" (which is the standardized data).
• Next, the inertia results were appended to the empty list called "distortions" that was initially defined.
• The rest of the code is simply plotting x and y in a line plot. To get a
list of inertia numbers, simply type "print(distortions)".
• The purpose of the elbow technique is to find the number of desired clusters to choose.
• In this synthetically generated geologic data set, there is no prior knowledge on the number of clusters to
choose. Therefore, from the elbow point shown in Fig. 4.7, 10 clusters were chosen.
• As illustrated from the distortions, as the number of clusters increases, the difference in inertia value between
the current and prior cluster point decreases.
• Please note that 10 clusters is not an exact solution, and more or fewer clusters could be selected to see the cluster distribution across these wells as a function of the number of clusters.
• The next step is to assume 10 clusters and proceed with the next phase
of labeling the data set and obtaining the centroids for each cluster.
• In the code below, the same assumptions (random_state=1000, init='k-means++', n_init=1000, max_iter=500) were used with 10 clusters.
• Afterward, "kmeans.cluster_centers_" is used to obtain the cluster
centroids for each of the 10 clusters and 6 geologic features. Note that
these centroids are the standardized version and it must be converted
back to its original form using an inverse transform to make sense.
• The next step is to obtain the labels for each well. Simply call
"kmeans.labels_" as follows

• Next, let’s convert "df_scaled" from an array to a data frame and add
the labeled clusters per well to that data frame as follows:
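A hedged sketch of that conversion; the label column name 'Cluster' is an assumption:

```python
import pandas as pd

# Convert the standardized array back to a data frame, reusing the original column names
df_scaled = pd.DataFrame(df_scaled, columns=df.columns)

# Attach the k-means cluster label for each well ('Cluster' is an assumed column name)
df_scaled['Cluster'] = kmeans.labels_
print(df_scaled.head())
```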
• Next, let’s return the data to its original (unstandardized form) by
multiplying each variable by the standard deviation of that variable
and adding the mean of that variable as illustrated below.
• Note that "scaler.inverse_transform()" in scikit-learn could have also
been used to transform the data back to its original form.
• Please ensure the codes listed below are continuous when you
replicate them in Jupyter Notebook.
• For example, "df_scaled['Water Saturation, fraction']" is split into two
lines in the code shown below due to space limitation. Therefore,
ensure to have continuous code lines to avoid getting an error.
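A hedged sketch of the back-transformation for the 'Water Saturation, fraction' column named in the text, with the scaler.inverse_transform() shortcut shown as a commented alternative (pandas is assumed to be imported as pd from the earlier steps):

```python
# Option 1: manual back-transformation, one feature at a time. Multiply by the
# original standard deviation and add the original mean (repeat for each feature).
df_scaled['Water Saturation, fraction'] = (
    df_scaled['Water Saturation, fraction'] * df['Water Saturation, fraction'].std()
    + df['Water Saturation, fraction'].mean()
)

# Option 2 (equivalent shortcut): scaler.inverse_transform() restores all features
# at once; the 'Cluster' label column is excluded before the transform.
# df_unscaled = pd.DataFrame(
#     scaler.inverse_transform(df_scaled[df.columns]), columns=df.columns
# )
# df_unscaled['Cluster'] = df_scaled['Cluster'].values
```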
• As illustrated in Fig. 4.12, each cluster centroid represents the average value of each feature for the wells in that cluster.
• For example, cluster 3 (since indexing starts with 0) has an average GR of 154.422 API, a bulk density of 2.238 g/cc, a resistivity of 15.845 ohm-m, a water saturation of 18.3627%, a Phi*H of 20.907 ft, and a TVD of 9672.233 ft.
• The next step is to understand the number of wells per cluster. Let’s use the following lines of code to obtain it:
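A hedged one-liner for the counts, assuming the label column added earlier is named 'Cluster':

```python
# Number of wells assigned to each of the 10 clusters
print(df_scaled['Cluster'].value_counts().sort_index())
```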
• The last step in type curve clustering is to plot these wells based on
their latitude and longitude on a map to evaluate the clustering
outcome.
• In addition, the domain expertise plays a key role in determining the
optimum number of clusters to successfully define the type curve
regions/boundaries.
• For example, if there are currently 10 type curve regions within your company’s acreage position, 10 clusters can be used as a starting point to evaluate k-means clustering’s outcome.
• For this synthetic data set, the last step of plotting and evaluating the
clustering’s outcome is ignored. However, please make sure to always
visualize the clustering outcome and adjust the selected number of
clusters accordingly.
