📊 PYTHON + AI TIP
🧮 How Does the Machine Learn by Proximity – Mathematics and Implementation of KNN (K Nearest Neighbors)


📰 Edition #52 — PYTHON + AI TIP - How Does the Machine Learn by Proximity – Mathematics and Implementation of KNN (K Nearest Neighbors)


🎯 1. OBJECTIVE

Understand how the KNN algorithm classifies new data points by comparing a manual implementation of the formula with the scikit-learn library function, including:

  • Mathematical formula with variable explanations
  • Python function and explanation
  • Machine learning process step-by-step
  • Realistic practical application


🧠 2. CONCEPT

KNN (K Nearest Neighbors) is a supervised learning algorithm based on geometric proximity. Its principle is simple:

  • When receiving a new data point, KNN does not perform model fitting or prior training.
  • It stores the training data and, at prediction time, calculates the distance between the new point and all known points.
  • Then, it selects the k closest points (neighbors) and determines the most frequent class among them to assign to the new point.

📍 Conceptual summary: ➔ KNN classifies by spatial similarity, working like a “memory consultant” that decides based on proximity to past examples.


🗂️ 3. REAL CASE STUDY SCENARIO

Imagine you work in a retail store segmenting customers by:

  • Spending Score (monthly spending)
  • Visit Frequency (visits per month)

Your goal is to classify a new customer to recommend targeted promotions.


📝 4. MATHEMATICAL FORMULA AND VARIABLES

🔢 Euclidean Distance Formula


dist(p₁, p₂) = √((x₁ − x₂)² + (y₁ − y₂)²)

➔ Variable explanations:

  • x₁, y₁: Coordinates of the new customer
  • x₂, y₂: Coordinates of a training customer
  • dist: Euclidean distance between p₁ and p₂

📍 Interpretation: Measures geometric proximity, the foundation of KNN decisions.
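
➔ Quick worked example (a minimal sketch using the same customer coordinates that appear in the script of Section 6: the new customer at (20, 4) and the training customer at (18, 3)):

import numpy as np

p1 = np.array([20, 4])   # new customer: [Spending Score, Visit Frequency]
p2 = np.array([18, 3])   # one training customer
dist = np.sqrt(np.sum((p1 - p2) ** 2))
print(dist)  # sqrt((20-18)² + (4-3)²) = sqrt(5) ≈ 2.236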


🛠️ 5. PYTHON FUNCTION THAT AUTOMATES THE FORMULA

🔧 Library: sklearn.neighbors.KNeighborsClassifier

➔ What it does:

  • Implements KNN efficiently
  • Stores data with .fit()
  • Calculates distances and performs voting with .predict()

➔ Why use it:

  • Avoids manual calculations
  • Scales for large datasets
  • Produces standardized, reliable results
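
➔ Beyond n_neighbors, the classifier exposes a few commonly used options. A minimal sketch (the parameter values below are illustrative, not tuned for this dataset):

from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(
    n_neighbors=5,        # k: how many neighbors vote
    weights='distance',   # closer neighbors count more ('uniform' gives equal votes)
    metric='minkowski',   # distance metric; with p=2 this is the Euclidean distance
    p=2
)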


💻 6. COMPLETE PYTHON SCRIPT – MANUAL VS BUILT-IN FUNCTION

# 🧠 KNN: Manual calculation vs sklearn function implementation
import numpy as np
from collections import Counter
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier


# 🔢 Training data (customer segments)
# Each row represents [Spending Score, Visit Frequency]
X_train = np.array([[15, 2], [18, 3], [21, 3], [30, 8], [35, 10]])
y_train = np.array(['Low', 'Low', 'Low', 'High', 'High'])

# 🎯 New customer to classify
x_new = np.array([[20, 4]]) 

# ⚙️ Manual Euclidean distance function
def euclidean_distance(p1, p2):
    # Calculates the geometric distance between two points
    return np.sqrt(np.sum((p1 - p2) ** 2))
 
# 🧠 Manual KNN implementation
def knn_predict(x_new, X_train, y_train, k=3):
    # 1. Calculate distances from x_new to each training point
    distances = [euclidean_distance(x_new[0], x) for x in X_train]
   
    # 2. Sort distances and get indices of k nearest neighbors
    k_indices = np.argsort(distances)[:k]

    # 3. Retrieve the labels of the k nearest neighbors
    k_labels = y_train[k_indices]
   
    # 4. Perform majority voting to predict the class
    return Counter(k_labels).most_common(1)[0][0] 

# 🔍 Prediction using manual KNN
pred_manual = knn_predict(x_new, X_train, y_train, k=3)
print(f"[Manual KNN] Predicted class: {pred_manual}") 

# 📊 Plotting the manual method result
plt.scatter(X_train[:,0], X_train[:,1],
            c=['blue' if label=='Low' else 'red' for label in y_train],
            label='Training Data')
plt.scatter(x_new[:,0], x_new[:,1],
            c='green', marker='*', s=200, label='New Customer')
plt.title('Manual KNN Prediction')
plt.xlabel('Spending Score')
plt.ylabel('Visit Frequency')
plt.legend()
plt.show() 

# 🛠️ Using sklearn KNeighborsClassifier implementation
knn = KNeighborsClassifier(n_neighbors=3) 

# Fitting the model with training data
knn.fit(X_train, y_train)

# Predicting the class of the new customer
pred_sklearn = knn.predict(x_new)
print(f"[Sklearn KNN] Predicted class: {pred_sklearn[0]}")

# 📊 Plotting the sklearn method result
plt.scatter(X_train[:,0], X_train[:,1],
            c=['blue' if label=='Low' else 'red' for label in y_train],
            label='Training Data')
plt.scatter(x_new[:,0], x_new[:,1],
            c='purple', marker='*', s=200, label='New Customer')
plt.title('Sklearn KNN Prediction')
plt.xlabel('Spending Score')
plt.ylabel('Visit Frequency')
plt.legend()
plt.show()        

🧠 7. DETAILED MACHINE LEARNING PROCESS EXPLANATION

🔍 Line-by-line explanation:

➔ 7.1 distances = [euclidean_distance(x_new[0], x) for x in X_train]

Calculates distances from the new point to all training points (magic moment: determines proximity).

➔ 7.2 k_indices = np.argsort(distances)[:k]

Sorts distances and selects indices of k nearest neighbors (defines who votes).

➔ 7.3 k_labels = y_train[k_indices]

Extracts class labels of neighbors (prepares for final decision).

➔ 7.4 Counter(k_labels).most_common(1)[0][0]

Performs majority vote (final decision step).

➔ 7.5 sklearn .fit() and .predict()

Automates the entire process (magic: automated proximity + voting).


🧩 8. MOMENT OF LEARNING – Where the AI Actually “Decides”

The learning in KNN occurs at prediction time, unlike traditional models that perform training to adjust weights and minimize loss functions. In KNN:

  • There is no formal training phase.
  • The model simply stores the training data in memory and uses this historical dataset to classify new points.

🔍 Code lines evidencing the learning moment

 Line 1 – Distance Calculation

 distances = [euclidean_distance(x_new[0], x) for x in X_train]

✔ What it does: Calculates the Euclidean distance from the new customer to each training point.
✔ Why it matters: This is the first decision step. Here, the model “observes” how close each example is to the new data point.
✔ Technical concept: Geometric proximity. KNN depends entirely on the chosen distance metric (e.g., Euclidean, Manhattan) to base its classification.
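
In the manual version, experimenting with a different metric only means swapping the distance function. A minimal sketch of a Manhattan alternative (an illustration, not part of the original script):

import numpy as np

def manhattan_distance(p1, p2):
    # Sum of absolute coordinate differences (L1 norm) instead of the Euclidean L2 norm
    return np.sum(np.abs(p1 - p2))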

 Line 2 – Selecting the k Nearest Neighbors

 k_indices = np.argsort(distances)[:k]

✔ What it does: Sorts distances in ascending order and selects the indices of the k closest neighbors.
✔ Why it matters: Defines who will vote in the final classification.
✔ Technical concept: This step determines the local decision space – the sample is classified based only on its nearest neighbors, not the entire dataset.
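
To make the role of np.argsort concrete, a tiny standalone example with hypothetical distances:

import numpy as np

distances = [4.1, 1.0, 2.2, 7.5, 0.5]   # hypothetical distances to 5 training points
print(np.argsort(distances)[:3])        # -> [4 1 2]: indices of the 3 closest points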

 Line 3 – Extracting Neighbor Classes

 k_labels = y_train[k_indices]

✔ What it does: Retrieves the labels (classes) of the k selected neighbors.
✔ Why it matters: Forms the basis for the majority vote.
✔ Technical concept: The model uses memory from the training examples, reinforcing KNN’s definition as an instance-based learning model.

 Line 4 – Majority Voting (Final Decision)

 return Counter(k_labels).most_common(1)[0][0]

✔ What it does: Counts the frequency of each class among the neighbors and returns the most common one.
✔ Why it matters: This is the “magic moment” of KNN learning, where the actual classification decision is made.
✔ Technical concept: KNN’s learning is not about parameter optimization but about efficiently querying stored data and deciding by similarity.
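
The voting step is easy to verify in isolation. For example, with a hypothetical set of neighbor labels:

from collections import Counter

k_labels = ['Low', 'High', 'Low']              # hypothetical labels of the 3 nearest neighbors
print(Counter(k_labels).most_common(1))        # -> [('Low', 2)]
print(Counter(k_labels).most_common(1)[0][0])  # -> 'Low' (the majority class)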

 🧠 Conceptual Summary

 🔹 KNN learning is considered lazy learning because:

  • It does not train in advance (no model fitting).
  • It decides at prediction time by calculating distances and voting based on stored examples.
  • All of the model’s intelligence is concentrated in these steps: calculating distances, selecting neighbors, and majority voting.


 ✅ Practical Conclusion

 KNN learns by direct comparison, without abstracting functions or creating complex mathematical generalizations.

➔ Its strength lies in its simplicity and geometric clarity, making it excellent for small datasets where spatial relationships between points are clear.


🗂️ 9. TECHNICAL SUMMARY – SCRIPT EXPLANATION TABLE

Line of Code ➔ Function:

  • def euclidean_distance(p1, p2): ➔ Defines distance function (fundamental for proximity logic).
  • distances = [euclidean_distance(x_new[0], x) for x in X_train] ➔ Calculates distances to all points (magic: determines closeness).
  • k_indices = np.argsort(distances)[:k] ➔ Sorts and finds k closest neighbors (defines neighborhood).
  • k_labels = y_train[k_indices] ➔ Extracts class labels of neighbors (prepares for voting).
  • Counter(k_labels).most_common(1)[0][0] ➔ Voting and final decision (magic: predicted class).
  • knn = KNeighborsClassifier(n_neighbors=3) ➔ Initializes sklearn model.
  • knn.fit(X_train, y_train) ➔ Stores training data.
  • knn.predict(x_new) ➔ Runs prediction (magic: automated proximity + voting).


📍 Important observations:

  • 📌 KNN does not learn in advance – it reacts in real time.
  • 🧭 Decision is based purely on closeness.
  • 🧠 Learning is implicit, by spatial reasoning.
  • 🕵️‍♂️ Noisy or imbalanced data may hinder generalization.


📌 10. WHEN TO USE THIS TYPE OF LEARNING?

Ideal when:

  • ✅ You have small or simple datasets
  • ✅ You want minimal preprocessing
  • ✅ Classes are clearly separated by geometry
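
One practical caveat: because KNN decides purely by distance, a feature with a much larger numeric range can dominate the vote. A common safeguard (a minimal sketch, not part of the original script) is to standardize the features before fitting:

from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# Pipeline that rescales features to zero mean / unit variance before KNN
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
# model.fit(X_train, y_train); model.predict(x_new)  # same API as the plain classifier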


 🔎 11. VISUAL INTERPRETATION

🟢 1. Graph – Manual KNN Prediction

  • Visualization: Displays training data points in blue (class Low) and red (class High). The new customer is marked with a green star.
  • Interpretation: The new customer at (20,4) was classified as Low. ➔ This result comes from the manual calculation of Euclidean distances, where its 3 nearest neighbors belong to class Low.


  • Learning Insight: Demonstrates correct implementation of manual KNN, reinforcing the geometric concept of proximity-based classification.


🟣 2. Graph – Sklearn KNN Prediction

  • Visualization: Shows the same training data with the new customer represented by a purple star.
  • Interpretation: The sklearn implementation also classified the new customer as Low. ➔ Confirms that the library function produces the same output as the manual method.


  • Learning Insight: Highlights that sklearn automates distance calculations, neighbor selection, and majority voting internally for efficiency in production.


💻 3. VSCode Terminal – Script Execution Results

  • First execution (manual_knn.py): prints “The predicted class for point [3 4] is: A” ➔ Demonstrates an earlier example using different data.
  • Second execution (knn_manual_function.py): prints “[Manual KNN] Predicted class: Low” and “[Sklearn KNN] Predicted class: Low”
  • Conclusion: ✔ Both manual implementation and sklearn function return the same predicted class (Low). ✔ Confirms correctness of the manual implementation and reliability of the sklearn classifier for real-world use.


✅ Overall Summary

  • Manual implementation builds foundational understanding of KNN.
  • Sklearn provides a scalable, production-ready approach with identical logic.


🛠️ 12. PRACTICAL APPLICATIONS

  • Customer segmentation
  • Medical diagnosis
  • Pattern recognition
  • Recommendation systems


💡 13. EXTRA TIP

Test both methods side by side to strengthen your understanding and ensure consistency in your models.
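
For the script in Section 6, that check can be a single line added at the end (assuming the variables pred_manual and pred_sklearn defined there):

# Sanity check: both implementations should predict the same class
assert pred_manual == pred_sklearn[0], "Manual and sklearn predictions differ"
print("Both methods agree on:", pred_manual)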


📅 14. CTA – Follow & Connect

💼 LinkedIn & Newsletters: 👉 https://www.linkedin.com/in/izairton-oliveira-de-vasconcelos-a1916351/ 👉 https://www.linkedin.com/newsletters/scripts-em-python-produtividad-7287106727202742273 👉 https://www.linkedin.com/build-relation/newsletter-follow?entityUrn=7319069038595268608

💼 Company Page: 👉 https://www.linkedin.com/company/106356348/

💻 GitHub: 👉 https://github.com/IOVASCON


 

