Project Report on
Bachelor of Engineering
in
Information Science and Engineering
Submitted by
TUNEER SAHA 1DS21ET099
YESHWANTH REDDY 1DS21ET111
SHIVKUMAR KHOT 1DS22ET418
PRAKASH R 1DS22ET412
CERTIFICATE
Certified that the project report entitled “OBJECT DETECTION USING YOLOv10 ML MODEL
institution affiliated to VTU, Belagavi in partial fulfillment for the award of Degree of Bachelor of
Electronics and Telecommunication Engineering during the year 2024-2025. It is certified that all
corrections/suggestions indicated for Internal Assessment have been incorporated in the report deposited
in the departmental library. The project report has been approved as it satisfies the academic requirements
1. ........................................... ..........................................
2. ........................................... ..........................................
DAYANANDA SAGAR COLLEGE OF ENGINEERING
(An Autonomous Institute affiliated to Visvesvaraya Technological University (VTU), Belagavi,
Approved by AICTE and UGC, Accredited by NAAC with ‘A’ grade & ISO 9001 – 2015 Certified Institution)
Shavige Malleshwara Hills, Kumaraswamy Layout, Bengaluru-560 111, India
DECLARATION
We further declare that we have not submitted this report either in part or in full to any other university
for the award of any degree.
PLACE:
DATE:
ACKNOWLEDGEMENT
The satisfaction and euphoria accompanying the successful completion of any task would be incomplete
without the mention of people who made it possible and under constant guidance and encouragement the
task was completed. We sincerely thank the Management of Dayananda Sagar College of
Engineering, Bengaluru.
We express our sincere regards and thanks to Dr. B G Prasad, Principal, Dayananda Sagar College
of Engineering, Bengaluru. His constant encouragement, guidance, and valuable support have been an
immense help in realizing this project.
We express our sincere regards and thanks to Dr. Annapurna P Patil, Professor & Head, Department
of Information Science and Engineering, Dayananda Sagar College of Engineering, Bengaluru. Her
constant encouragement, guidance, and valuable technical support have been an immense help in realizing
this project. Her guidance gave us the environment to enhance our knowledge and skills and to reach the
pinnacle with sheer determination, dedication, and hard work.
We would like to express profound gratitude to our guide Guide name, designation, Department of
Information Science and Engineering, Dayananda Sagar College of Engineering, Bengaluru, who
has encouraged us throughout the project. His/Her moral support enabled us to complete our work
successfully.
We express our sincere thanks to Project Coordinator Dr. Vaidehi M, Assoc. Prof, and Dr. Bhavani K,
Asst. Prof. of the Department of Information Science and Engineering for their continuous support
and guidance. We thank all teaching and non-teaching staff of the Department of Information Science
and Engineering for their kind and constant support throughout the academic journey.
Object detection is a cornerstone of computer vision, enabling systems to identify and localize objects
within images or video streams. YOLOv10, a recent iteration of the "You Only Look Once"
framework, is celebrated for its balance of speed and accuracy, making it suitable for real-time
applications. This project presents the practical implementation and deployment of a YOLOv10-based
object detection system on Amazon Web Services (AWS), with the specific outcome of streaming
video from the server to users' phones, computers, and laptops via a provided address.
The deployment involves hosting the application on an AWS EC2 instance, where a Flask backend is
deployed using Gunicorn for handling API requests. CloudFront is integrated as a Content Delivery
Network (CDN) to enhance performance and reduce latency, while AWS ACM provides SSL
certificates for secure communication. Route 53 manages DNS hosting to ensure reliable domain
resolution. The backend leverages the Ultralytics library and OpenCV (cv2) for model inference, while
Flask and jsonify facilitate API responses.
This architecture enables seamless video streaming to multiple devices, showcasing YOLOv10's
effectiveness in real-world scenarios. Applications range from surveillance and security systems to e-
commerce and interactive experiences. Future work suggests exploring edge computing to improve
real-time responsiveness further, ensuring scalability and performance in diverse environments.
Keywords: YOLOv10, Object Detection, AWS EC2, CloudFront CDN, AWS Route 53, Scalable Cloud
Infrastructure
Table of Contents
ABSTRACT
ACKNOWLEDGMENT
LIST OF TABLES
LIST OF FIGURES
LIST OF ABBREVIATIONS AND SYMBOLS
1. INTRODUCTION
1.1 Overview
1.2 Problem Statement
1.3 Objectives
1.4 Motivation
2. LITERATURE SURVEY
3. PROBLEM ANALYSIS & DESIGN
3.1 Analysis
3.2 Hardware Requirements
3.3 Software Requirements
3.4 System Architecture Diagram
3.5 Data flow Diagram
3.6 Use Case Diagram
3.7 Sequence Diagram
4. IMPLEMENTATION
4.1 Overview of System Implementation
4.2 Module Description
4.3 Algorithms
4.4 Code Snippets
5. TESTING
5.1 Unit Test Cases
5.2 Integration Test Cases
6. RESULTS
6.1 Results and Analysis
7. CONCLUSION AND FUTURE SCOPE
7.1 Conclusion
7.2 Future Scope
REFERENCES
PUBLICATION DETAILS
PLAGIARISM REPORT
APPENDIX (IF ANY)
INTRODUCTION
Real-time object detection plays a significant role in domains such as video surveillance,
computer vision, autonomous driving, and robotics. Object detection is widely
used to count objects in a scene, track their precise locations, and label them accurately. It
seeks to answer two questions: where is the object, and what is the object?
Image localization refers to the process of finding a single object in an image, while
object detection refers to the process of finding several objects in an image. The objective of this
project is to detect objects using the You Only Look Once (YOLO) approach and to deploy the system
on AWS. This method has several advantages over other object detection algorithms.
The YOLO algorithm has emerged as a popular solution for real-time object
detection because it detects objects in a single pass through the neural network. Tasks such as
recognition, detection, and localization find wide applicability in real-world
scenarios, which makes object detection a crucial subfield of computer vision. The high accuracy
of the YOLO model lets us identify the objects in each frame.
For the project involving YOLOv10-based object detection deployed on AWS, the integration
of various technologies plays a crucial role in ensuring performance, scalability, and ease of use.
The object detection model is hosted on an AWS EC2 instance, which provides the necessary
compute power to run the YOLOv10 model efficiently. Flask is used as the backend framework
to handle API requests, facilitating communication between the frontend and the model. To
manage incoming traffic and improve response times, Gunicorn is deployed as a WSGI server,
handling multiple requests concurrently.
For global content delivery and to minimize latency, Amazon CloudFront serves as a Content
Delivery Network (CDN), caching the video stream and reducing the load on the server. The
jsonify function in Flask ensures that data is easily returned in JSON format, making it suitable
for API-based responses. To manage the application's domain and route user traffic reliably,
AWS Route 53 is used for DNS management, ensuring users can access the application through
a custom domain with low latency.
Flask: Flask is a lightweight, Python-based web framework widely used for developing web
applications and APIs. Its simplicity and flexibility make it ideal for projects requiring rapid
development and easy scalability. Flask supports extensions for advanced functionalities like
database integration, authentication, and API handling. It is particularly effective for deploying
machine learning models, where frameworks like Gunicorn can be used to handle high-traffic
requests. Flask’s seamless integration with libraries like OpenCV and Ultralytics allows for
efficient implementation of computer vision tasks, such as YOLOv10-based object detection.
This makes Flask a popular choice for real-time applications requiring secure, scalable, and low-
latency solutions.
AWS: Amazon Web Services (AWS) is a comprehensive cloud platform that provides on-demand
computing resources, storage, and services to build and deploy applications at scale.
Among its services, Amazon EC2 (Elastic Compute Cloud) stands out as a virtual server
solution that offers scalable compute capacity in the cloud. EC2 instances allow users to run
applications with customized configurations, choosing from various operating systems, storage
options, and instance types tailored to specific performance needs. With features like
auto-scaling, load balancing, and secure networking, EC2 is ideal for hosting applications such as web
servers, databases, or machine learning models. Its flexibility and pay-as-you-go pricing make it
a key component for deploying robust, scalable, and efficient cloud-based solutions.
CloudFront: Amazon CloudFront is a Content Delivery Network (CDN) service that distributes content
globally with low latency and high transfer speeds. By caching content at edge locations
worldwide, it accelerates the delivery of static and dynamic web content, such as HTML, images,
videos, and API responses. CloudFront can be integrated with other AWS services like S3 for
storage or EC2 for serving dynamic content, improving performance and reducing latency for
users regardless of their geographical location.
Gunicorn: Gunicorn (Green Unicorn) is a Python WSGI (Web Server Gateway Interface) HTTP
server that is widely used to serve Python web applications, particularly those built with
frameworks like Flask and Django. It acts as an intermediary between the web server (like
Nginx) and the application, handling incoming HTTP requests and passing them to the Python
application for processing. Gunicorn is known for its performance and scalability, supporting
multiple worker processes to handle multiple requests simultaneously. This makes it ideal for
production environments where handling a high volume of requests with minimal latency is
crucial.
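A Gunicorn deployment of this kind is usually driven by a small configuration file. The sketch below is a hypothetical gunicorn.conf.py; the port, the timeout, and the worker count (a rule of thumb suggested in the Gunicorn documentation) are assumptions for illustration, not values taken from this project.

```python
# gunicorn.conf.py (hypothetical settings, not taken from the project)
import multiprocessing

# Address the worker processes listen on; a reverse proxy or CDN origin
# would forward requests to this port.
bind = "0.0.0.0:8000"

# Rule of thumb from the Gunicorn docs: (2 x CPU cores) + 1 worker processes.
workers = multiprocessing.cpu_count() * 2 + 1

# Seconds a worker may stay silent before it is killed and restarted.
# Raised above the default because model inference can be slow (assumption).
timeout = 120
```

Assuming the Flask object is named app inside app.py, the server could then be started with `gunicorn -c gunicorn.conf.py app:app`. Each worker is a separate process, so memory use grows with the worker count when every worker loads its own copy of the model.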
jsonify: In Flask, jsonify is a function used to easily convert Python objects (like dictionaries or
lists) into JSON format. It sets the correct MIME type (application/json) and makes it convenient
to return JSON responses from Flask routes, which is particularly useful for building APIs. It
simplifies the process of structuring data for client-side applications, making it ideal for web
applications that need to send structured data between the frontend and backend.
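As a minimal sketch of the behaviour described above, the following Flask route returns a dictionary through jsonify. The route name /health and the fields in the payload are hypothetical and chosen only for illustration; they are not part of the project's API.

```python
from flask import Flask, jsonify

app = Flask(__name__)


@app.route("/health")  # hypothetical route, for illustration only
def health():
    # jsonify serializes the dict and sets the response MIME type
    # to application/json.
    return jsonify({
        "status": "ok",
        "detections": [{"label": "person", "confidence": 0.91}],
    })


if __name__ == "__main__":
    # Exercise the route with Flask's built-in test client, no server needed.
    with app.test_client() as client:
        response = client.get("/health")
        print(response.mimetype)
        print(response.get_json())
```

The test client is used here so the example can be run without opening a network port; in deployment the same app object would be served by Gunicorn.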
Python: Python is a high-level, interpreted programming language known for its readability,
simplicity, and versatility. It supports multiple programming paradigms, including procedural,
object-oriented, and functional programming. Python has a rich standard library and a large
ecosystem of third-party libraries, making it ideal for a wide range of applications, from web
development and data analysis to machine learning and automation. Python's ease of use and
community support have made it one of the most popular programming languages globally.
1.1 Overview
Object detection has become an essential technology in computer vision, enabling machines
to identify and locate objects within images or videos. This project aims to develop a robust,
scalable, and real-time object detection application using the YOLOv10 (You Only Look
Once) model. By leveraging state-of-the-art machine learning techniques and cloud
infrastructure provided by AWS, the system offers efficient processing and seamless
deployment.
The application is designed to detect objects in uploaded images, annotate them with
bounding boxes, and return results via a web interface. Technologies like Flask, AWS, and
Gunicorn are integrated to ensure high performance and scalability.
Key Features
Technologies Used
1. YOLOv10 Model:
o A pre-trained YOLOv10 model is used for its speed and accuracy in object
detection tasks. It processes the image in a single pass, delivering high
performance.
2. Flask:
o A lightweight Python web framework handles the backend server and manages
API routes.
3. AWS (Amazon Web Services):
o EC2 Instance: Hosts the application and provides compute resources.
o CloudFront (optional): Enhances content delivery speed using a Content
Delivery Network (CDN).
4. Gunicorn:
o A production-ready WSGI server that ensures efficient handling of concurrent
user requests.
5. Python Libraries:
o OpenCV: For image decoding, preprocessing, and annotation.
o NumPy: For numerical computations and image array manipulations.
o Ultralytics YOLO API: For easy integration and inference using YOLOv10.
6. Frontend Technologies:
o HTML/CSS/JavaScript: Power the user interface, enabling image uploads and
result display.
o AJAX: Enables asynchronous communication between the frontend and backend.
Application Workflow
1. Image Upload:
o Users upload an image through the web interface. JavaScript converts the image
into a Base64-encoded string for transmission.
2. Backend Processing:
o Flask receives the encoded image via the /process_frame API endpoint.
o The image is decoded and passed to the YOLOv10 model for inference.
3. Object Detection:
o YOLOv10 detects objects, their bounding boxes, and confidence scores.
o Results are drawn as bounding boxes on the image using OpenCV.
4. Result Transmission:
o The processed image is re-encoded into Base64 format.
o JSON response contains the image and detection details, which are sent back to
the frontend.
5. Frontend Display:
o The processed image is displayed with bounding boxes and detection details.
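The Base64 and JSON steps of the workflow above can be sketched with the Python standard library alone. The helper names and the JSON field names ("image", "detections") are assumptions made for this illustration, and the model inference and OpenCV drawing steps are deliberately left out.

```python
import base64
import json


def encode_image(image_bytes: bytes) -> str:
    """Base64-encode raw image bytes so they can travel inside JSON."""
    return base64.b64encode(image_bytes).decode("ascii")


def decode_image(payload: str) -> bytes:
    """Decode a Base64 string, dropping a 'data:image/...;base64,' prefix
    if the browser sent a data URL."""
    if payload.startswith("data:") and "," in payload:
        payload = payload.split(",", 1)[1]
    return base64.b64decode(payload)


def build_response(annotated_bytes: bytes, detections: list) -> str:
    """Assemble the JSON body returned to the frontend
    (field names are assumptions)."""
    return json.dumps({
        "image": encode_image(annotated_bytes),
        "detections": detections,
    })


if __name__ == "__main__":
    fake_jpeg = b"\xff\xd8\xff\xe0 placeholder bytes"  # stands in for a real JPEG
    request_body = "data:image/jpeg;base64," + encode_image(fake_jpeg)
    received = decode_image(request_body)  # bytes the backend would pass on
    body = build_response(
        received,
        [{"label": "dog", "confidence": 0.88, "box": [10, 20, 110, 220]}],
    )
    print(json.loads(body)["detections"])
```

Base64 makes the payload roughly one third larger than the raw bytes, which is the price of keeping the whole exchange inside a single JSON request and response.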
Deployment on AWS
The application is deployed on AWS to ensure high availability, scalability, and performance.
Key AWS services used include:
EC2 Instance:
o Hosts the Flask application and the YOLOv10 model.
CloudFront:
o Optimizes content delivery for global users.
Gunicorn: For efficient request handling.
HTTPS: For secure data transmission.
Object detection is a critical aspect of modern computer vision, enabling systems to identify and
locate objects within images or videos. This technology has applications in diverse fields,
including surveillance, retail, healthcare, and autonomous vehicles. Despite its potential,
implementing an effective and scalable object detection system remains a significant challenge.
Key issues include the need for high accuracy, low latency, ease of deployment, and scalability
across diverse use cases.
This project focuses on addressing these challenges by developing a real-time object detection
system using the YOLOv10 (You Only Look Once) model, integrated with a web application
and deployed on AWS infrastructure. The system is designed to deliver accurate and efficient
detection results while ensuring ease of use for non-technical users through an intuitive interface.
The project aims to tackle several critical challenges associated with object detection systems:
Proposed Solution
This project proposes a real-time object detection system that addresses the above challenges
through the following features and technologies:
o The YOLOv10 model is employed for its speed and accuracy. Its lightweight
architecture ensures efficient inference without compromising detection
performance.
2. Integration with Flask Framework:
o Flask serves as the backend framework, facilitating communication between the
frontend and the object detection pipeline. It ensures low-latency processing of
uploaded images.
3. Scalable Deployment on AWS:
o The application is hosted on AWS EC2 instances, providing scalable compute
resources. Optional services like CloudFront ensure low-latency global content
delivery.
4. User-Friendly Web Interface:
o The system includes a web-based frontend where users can upload images and
view detection results. This eliminates the need for specialized software or
technical expertise.
5. Asynchronous Communication:
o The use of JavaScript and AJAX ensures real-time interaction between the
frontend and backend, allowing seamless image uploads and result retrieval
without page reloads.
6. Optimized Resource Usage:
o Gunicorn is used as the production server to handle concurrent requests
efficiently. The Flask application and YOLO model are optimized to minimize
resource usage while maintaining high performance.
Goals:
Accuracy: Achieve reliable detection with a focus on minimizing false positives and
false negatives.
Scalability: Ensure the system can handle high traffic and large datasets.
Accessibility: Develop an easy-to-use web application that requires minimal technical
knowledge.
Performance Optimization: Optimize the application for speed and cost-efficiency in
cloud environments.
1.2 Objectives
The primary objective of this project is to design, develop, and deploy a real-time object
detection system capable of accurately identifying and localizing objects within images. By
leveraging the YOLOv10 model, modern web technologies, and cloud infrastructure, the
project seeks to create an efficient, scalable, and user-friendly solution suitable for various
real-world applications. The system aims to empower users with an accessible and robust
tool that delivers precise results in real-time.
Specific Objectives
o Develop an intuitive web interface where users can easily upload images, view
detection results, and interact with the application without requiring technical
expertise.
5. Dynamic Handling of Object Classes:
o Ensure the system can adapt to detect a variety of object classes and provide
detailed information about each detected object, including its confidence score
and precise location.
6. Efficient Backend Processing:
o Integrate Flask as the backend framework to handle API requests, process images,
and return detection results in a structured JSON format. Use Gunicorn to enable
the application to manage multiple concurrent requests efficiently.
7. Interactive Frontend Design:
o Use JavaScript and AJAX to create an interactive and responsive user experience.
Enable real-time communication between the frontend and backend for uploading
images and retrieving results dynamically.
8. Optimized Resource Utilization:
o Optimize compute resource usage on AWS by selecting an appropriate instance
type, fine-tuning the YOLOv10 model, and minimizing latency and resource
consumption.
9. Visualization of Detection Results:
o Provide visual feedback to the user by overlaying bounding boxes, class names,
and confidence scores on the uploaded image, making the results easy to interpret.
10. Security and Reliability:
o Implement secure communication protocols (e.g., HTTPS) to protect user data.
Ensure the application is reliable and capable of handling large-scale deployment
scenarios.
1.3 Motivation
The motivation for this project stems from the need to address these challenges and provide a
robust object detection solution that can serve diverse real-world applications. Leveraging state-
of-the-art technology like the YOLOv10 model, combined with scalable cloud deployment and
an intuitive user interface, this project aims to bridge the gap between advanced AI capabilities
and practical usability.
Key Motivation
Many existing object detection systems struggle to achieve a balance between accuracy
and speed. Real-time processing is essential for applications such as surveillance,
autonomous vehicles, and industrial automation. The YOLOv10 model, known for its
high-speed inference and accuracy, serves as a foundation to meet this need. This project
aims to optimize the model's performance further, ensuring it can handle real-world
scenarios effectively.
Deploying object detection systems often requires significant technical expertise and
resources, making them inaccessible to smaller organizations or individuals. By utilizing
AWS cloud infrastructure, this project demonstrates how such systems can be deployed
efficiently, ensuring global accessibility, scalability, and cost-effectiveness.
Most AI-powered tools require specialized knowledge to operate, limiting their usability
for non-technical users. This project aims to build an intuitive web application that allows
users to upload images and retrieve object detection results effortlessly, lowering the
barrier to entry for AI technologies.
Literature Survey
Object detection has undergone remarkable advancements, transitioning from traditional
methods relying on handcrafted features and algorithms like SIFT, HOG, and Viola-Jones, to
modern deep learning frameworks that leverage neural networks. Early approaches were limited
in robustness and scalability, but the introduction of Convolutional Neural Networks (CNNs)
revolutionized the field with models like R-CNN, Fast R-CNN, and YOLO, which brought
significant improvements in speed and accuracy. Frameworks like SSD and RetinaNet further
enhanced real-time performance and detection of smaller objects. More recent iterations, such as
YOLOv5 and YOLOv10, leverage architectural innovations, data augmentation, and pre-trained
models to push the boundaries of precision and computational efficiency, making object
detection integral to applications like surveillance, autonomous vehicles, and healthcare.
1. The paper “You Only Look Once: Unified, Real-Time Object Detection” by Joseph Redmon,
Santosh Divvala, Ross Girshick, and Ali Farhadi is a cornerstone of modern object detection and
directly relevant to this project. Introduced at CVPR 2016, the YOLO algorithm revolutionized object
detection by framing it as a single regression problem, eliminating the need for multi-stage
processing seen in earlier methods like R-CNN and Fast R-CNN. YOLO uses a unified
convolutional neural network to process an entire image in a single pass, predicting bounding
boxes and class probabilities simultaneously. This approach enables real-time detection,
achieving speeds of up to 45 frames per second, making it suitable for applications requiring
high throughput. The algorithm divides the image into a grid, with each cell responsible for
detecting the objects whose centers fall inside it and predicting their locations, classes, and
confidence scores. By leveraging the global context of the image, YOLO achieves competitive
accuracy while being significantly faster than traditional methods. This innovation laid the
groundwork for subsequent YOLO iterations, including YOLOv10 used in this project, further
enhancing speed and precision for real-time object detection tasks.
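The grid assignment described above can be illustrated with a short sketch. The defaults S = 7, B = 2, and C = 20 are the settings reported in the original YOLO paper for PASCAL VOC; they do not describe YOLOv10, and the function names are hypothetical.

```python
def responsible_cell(cx: float, cy: float, img_w: int, img_h: int, s: int = 7):
    """Return (row, col) of the grid cell that contains the box center
    (cx, cy), given in pixels."""
    # Clamp so a center lying exactly on the right/bottom edge stays in range.
    col = min(int(cx / img_w * s), s - 1)
    row = min(int(cy / img_h * s), s - 1)
    return row, col


def output_size(s: int = 7, b: int = 2, c: int = 20) -> int:
    """Values predicted per image: S x S cells, each with B boxes
    (x, y, w, h, confidence) plus C class probabilities."""
    return s * s * (b * 5 + c)


if __name__ == "__main__":
    print(responsible_cell(224, 224, 448, 448))
    print(output_size())
```

Because every cell predicts a fixed number of values, the whole detection output is a single tensor produced in one forward pass, which is what allows detection to be treated as regression.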
3. “Rapid Object Detection using a Boosted Cascade of Simple Features” (2001): In the early
stages of object detection, spanning the 1990s to the early 2000s, methods primarily relied on
handcrafted features and classical machine learning algorithms. A notable approach was the Haar
cascades method introduced by Viola and Jones in their seminal paper “Rapid Object Detection
using a Boosted Cascade of Simple Features” (2001). This method revolutionized face detection
with its ability to perform real-time detection using simple Haar-like features combined with
AdaBoost for feature selection. Haar cascades were computationally efficient and suitable for
real-time applications on limited hardware, making them a landmark in object detection.
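Much of the efficiency of Haar-like features comes from the integral image (summed-area table) used in the Viola and Jones paper, which lets any rectangle sum be read with four lookups. The sketch below is a plain-Python illustration with hypothetical function names; the sign convention of the two-rectangle feature is an assumption, and the AdaBoost selection and cascade stages are not shown.

```python
def integral_image(img):
    """Summed-area table with an extra zero row and column:
    ii[y][x] holds the sum of all pixels above and to the left of (x, y)."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii


def rect_sum(ii, x, y, w, h):
    """Sum of the pixels in the rectangle with top-left corner (x, y),
    width w and height h, using four table lookups."""
    return ii[y + h][x + w] - ii[y][x + w] - ii[y + h][x] + ii[y][x]


def two_rect_feature(ii, x, y, w, h):
    """Horizontal two-rectangle Haar-like feature:
    sum of the right half minus sum of the left half."""
    half = w // 2
    return rect_sum(ii, x + half, y, half, h) - rect_sum(ii, x, y, half, h)


if __name__ == "__main__":
    image = [[1, 2, 3, 4],
             [5, 6, 7, 8]]
    table = integral_image(image)
    print(rect_sum(table, 0, 0, 4, 2), two_rect_feature(table, 0, 0, 4, 2))
```

The table is built once per image, after which each feature costs a constant number of additions regardless of the rectangle size.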
5. Navneet Dalal and Bill Triggs, “Histograms of Oriented Gradients for Human Detection”,
Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern
Recognition (CVPR), 2005. The paper “Histograms of Oriented Gradients for Human
Detection” by Navneet Dalal and Bill Triggs introduced a robust feature descriptor, HOG, that
significantly advanced object detection, particularly for identifying humans in images. The
method focuses on capturing the distribution of intensity gradients and edge orientations, which
are critical for recognizing shapes and appearances. HOG divides an image into small cells,
calculates gradient orientations in each cell, and compiles this information into histograms.
These histograms are normalized across overlapping regions, enhancing invariance to lighting
and contrast changes.
By pairing HOG with a Support Vector Machine (SVM) classifier, Dalal and Triggs
demonstrated superior detection rates, especially on the INRIA Person Dataset, outperforming
earlier methods like Haar cascades. The descriptor's ability to extract detailed structural features
made it robust to partial occlusions and background clutter, addressing limitations in previous
feature-based methods. This work laid the foundation for modern feature extraction and
influenced the transition to automated, deep learning-based approaches like YOLO, which
further enhanced efficiency and accuracy.
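The per-cell histogram and normalization steps of HOG can be sketched in plain Python. This is a simplified illustration: it uses nine unsigned orientation bins as in the paper, but it skips the interpolated voting and the overlapping block layout of the full descriptor, and the function names are hypothetical.

```python
import math


def cell_histogram(cell, bins: int = 9):
    """Orientation histogram for one cell (a 2-D list of grey levels).

    Gradients use centered differences on interior pixels; each pixel votes
    with its gradient magnitude into an unsigned (0 to 180 degree) bin.
    """
    hist = [0.0] * bins
    bin_width = 180.0 / bins
    for y in range(1, len(cell) - 1):
        for x in range(1, len(cell[0]) - 1):
            gx = cell[y][x + 1] - cell[y][x - 1]
            gy = cell[y + 1][x] - cell[y - 1][x]
            magnitude = math.hypot(gx, gy)
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            hist[min(int(angle / bin_width), bins - 1)] += magnitude
    return hist


def l2_normalize(vector, eps: float = 1e-6):
    """L2-normalize a block descriptor to reduce sensitivity to contrast."""
    norm = math.sqrt(sum(v * v for v in vector) + eps * eps)
    return [v / norm for v in vector]


if __name__ == "__main__":
    vertical_edge = [[0, 0, 10, 10]] * 4  # dark left half, bright right half
    print(cell_histogram(vertical_edge))
```

In the full method the normalized histograms of all blocks in a detection window are concatenated into one feature vector, which is then passed to the SVM classifier.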
6. Krizhevsky, A., Sutskever, I., & Hinton, G. E., “ImageNet Classification with Deep
Convolutional Neural Networks”, Advances in Neural Information Processing Systems
(NeurIPS), 2012. AlexNet revolutionized the field of computer vision by demonstrating the
power of deep learning with Convolutional Neural Networks (CNNs). This paper introduced a
deep CNN architecture that significantly outperformed traditional machine learning models on
the ImageNet Large Scale Visual Recognition Challenge (ILSVRC). The network, consisting of
five convolutional layers and three fully connected layers, employed techniques such as
Rectified Linear Units (ReLU) for non-linearity, dropout for regularization, and data
augmentation to reduce overfitting. The success of AlexNet proved the viability of deep learning
for large-scale image classification, setting the stage for its application in various computer
vision tasks, including object detection, which is the foundation for modern approaches like
YOLO.
7. Girshick, R., Donahue, J., Darrell, T., & Malik, J., “Rich Feature Hierarchies for Accurate
Object Detection and Semantic Segmentation”, Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition (CVPR), 2014. The Region-based Convolutional
Neural Network (R-CNN) introduced by Girshick et al. in 2014 significantly advanced object
detection by combining the power of CNNs with region proposal techniques. R-CNN first
generates candidate object regions using selective search, then extracts CNN-based features from
these regions for classification using a support vector machine (SVM). This two-stage pipeline
resulted in substantial improvements in detection accuracy over previous methods. However, R-
CNN was computationally expensive because it required running the CNN on each proposed
region independently, making it slow and less practical for real-time applications. Despite these
drawbacks, R-CNN set the foundation for subsequent improvements in object detection.
8. Girshick, R., “Fast R-CNN”, Proceedings of the IEEE International Conference on Computer
Vision (ICCV), 2015. Fast R-CNN was an optimization of the original R-CNN, introduced by
Girshick in 2015. Unlike R-CNN, which extracted features from each region proposal separately,
Fast R-CNN extracted features from the entire image in a single pass through the CNN. It then
used a Region of Interest (RoI) pooling layer to crop the feature maps corresponding to each
proposed region, improving both speed and accuracy. Fast R-CNN also replaced the SVM
classifier with a softmax layer, simplifying the training process. This approach significantly
reduced computation time, making the model more suitable for practical applications, though it
still required external region proposal methods, such as selective search, which hindered its real-
time performance.
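The RoI pooling idea can be shown on a single-channel feature map. The sketch below is a simplified, plain-Python illustration with a hypothetical function name: it takes a region already expressed in feature-map cells and max-pools it into a fixed-size grid, whereas a real implementation works on multi-channel tensors and first maps image coordinates onto the feature map.

```python
def roi_max_pool(feature_map, roi, out_h: int = 2, out_w: int = 2):
    """Max-pool the region roi = (x0, y0, x1, y1), with x1 and y1 exclusive
    and measured in feature-map cells, into a fixed out_h x out_w grid."""
    x0, y0, x1, y1 = roi
    roi_h, roi_w = y1 - y0, x1 - x0
    pooled = []
    for i in range(out_h):
        # Bin boundaries along y; max() guarantees each bin has >= 1 row.
        ys = y0 + (i * roi_h) // out_h
        ye = max(y0 + ((i + 1) * roi_h) // out_h, ys + 1)
        row = []
        for j in range(out_w):
            xs = x0 + (j * roi_w) // out_w
            xe = max(x0 + ((j + 1) * roi_w) // out_w, xs + 1)
            row.append(max(feature_map[y][x]
                           for y in range(ys, ye)
                           for x in range(xs, xe)))
        pooled.append(row)
    return pooled


if __name__ == "__main__":
    fmap = [[1, 2, 3, 4],
            [5, 6, 7, 8],
            [9, 10, 11, 12],
            [13, 14, 15, 16]]
    print(roi_max_pool(fmap, (0, 0, 4, 4)))
    print(roi_max_pool(fmap, (1, 1, 4, 4)))
```

Regions of different sizes all produce the same output shape, which is what allows a single fully connected classifier to follow the shared convolutional features.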
9. Pedro F. Felzenszwalb, Ross B. Girshick, David McAllester, and Deva Ramanan, “Object
Detection with Discriminatively Trained Part-Based Models”, IEEE Transactions on Pattern
Analysis and Machine Intelligence, 2008. The paper “Object Detection with Discriminatively
Trained Part-Based Models” by Felzenszwalb et al., published in 2008, introduced a significant
advancement in object detection through the use of part-based models. This approach focused on
detecting objects by modeling them as a collection of parts, each of which could be detected
individually and then combined to form the full object. The authors proposed a discriminative
training approach, where the model was trained to distinguish between positive object instances
and negative background examples. The key innovation was the use of a deformable part model
(DPM), which allowed for variations in object appearance and pose while maintaining a robust
detection framework. By incorporating the spatial arrangement of parts and using efficient
algorithms for part-based detection, DPMs achieved higher accuracy and robustness compared to
earlier methods. While the methods from this paper are not directly used in this project, the
concepts introduced in DPMs, particularly the idea of part-based representation and
discriminative training, influenced later object detection methods, including those integrated in
YOLO. These developments laid the groundwork for more efficient and flexible models that can
detect objects with varied appearances and in different configurations.
10. AWS Documentation, “Amazon EC2, S3, and Elastic Beanstalk Services”: The AWS
documentation on “Amazon EC2, S3, and Elastic Beanstalk Services” provides essential
guidance for deploying scalable and efficient applications in the cloud, making it highly relevant
to the project. Amazon EC2 (Elastic Compute Cloud) offers resizable compute capacity, which
is crucial for hosting the YOLOv10-based object detection model and handling the
computational demands of real-time inference. Amazon S3 (Simple Storage Service) is used for
storing and retrieving large datasets, such as the model weights, input images, and processed
outputs, with high durability and availability. Elastic Beanstalk, a Platform-as-a-Service (PaaS),
simplifies the deployment and management of the Flask-based web application by automating
provisioning, load balancing, and scaling, thereby reducing the overhead of managing
infrastructure. Together, these services enable seamless deployment and scalability of the
project, ensuring robust performance even under varying workloads. The integration of these
AWS services not only supports the computational needs of the project but also ensures cost-
effective and secure hosting, vital for real-world applications.
11. Bochkovskiy, A., Wang, C., & Liao, H., “YOLOv4: Optimal Speed and Accuracy of Object Detection”,
arXiv preprint, 2020. The paper “YOLOv4: Optimal Speed and Accuracy of Object Detection” by
Bochkovskiy, Wang, and Liao (2020) introduced significant advancements to the YOLO (You Only Look
Once) series, making it a pivotal reference for modern object detection models like YOLOv10 used in the
project. YOLOv4 focused on enhancing both accuracy and efficiency, achieving a balance crucial for real-
time applications. The authors incorporated innovative techniques such as Cross-Stage Partial (CSP)
connections to reduce computational complexity and improve gradient flow, Mish activation for
smoother optimization, and mosaic data augmentation to diversify the training data without additional
annotation efforts. Additionally, YOLOv4 leveraged advancements in training strategies like SAT (Self-
Adversarial Training) and CIoU (Complete Intersection over Union) loss to improve object localization
and detection accuracy. These enhancements enabled YOLOv4 to deliver superior performance on
standard benchmarks while maintaining high inference speed, making it suitable for deployment in real-
world scenarios. The foundational concepts and architectural optimizations introduced in YOLOv4
directly influenced the iterative improvements leading to YOLOv10, which integrates these principles
alongside newer technologies such as attention mechanisms for even greater accuracy and versatility.
This lineage underscores YOLOv4's critical role in shaping the advancements employed in the project.
12. Ultralytics, “YOLOv5 Documentation”, 2020. The “YOLOv5 Documentation” by Ultralytics (2020)
marks a significant milestone in the evolution of the YOLO object detection family, emphasizing
usability, efficiency, and deployment readiness. Although not a peer-reviewed paper, YOLOv5
introduced various practical innovations that directly influenced its adoption and subsequent versions
like YOLOv10, which is used in the project. YOLOv5 featured a lightweight and modular architecture,
enabling faster training and inference compared to its predecessors. The inclusion of techniques such as
automatic anchor-box fitting to the training data (AutoAnchor) further streamlined the model’s
adaptability to diverse datasets and applications. YOLOv5 also enhanced the ease of use with its
PyTorch-based implementation, pre-trained weights, and compatibility with deployment tools like ONNX
and TensorRT. These advancements not only improved accuracy and speed but also simplified real-
world deployment, making YOLOv5 an industry favorite for object detection tasks. The focus on
efficiency and deployment aligns closely with the goals of the project, where real-time inference and
scalability are critical. YOLOv5's contributions laid the groundwork for subsequent iterations like
YOLOv10, which builds on these usability improvements and adds newer techniques such as a
partial self-attention module and NMS-free training.
13. Vaswani, A., et al., “Attention Is All You Need”, Advances in Neural Information Processing Systems
(NeurIPS), 2017: The paper “Attention Is All You Need” by Vaswani et al., published in 2017, introduced
the transformer architecture, revolutionizing deep learning by shifting the focus from convolutional and
recurrent models to self-attention mechanisms. This groundbreaking work proposed a model that relied
entirely on attention mechanisms to process input sequences, eliminating the need for recurrence while
achieving state-of-the-art performance in tasks like machine translation. The core innovation, the self-
attention mechanism, allowed the model to weigh the importance of different parts of the input
dynamically, enabling it to capture long-range dependencies more effectively than previous
architectures. Transformers’ scalability and parallelization capabilities further enhanced computational
efficiency. While the paper is not directly about object detection, its principles have profoundly
influenced modern object detection architectures, including YOLOv10, as used in the project. In
YOLOv10, attention mechanisms derived from transformer-based models are employed to enhance
feature extraction and improve detection accuracy, particularly for objects in complex or cluttered
scenes. This integration of attention into object detection frameworks underscores the transformative
impact of Vaswani et al.'s work on a wide range of deep learning applications, including those
implemented in this project.
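To make the self-attention idea concrete, the following is a toy sketch of scaled dot-product attention as defined by Vaswani et al., written in plain Python. It is only an illustration of the operation softmax(QK^T / sqrt(d_k))V on small lists; it is not the attention module used inside YOLOv10, and the function names are chosen here for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(queries, keys, values):
    """Return one output vector per query: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
            for k in keys
        ]
        weights = softmax(scores)
        # Weighted sum of the value vectors, one output component at a time.
        out = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(out)
    return outputs
```

When all keys are identical, every value receives the same weight and the output is simply the average of the value vectors, which is a convenient sanity check for the weighting step.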
The solution involves integrating YOLOv10 with a Flask-based API, deploying it on an AWS
EC2 instance, and optimizing performance using tools like Gunicorn and CloudFront. Secure
communication is ensured using SSL certificates, and the user interface is designed with HTML,
CSS, and JavaScript for an interactive experience. This project focuses on delivering accuracy,
speed, and user accessibility while addressing challenges such as processing high-resolution
images and managing concurrent user requests.
3.1 Analysis
The project involves developing a real-time object detection system using the YOLOv10 (You
Only Look Once) model, deployed on Amazon Web Services (AWS). YOLOv10 is selected for
its fast inference capabilities, making it ideal for use cases requiring immediate processing of
images or video frames. The system is designed to detect and classify objects in images,
returning both the detected object labels and their locations as bounding boxes, all in real time.
This system will be deployed as a web application to ensure accessibility from a variety of
devices, including desktops and smartphones.
The core objective of this project is to create a scalable and responsive application that can
handle incoming requests efficiently while maintaining high accuracy in object detection.
Scalability is a critical factor, as the system must support multiple concurrent users from
different geographical locations, ensuring low latency and fast response times. By deploying the
application on AWS EC2 instances, the system can draw on cloud computing resources, and capacity
can be adjusted to demand (for example, with an EC2 Auto Scaling group). AWS CloudFront further enhances performance by
distributing content globally, reducing the latency for users located far from the server.
One of the primary challenges is ensuring the system can effectively process high-resolution
images while maintaining real-time performance. YOLOv10 requires significant computational
power for inference, and to meet this demand, the application will be hosted on an EC2 instance,
possibly utilizing GPUs to speed up the processing time. However, as the model is deployed in a
cloud environment, managing computational costs and resource allocation becomes crucial.
Efficient use of AWS resources, combined with optimized code, will be key to providing real-
time object detection even with larger image sizes.
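As a minimal sketch of the "optimized code" side of this trade-off, the helper below computes a reduced size for a large upload while keeping its aspect ratio. The function name `fit_within` and the 640-pixel limit are assumptions made for illustration, not settings taken from the project. Ultralytics models also resize inputs internally, so a step like this mainly saves transfer and decoding time.

```python
def fit_within(width, height, max_side=640):
    """Scale (width, height) so the longer side is at most max_side.

    The aspect ratio is preserved and images that already fit are
    returned unchanged.
    """
    longest = max(width, height)
    if longest <= max_side:
        return width, height
    scale = max_side / longest
    return max(1, round(width * scale)), max(1, round(height * scale))
```

For example, a 4000 x 3000 photograph would be reduced to 640 x 480, while a 320 x 240 thumbnail would be left as it is.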
Another challenge lies in maintaining secure and reliable communication between the front-
end and back-end. Since the application handles image data, which could be sensitive depending
on the use case (e.g., security surveillance), ensuring encrypted data transmission is essential.
This will be achieved by using SSL certificates, which ensure secure HTTPS communication.
Moreover, the backend API will be designed using Flask, and all requests will be routed through
secure, production-ready servers such as Gunicorn.
The potential use cases for this application are diverse. In security surveillance, the system can
automatically detect intruders or monitor restricted areas, providing alerts based on detected
objects. In autonomous vehicles, the application can be used to identify pedestrians, other
vehicles, traffic signs, and obstacles, ensuring safety and compliance. Similarly, in industrial
monitoring, the system can help identify defective products on assembly lines or monitor
equipment status by detecting issues in real time.
Finally, the user experience is paramount. The system needs to provide fast, accurate feedback to
users, presenting the processed image and detection results in a user-friendly interface. The
front-end will be developed using HTML, CSS, and JavaScript, while AJAX will be used for
asynchronous communication between the user interface and the server. This ensures smooth
interaction without the need for full-page reloads. The success of this project will depend on
overcoming challenges related to performance, scalability, security, and efficient user
interaction.
The hardware requirements for the object detection system can be divided into two categories:
server-side (for deployment on AWS EC2) and client-side (for the user interacting with the
application). These hardware requirements ensure that the system runs efficiently and provides
optimal performance for both processing and user interaction.
For the server-side deployment, the hardware specifications need to meet the computational
demands of running the YOLOv10 model in real time, particularly when handling high-
resolution images. The AWS EC2 instance will serve as the backbone of the application, hosting
both the model and the web API that interacts with users.
Processor: The EC2 instance should be equipped with a quad-core processor, such as
an Intel Xeon or an equivalent CPU, to ensure sufficient computing power for processing
multiple requests concurrently. This is crucial for maintaining low-latency responses
when handling multiple user interactions simultaneously.
Memory: A minimum of 8 GB of RAM is recommended. While the YOLOv10 model
itself does not require excessive memory, the additional load from handling concurrent
requests, serving the web interface, and processing images can benefit from ample RAM
to maintain smooth performance and avoid memory-related bottlenecks.
GPU (Optional): While not mandatory, a GPU, such as the NVIDIA T4 or an
equivalent model, is highly recommended for faster inference times. YOLOv10 is a deep
learning model that benefits significantly from GPU acceleration, which can dramatically
reduce the time required to process each image. This is especially important for real-time
applications, where low latency is critical.
Storage: A minimum of 20 GB of SSD storage is required to store the YOLOv10 model
weights, application logs, temporary files, and other system data. An SSD is preferred
over traditional HDD storage to ensure faster read and write speeds, which is essential for
maintaining quick access to the model and efficient data handling.
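Because the GPU is optional, the server code has to work on both kinds of instance. The sketch below shows one way this could be handled, assuming PyTorch is the runtime underneath the model; the helper name `pick_device` is hypothetical.

```python
def pick_device():
    """Return "cuda:0" when PyTorch can see a CUDA GPU, otherwise "cpu"."""
    try:
        # Imported lazily so the check also works where PyTorch is absent.
        import torch
    except ImportError:
        return "cpu"
    return "cuda:0" if torch.cuda.is_available() else "cpu"
```

The returned string can then be passed as the `device` argument when running inference, so the same code runs on a CPU-only instance and on a GPU instance such as one with an NVIDIA T4.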
Client-Side
On the client side, the hardware requirements are focused more on ensuring that the user can
interact with the application through a web interface, without being burdened by performance
constraints.
Device: The client-side device can be a laptop, desktop, or smartphone with sufficient
processing power to handle basic web browsing tasks. The device should have enough
resources to support modern web applications without experiencing significant lag during
image upload or interaction.
Web Browser: The client needs to use a modern web browser (e.g., Chrome, Firefox,
Safari, or Edge) that supports HTML5, JavaScript, and AJAX. These technologies are
necessary to ensure the dynamic interaction between the user interface and the backend
server. AJAX allows for asynchronous communication with the server, enabling seamless
updates to the webpage without the need for full page reloads.
Internet Connection: A stable internet connection is essential to ensure that users can
upload images and receive processed results without interruption. A slower internet
connection may result in longer upload and download times, affecting the overall user
experience, especially when working with high-resolution images.
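The effect of the connection speed can be estimated with simple arithmetic. The sketch below assumes the image is sent as base64 text, as described elsewhere in this report, which makes the payload about one third larger than the raw file. The function names are illustrative, and the estimate ignores latency and protocol overhead.

```python
import math


def base64_size(raw_bytes):
    """Length in bytes of the padded base64 encoding of raw_bytes bytes."""
    return 4 * math.ceil(raw_bytes / 3)


def upload_seconds(raw_bytes, uplink_mbps):
    """Rough transfer time for the base64 payload over the given uplink."""
    bits = base64_size(raw_bytes) * 8
    return bits / (uplink_mbps * 1_000_000)
```

Under these assumptions a 3 MB photograph becomes a 4 MB payload, which takes about 4 seconds on an 8 Mbit/s uplink.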
The software stack for the real-time object detection system using YOLOv10 and deployed on
AWS consists of a combination of operating systems, programming languages, frameworks,
libraries, and cloud services. This stack ensures that the application operates efficiently, is
secure, and delivers high-quality performance for real-time image processing and object
detection. Below is a detailed overview of the software requirements for both the server-side and
client-side of the project.
Operating System:
Server-side: The backend application will run on Ubuntu 20.04 or later. Ubuntu is a
popular Linux distribution that is known for its stability, ease of use, and strong
community support, making it a great choice for server deployments. Ubuntu provides a
reliable environment for installing and running software such as Flask, Gunicorn, and
other dependencies.
Client-side: The client can run on Windows, macOS, or Linux. These operating
systems provide the necessary platforms to access the web interface through modern web
browsers like Chrome, Firefox, Safari, or Edge. The client’s device does not require
specialized software beyond a browser, allowing users to interact with the object
detection system seamlessly.
Programming Languages:
Python 3.8+: The backend of the system will be developed using Python, as it is widely
used in machine learning and web development. Python 3.8 or later will be used for
server-side programming, leveraging libraries such as OpenCV, NumPy, and the
Ultralytics YOLO API for object detection tasks.
HTML, CSS, JavaScript: These web technologies will be used on the frontend to create
an interactive user interface. HTML provides the structure of the webpage, CSS handles
styling, and JavaScript is used to implement dynamic functionalities, such as updating
the page with detection results in real time.
Frameworks and Libraries:
Flask: The Flask web framework will be used to handle API requests on the server-side.
Flask is lightweight and flexible, making it a great choice for developing web
applications with RESTful APIs. It will serve as the backbone of the application,
enabling communication between the frontend (browser) and the backend (server-side
processing).
Ultralytics YOLO API: The Ultralytics YOLO API will be used for object detection.
YOLOv10, implemented through this API, allows for real-time inference on images,
detecting objects and providing bounding box coordinates. The model will be pre-trained
and deployed, ready to process images sent by the client.
OpenCV and NumPy: OpenCV (Open Source Computer Vision Library) is used for
image processing tasks such as decoding the base64-decoded image bytes into pixel arrays, drawing
bounding boxes around detected objects, and encoding the processed images for transmission as base64.
NumPy provides support for handling and manipulating image data as arrays, enabling
efficient processing.
Gunicorn: Gunicorn (Green Unicorn) will serve as the production-ready WSGI server
for hosting the Flask application. It is a high-performance server used to handle
concurrent HTTP requests in a production environment, ensuring the application can
scale effectively under load.
AJAX: AJAX (Asynchronous JavaScript and XML) will be used for frontend
communication with the server. AJAX allows the webpage to send requests to the server
and receive responses without refreshing the page. This is crucial for delivering a smooth,
real-time user experience where image detection results are updated dynamically without
reloading the entire page.
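Gunicorn reads its settings from a plain Python file, so the production server setup described above can be sketched as follows. The file name, the bind address, the worker cap, and the `app:app` module path are assumptions for illustration and are not taken from the project.

```python
# gunicorn.conf.py (hypothetical file, used as: gunicorn -c gunicorn.conf.py app:app)
import multiprocessing

# Assumed: a reverse proxy in front of Gunicorn terminates HTTPS.
bind = "127.0.0.1:8000"

# Gunicorn's documentation suggests (2 x cores) + 1 as a starting point.
# It is capped here because each worker process loads its own model copy.
workers = min(4, multiprocessing.cpu_count() * 2 + 1)

# Seconds a worker may spend on one request before it is restarted.
timeout = 60
```

The cap on `workers` is a design choice worth revisiting: more workers improve concurrency, but each one holds the model weights in memory, which matters on an instance with 8 GB of RAM.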
Cloud Services:
AWS EC2: The application will be hosted on AWS EC2 (Elastic Compute Cloud)
instances. EC2 instances provide scalable computing power, which is crucial for running
the YOLOv10 model efficiently. EC2 enables dynamic scaling, so resources can be
adjusted based on traffic and computational needs, ensuring that the system can handle
varying loads.
AWS CloudFront: CloudFront will be used for content delivery across global regions.
It is a Content Delivery Network (CDN) that caches content at edge locations, reducing
latency and providing faster delivery of images and results to users around the world.
CloudFront will also ensure the application remains responsive and accessible regardless
of user location.
Security Tools:
SSL Certificate: To ensure secure communication between the client and the server, an
SSL certificate will be implemented, enabling HTTPS. HTTPS ensures that all data
transmitted between the client (browser) and server is encrypted, protecting sensitive
information such as images and detection results.
Certbot: Certbot will be used to automate the installation and renewal of SSL
certificates. Certbot is an open-source tool that simplifies the process of obtaining a
trusted SSL certificate from a Certificate Authority (CA), ensuring the application
remains secure without manual intervention.
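Automated renewal is easier to trust when it is monitored. As a hedged sketch using only the Python standard library, the helpers below read the expiry date of the certificate a server presents and convert it into days remaining. The function names are hypothetical and the check is an addition for illustration, not part of the project.

```python
import socket
import ssl
import time


def days_remaining(not_after, now=None):
    """Days between `now` (epoch seconds) and a certificate's notAfter string."""
    expires = ssl.cert_time_to_seconds(not_after)
    now = time.time() if now is None else now
    return (expires - now) / 86400


def fetch_not_after(host, port=443, timeout=5.0):
    """Open a TLS connection and return the server certificate's notAfter field."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]
```

A scheduled job could call `days_remaining(fetch_not_after(host))` for the application's domain and raise an alert if the value falls below a chosen threshold, which would indicate that renewal has stopped working.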
The system architecture for the real-time object detection application using YOLOv10 deployed
on AWS is designed for scalability, efficiency, and low-latency processing. The architecture is
composed of key components that interact with each other to enable seamless object detection
from the user’s device to the backend, providing quick and accurate results. Below is a detailed
description of each component in the system architecture.
1. User (Frontend)
The user is the entry point of the system. Through a modern web browser on their laptop,
desktop, or smartphone, the user captures an image or uploads a stream of images to the
application. The frontend captures image data, which is then sent to the backend for processing.
This interaction is handled by AJAX, allowing for seamless communication without the need to
refresh the entire page. The frontend displays the processed image along with the detected
objects in real time.
Key Responsibilities:
Display results, including detected objects and bounding boxes, on the user interface.
2. Flask Server (Backend)
The Flask server is the backbone of the application, handling incoming HTTP requests from the
frontend. It receives the image data, processes it by passing it to the YOLOv10 model, and
returns the results to the user. Flask serves as the API layer, ensuring communication between
the user’s browser and the object detection model. It also coordinates various aspects of the
system, such as error handling, data formatting, and ensuring that the correct response is sent to
the client.
Key Responsibilities:
Return processed images and detection results (e.g., bounding boxes, class labels) to the
frontend.
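The API layer described above could be sketched as follows. This is a minimal illustration, not the project's actual code: the route name `/detect`, the JSON key `image`, and the `detect` callback (a function that maps image bytes to a JSON-serializable result) are all assumptions.

```python
import base64
import binascii


def decode_payload(data):
    """Return raw image bytes from a base64 string, tolerating a data-URL prefix."""
    if not isinstance(data, str) or not data:
        raise ValueError("missing image data")
    # Drops a leading "data:image/jpeg;base64," if the browser sent one.
    _, _, encoded = data.rpartition(",")
    try:
        return base64.b64decode(encoded, validate=True)
    except binascii.Error as exc:
        raise ValueError("image is not valid base64") from exc


def create_app(detect):
    """Build the Flask app; `detect` maps image bytes to a JSON-serializable dict."""
    # Lazy import keeps decode_payload usable without Flask installed.
    from flask import Flask, jsonify, request

    app = Flask(__name__)

    @app.post("/detect")  # hypothetical route name
    def detect_route():
        body = request.get_json(silent=True)
        if not isinstance(body, dict):
            body = {}
        try:
            image_bytes = decode_payload(body.get("image"))
        except ValueError as exc:
            # Malformed input is reported as a client error, not a crash.
            return jsonify(error=str(exc)), 400
        return jsonify(detect(image_bytes))

    return app
```

Passing the detector in as a parameter keeps the request handling separate from the model, so the error handling and data formatting can be tested with a stub before the model is attached.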
3. YOLOv10 Model
The YOLOv10 model performs the core function of object detection. It analyzes the image sent
by the frontend, runs inference to detect objects, and provides outputs such as bounding box
coordinates, class labels, and confidence scores for each detected object. YOLOv10 (You Only
Look Once) is a state-of-the-art deep learning model designed for real-time object detection,
which enables fast processing and high accuracy in identifying multiple objects in a single
image.
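A minimal sketch of this inference step with the Ultralytics Python package is shown below. The weights file name `yolov10n.pt`, the 0.25 confidence cut-off, and the helper names are assumptions for illustration; the project may use a different model size or threshold.

```python
def load_model(weights="yolov10n.pt"):
    """Load a pretrained detector once at startup (weights name is an assumption)."""
    # Lazy import: only needed where the ultralytics package is installed.
    from ultralytics import YOLO
    return YOLO(weights)


def to_records(xyxy, class_ids, confidences, names, min_conf=0.25):
    """Pair up parallel detection lists into dicts, dropping low-confidence entries."""
    records = []
    for box, cls_id, conf in zip(xyxy, class_ids, confidences):
        if conf < min_conf:
            continue
        records.append({
            "label": names[int(cls_id)],
            "confidence": round(float(conf), 3),
            "box": [round(float(v), 1) for v in box],  # x1, y1, x2, y2 in pixels
        })
    return records


def detect(model, image):
    """Run inference on one image and return a list of detection records."""
    result = model(image)[0]  # one Results object per input image
    boxes = result.boxes
    return to_records(
        boxes.xyxy.tolist(), boxes.cls.tolist(), boxes.conf.tolist(), result.names
    )
```

Loading the model once and reusing it for every request avoids paying the start-up cost on each image, which matters for the real-time goal stated above.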
Key Responsibilities:
4. AWS EC2
AWS EC2 (Elastic Compute Cloud) serves as the hosting platform for the Flask application. The
EC2 instance runs the Flask server, processes the incoming requests, and manages interactions
with the YOLOv10 model. EC2 instances provide scalable and flexible compute resources,
ensuring that the application can scale based on the volume of incoming requests. It also hosts
the environment necessary to run the machine learning model, ensuring reliable and consistent
performance.
Key Responsibilities:
5. CloudFront CDN
AWS CloudFront, a Content Delivery Network (CDN), improves the speed and performance of
the application by distributing content globally through edge locations. It caches frequently
accessed content, reducing latency for users by serving data from the closest geographical
region. CloudFront ensures that users, regardless of their location, experience minimal delay
when interacting with the application, making it essential for real-time object detection
applications.
Key Responsibilities:
6. Database/Storage (Optional)
Although not a mandatory part of the system, a database or storage service may be used for
logging and storing images or results. For instance, AWS S3 could be utilized to store large
image files or detection logs for future analysis or review. This component is optional but useful
for monitoring system performance, tracking user interactions, or storing historical detection
results.
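If S3 is used for detection logs, the write could look like the sketch below. It assumes the boto3 SDK and valid AWS credentials on the instance; the bucket name, the `detections/` prefix, the date-based key layout, and the function names are hypothetical.

```python
import json
from datetime import datetime, timezone


def make_log_key(request_id, when=None, prefix="detections"):
    """Build a date-partitioned object key, e.g. detections/2025/01/31/<id>.json."""
    when = when or datetime.now(timezone.utc)
    return f"{prefix}/{when:%Y/%m/%d}/{request_id}.json"


def store_detection_log(bucket, request_id, records):
    """Upload one detection result as JSON; needs boto3 and AWS credentials."""
    # Lazy import so key building can be used without the AWS SDK.
    import boto3

    key = make_log_key(request_id)
    boto3.client("s3").put_object(
        Bucket=bucket,
        Key=key,
        Body=json.dumps(records).encode("utf-8"),
        ContentType="application/json",
    )
    return key
```

Grouping keys by date keeps later review simple, since all results for one day can be listed under a single prefix.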
Key Responsibilities:
Data Flow Diagram (DFD) for Real-Time Object Detection System Using YOLOv10
The Data Flow Diagram provides a more detailed view of the steps involved in the real-time
object detection process, from image upload to detection results being returned to the user. It
consists of several key interactions between the components of the system, as follows:
1. User uploads image (frontend): The process starts with the user uploading an image
through the frontend interface. This is done by selecting or dragging an image file into
the web interface. The frontend is typically a web page running on a browser, allowing
the user to interact with the system. The image is prepared for transmission to the server.
2. AJAX sends image data to Flask API: Once the image is selected, the frontend uses
AJAX (Asynchronous JavaScript and XML) to send the image data to the Flask API
running on the backend server. AJAX is used to ensure the page does not reload during
the image transmission, providing a seamless and real-time user experience. The image is
sent as base64 encoded data to the backend.
3. Flask processes image and runs YOLO inference: Upon receiving the image, the Flask
API processes the data and prepares it for object detection. The server decodes the base64
image data into a format that can be processed by the YOLOv10 model. The server then
uses the YOLOv10 model, which is a pre-trained deep learning object detection model, to
run inference on the image. The YOLO model performs object detection by identifying
objects in the image and drawing bounding boxes around them.
4. YOLO model detects objects and annotates the image: The YOLOv10 model analyzes
the image and detects various objects based on its training. Each detected object is
associated with a class label (e.g., car, person, dog) and a confidence score. The model
annotates the image by drawing bounding boxes around each detected object, as well as
labeling the object with its class name and confidence score. This allows for visual
feedback that the detection is happening correctly.
5. Flask sends the processed image and detection details to the user: Once the YOLO
model has completed the object detection and image annotation, the Flask API sends the
processed image back to the frontend. Along with the image, detection details such as the
bounding boxes, object classes, and confidence scores are sent to the frontend. This data
is typically formatted in JSON and is used by the frontend to display the results in real-
time. The image with the annotations is shown to the user, allowing them to view the
detected objects.
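Steps 3 to 5 above involve three image operations on the server: decoding the base64 payload, drawing the boxes, and re-encoding the result. The sketch below shows one way to do this with OpenCV and NumPy. The record layout (a dict with `label`, `confidence`, and `box` keys), the colour, and the function names are assumptions for illustration; the Ultralytics results object also offers a `plot()` method that returns an annotated image, which may be the simpler route.

```python
import base64


def caption(rec):
    """Text drawn above a box, e.g. 'person 0.91'."""
    return f'{rec["label"]} {rec["confidence"]:.2f}'


def annotate_base64_image(image_b64, records):
    """Decode a base64 image, draw labelled boxes, and return it as base64 JPEG."""
    # Lazy imports: OpenCV and NumPy are only needed when this is called.
    import cv2
    import numpy as np

    raw = base64.b64decode(image_b64)
    image = cv2.imdecode(np.frombuffer(raw, dtype=np.uint8), cv2.IMREAD_COLOR)
    if image is None:
        raise ValueError("payload is not a decodable image")

    green = (0, 255, 0)  # BGR colour used for boxes and text
    for rec in records:
        x1, y1, x2, y2 = (int(v) for v in rec["box"])
        cv2.rectangle(image, (x1, y1), (x2, y2), green, 2)
        # Keep the label inside the frame when the box touches the top edge.
        cv2.putText(image, caption(rec), (x1, max(y1 - 5, 12)),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.5, green, 1)

    ok, jpeg = cv2.imencode(".jpg", image)
    if not ok:
        raise RuntimeError("JPEG encoding failed")
    return base64.b64encode(jpeg.tobytes()).decode("ascii")
```

The returned string can be placed directly in the JSON response, and the frontend can show it by prefixing it with a JPEG data-URL header.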
A Use Case Diagram is a useful way to visually represent the interactions between system
components and users, showing how the system performs tasks from the user’s perspective. In
the context of a real-time object detection system using YOLOv10, the Use Case Diagram
defines the roles of various actors and the processes they interact with within the system. The
diagram primarily focuses on the user and the system’s functionality.
Actors:
1. User: The user is the primary actor in this system. They interact with the frontend
interface of the application. The user can upload images and view the results of object
detection. Their main responsibility is to provide input (an image) and receive output (the
processed image with detected objects).
2. System: The system, typically the backend portion of the application, includes the Flask
API, YOLOv10 model, and the server hosting the application. The system is responsible
for processing the image, running YOLOv10 for object detection, and sending back the
results to the user.
Use Cases:
1. Upload Image:
o Description: This use case represents the user's action of uploading an image to
the system. It is initiated by the user through the frontend interface, where they
select an image file. The image is sent to the backend using AJAX, ensuring that
the page does not reload, which facilitates a smooth and uninterrupted user
experience.
o Flow: The user selects the image, and it is transmitted as base64-encoded data to
the server, where it is prepared for further processing.
2. Run Object Detection:
o Description: After the image is uploaded, the backend (Flask server) receives the
image and initiates the object detection process using the YOLOv10 model. The
model performs inference on the uploaded image, identifying objects within it and
drawing bounding boxes around the detected items. The YOLO model returns the
detection results, including the class labels and confidence scores for each
detected object.
o Flow: The system processes the image and runs YOLOv10 inference, then
annotates the image with detection boxes. This process is crucial to transforming
the raw image into a meaningful output.
3. Display Detection Results:
o Description: This use case represents the final step of the process, where the
system sends the processed image back to the frontend along with the detection
details. The user can view the image with annotated bounding boxes, class labels,
and confidence scores. The system also sends any additional metadata related to
the detection for display purposes.
o Flow: The processed image, along with detection details (e.g., class names and
bounding box coordinates), is sent back to the user. The frontend interface
displays these results in real-time, allowing the user to view the annotated image
with all detected objects.
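The three use cases can also be exercised without a browser, which is useful for testing. The sketch below is a small standard-library client that uploads an image and returns the server's reply. The endpoint URL, the JSON key `image`, and the function names are assumptions for illustration and must match whatever the backend actually expects.

```python
import base64
import json
import urllib.request


def build_upload_body(image_bytes):
    """JSON body carrying the image as base64 text under the key "image"."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return json.dumps({"image": encoded}).encode("utf-8")


def request_detection(url, image_path, timeout=30):
    """POST one image file to the detection endpoint and return the parsed reply."""
    with open(image_path, "rb") as fh:
        body = build_upload_body(fh.read())
    req = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

A call such as `request_detection("https://example.com/detect", "street.jpg")`, with a placeholder URL and file name, would cover Upload Image, Run Object Detection, and Display Detection Results in one round trip.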
Relationships:
The User interacts with the Upload Image use case to provide the image input. The
System is responsible for processing this input, running the YOLOv10 model for object
detection, and displaying the detection results back to the user.
The System links the Run Object Detection and Display Detection Results use cases to
ensure that object detection and result display occur smoothly and efficiently.
Diagram Components:
Actor (User): Interacts with the system by uploading images and viewing results.
Use Case (Upload Image): Represents the process where the user uploads an image to
the system.
Use Case (Run Object Detection): The system processes the uploaded image and
performs object detection using YOLOv10.
Use Case (Display Detection Results): The system returns the processed image and
detection results to the user.
System (Backend): The backend system processes the user’s input (image) and returns
detection outputs.
A sequence diagram outlines the flow of interactions between the user, system components, and
the YOLOv10 model in a step-by-step sequence. It captures the dynamic behavior of the system,
detailing the order in which components interact to achieve object detection and result
delivery.
1. Actor (User):
o The user is the initiator of the object detection process. Their primary role is to
upload an image and view the results after detection.
o They interact with the frontend interface to initiate the sequence.
2. Frontend:
o The user interface captures the image uploaded by the user.
o It encodes the image into base64 format and sends the data asynchronously to the
backend using AJAX.
o Once the processed image and detection results are received from the backend,
the frontend decodes and displays them to the user.
3. Backend (Flask):
o Acts as the intermediary between the frontend and the YOLOv10 model.
o Receives the encoded image data from the frontend, decodes it into a usable
format, and passes it to the YOLO model for inference.
o Processes the detection results, annotates the image with bounding boxes, and
encodes the processed image for return to the frontend.
4. YOLOv10 Model:
o The core component that performs object detection.
o Processes the image provided by the backend, identifies objects, and returns
detection outputs such as class labels, confidence scores, and bounding box
coordinates.
Sequence of Interactions: