R-CNN - Region-Based Convolutional Neural Networks
Last Updated: 12 Jul, 2025
Traditional Convolutional Neural Networks (CNNs) with fully connected layers often struggle with object detection, especially when an image contains multiple objects of varying sizes and positions. A brute-force approach, such as sliding a window across the image at many positions and scales, is computationally expensive and scales poorly as the number and variety of objects increase.
To overcome these challenges, R-CNN (Regions with CNN features) was introduced. R-CNN presents a smarter approach by using a selective search algorithm to generate around 2,000 region proposals from an image. These proposals are likely to contain objects and are individually processed to detect and localize them more efficiently. R-CNN marked a significant advancement in the field of object detection and laid the foundation for faster and more accurate object detection models.
R-CNN Working
- Input Image: Start with a single input image containing one or more objects.
- Region Proposal Generation: Use Selective Search to generate around 2,000 region proposals (potential object locations).
- Warp & Feature Extraction: Each proposed region is cropped and resized (warped) to a fixed size, then passed through a CNN to extract a feature vector.
- Region Classification: Use the extracted features to classify each region using SVMs into object categories (e.g. person, car) or background.
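The four steps above can be sketched as a single loop. This is only an illustrative skeleton: the function name `rcnn_detect`, the `(x1, y1, x2, y2)` box convention, and the toy stand-in functions are assumptions made for this example, not part of any R-CNN implementation.

```python
from typing import Callable, List, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2); an assumed convention


def rcnn_detect(
    image,
    propose: Callable[[object], Sequence[Box]],
    warp: Callable[[object, Box], object],
    extract: Callable[[object], Sequence[float]],
    classify: Callable[[Sequence[float]], Tuple[str, float]],
) -> List[Tuple[Box, str, float]]:
    """Run the proposal -> warp -> features -> classify loop for one image."""
    detections = []
    for box in propose(image):            # step 2: region proposals
        patch = warp(image, box)          # step 3a: crop and resize
        features = extract(patch)         # step 3b: CNN feature vector
        label, score = classify(features)  # step 4: per-region classification
        if label != "background":
            detections.append((box, label, score))
    return detections


# Toy stand-ins so the skeleton runs end to end; none of these are real models.
def toy_propose(img):
    return [(0, 0, 2, 2), (2, 0, 4, 2)]


def toy_warp(img, box):
    x1, y1, x2, y2 = box
    return [row[x1:x2] for row in img[y1:y2]]


def toy_extract(patch):
    values = [v for row in patch for v in row]
    return [sum(values) / len(values)]  # a 1-d "feature": mean intensity


def toy_classify(features):
    return ("object", 0.9) if features[0] > 5 else ("background", 0.1)


toy_image = [[0, 0, 9, 9], [0, 0, 9, 9]]
detections = rcnn_detect(toy_image, toy_propose, toy_warp, toy_extract, toy_classify)
print(detections)
```

In a real pipeline each stand-in would be replaced by Selective Search, the warping step, the CNN, and the per-class SVMs respectively.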
Key Features of R-CNNs
1. Region Proposals
R-CNNs begin by generating region proposals, which are smaller sections of the image that may contain the objects we are searching for. The algorithm employs a method called selective search, a greedy approach that generates approximately 2,000 region proposals per image. Selective search effectively balances the number of proposals while maintaining high object recall, ensuring efficient object detection.
By limiting the number of regions for detailed analysis, this method enhances the overall performance of the R-CNN in detecting objects within images.
2. Selective Search
Selective Search is a greedy algorithm that generates region proposals by combining smaller segmented regions. It takes an image as input and produces region proposals that are crucial for object detection. This method offers significant advantages over random proposal generation by limiting the number of proposals to approximately 2,000 while ensuring high object recall.
Algorithm Steps:
- Generate Initial Segmentation: The algorithm starts by performing an initial sub-segmentation of the input image.
- Combine Similar Regions: It then repeatedly merges the most similar neighboring regions into larger ones. Similarity is evaluated based on factors such as color, texture, and region size.
- Generate Region Proposals: Finally, these larger bounding boxes are used to create region proposals for object detection.
The selective search algorithm provides an efficient way to identify potential object regions, enhancing the overall effectiveness of the detection process.
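The greedy merging idea can be illustrated with a heavily simplified sketch. It assumes each region is summarized by a bounding box, a single mean color in [0, 1], and a pixel count, and it scores pairs with a made-up similarity (color closeness plus a size term). Real Selective Search also uses texture and fill measures and only merges neighboring regions; those parts are omitted here.

```python
from itertools import combinations


def merge_boxes(a, b):
    """Smallest box (x1, y1, x2, y2) that encloses both input boxes."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def similarity(r1, r2, image_area):
    """Toy similarity: close colors and small combined size score higher."""
    color_sim = 1.0 - abs(r1["color"] - r2["color"])
    size_sim = 1.0 - (r1["size"] + r2["size"]) / image_area
    return color_sim + size_sim


def greedy_merge(regions, image_area):
    """Merge the most similar pair until one region remains.

    Every initial region and every merged region contributes one proposal box.
    """
    regions = [dict(r) for r in regions]
    proposals = [r["box"] for r in regions]
    while len(regions) > 1:
        i, j = max(
            combinations(range(len(regions)), 2),
            key=lambda p: similarity(regions[p[0]], regions[p[1]], image_area),
        )
        a, b = regions[i], regions[j]
        size = a["size"] + b["size"]
        merged = {
            "box": merge_boxes(a["box"], b["box"]),
            "color": (a["color"] * a["size"] + b["color"] * b["size"]) / size,
            "size": size,
        }
        regions = [r for k, r in enumerate(regions) if k not in (i, j)] + [merged]
        proposals.append(merged["box"])
    return proposals


# Hypothetical initial segmentation of a 4x4 image into three regions.
toy_regions = [
    {"box": (0, 0, 2, 2), "color": 0.10, "size": 4},
    {"box": (2, 0, 4, 2), "color": 0.15, "size": 4},
    {"box": (0, 2, 4, 4), "color": 0.90, "size": 8},
]
proposals = greedy_merge(toy_regions, image_area=16)
print(proposals)
```

The two similarly colored top regions merge first, and the resulting box becomes an additional proposal before everything is merged into the full image.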
3. CNN Feature Extraction
After the region proposals are generated, each one is warped to a fixed square size to match the input dimensions the CNN requires. R-CNN uses a pre-trained AlexNet model, which was considered the state-of-the-art CNN for image classification at the time.
The input size for AlexNet is (227, 227, 3), meaning each input image must be resized to these dimensions. Consequently, whether the region proposals are small or large, they need to be adjusted accordingly to fit the specified input size.
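A minimal sketch of the warping step follows. It assumes the image is a nested list of pixel values indexed as `image[row][col]` and uses nearest-neighbor sampling for simplicity; the function name `warp_region` and the box convention are illustrative, and a real pipeline would normally use an image library with proper interpolation.

```python
def warp_region(image, box, out_size=227):
    """Crop box = (x1, y1, x2, y2) and stretch it to out_size x out_size.

    The aspect ratio is not preserved: a wide or tall region is squeezed
    into the square input the CNN expects.
    """
    x1, y1, x2, y2 = box
    crop_h, crop_w = y2 - y1, x2 - x1
    warped = []
    for r in range(out_size):
        src_r = y1 + (r * crop_h) // out_size  # nearest source row
        row = []
        for c in range(out_size):
            src_c = x1 + (c * crop_w) // out_size  # nearest source column
            row.append(image[src_r][src_c])
        warped.append(row)
    return warped


# Toy 4x6 "image" whose pixel value encodes its position (row * 10 + col).
toy_image = [[r * 10 + c for c in range(6)] for r in range(4)]
patch = warp_region(toy_image, box=(1, 0, 4, 2))
print(len(patch), len(patch[0]))
```

Whatever the size of the proposal, the output always has the same 227 x 227 shape, which is the point of the warping step.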
We remove the final classification layer of AlexNet and take the output of the last 4096-unit fully connected layer, which gives a (1, 4096) feature vector for each region. This feature vector is then fed into both the Support Vector Machine (SVM) for classification and the bounding box regressor for improved localization.
4. SVM (Support Vector Machine)
The feature vector generated by the CNN is then utilized by a binary Support Vector Machine (SVM), which is trained independently for each class. This SVM model takes the feature vector produced by the previous CNN architecture and outputs a confidence score indicating the likelihood of an object being present in that region.
However, this creates a dependency during training: each SVM is trained on the feature vectors produced by AlexNet, so the CNN has to be trained first. As a result, AlexNet and the SVMs cannot be trained in parallel; the stages must run one after another.
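The per-class scoring can be sketched as follows, assuming linear SVMs so that each class score is a dot product plus a bias. The weights, biases, and the 3-dimensional feature below are made-up toy values (real features would have 4096 dimensions), and the rule "background if no score is positive" is an assumption for this example.

```python
def svm_scores(feature, classifiers):
    """Score one feature vector with one linear SVM (weights, bias) per class."""
    return {
        name: sum(w * x for w, x in zip(weights, feature)) + bias
        for name, (weights, bias) in classifiers.items()
    }


def predict_label(scores):
    """Pick the best-scoring class, or background if no SVM fires."""
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "background"


# Hypothetical trained parameters: class name -> (weights, bias).
toy_classifiers = {
    "person": ([1.0, 0.0, -1.0], -0.5),
    "car": ([0.0, 2.0, 0.0], -1.0),
}
toy_feature = [2.0, 0.25, 0.5]

scores = svm_scores(toy_feature, toy_classifiers)
print(scores, predict_label(scores))
```

Each SVM only answers "is this my class or not", which is why one binary classifier is needed per object category.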
5. Bounding Box Regressor
To accurately locate the bounding box within the image, we utilize a scale-invariant linear regression model known as the bounding box regressor. For training this model, we use pairs of predicted and ground truth values for four dimensions of localization: (x, y, w, h). Here, x and y represent the pixel coordinates of the center of the bounding box, while w and h indicate the width and height of the bounding boxes, respectively.
This method improves the Mean Average Precision (mAP) of the results by about 3 to 4 percentage points.
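One way to see why the regressor is scale-invariant is to look at the targets it learns. The sketch below uses the parameterization from the R-CNN paper, where center offsets are divided by the proposal size and width and height are compared in log space. The function names are illustrative, and the learned linear model itself is omitted: only the target encoding and its inverse are shown.

```python
import math


def regression_targets(proposal, ground_truth):
    """Targets (tx, ty, tw, th) for boxes given as (center x, center y, w, h)."""
    px, py, pw, ph = proposal
    gx, gy, gw, gh = ground_truth
    return (
        (gx - px) / pw,      # horizontal shift, in units of proposal width
        (gy - py) / ph,      # vertical shift, in units of proposal height
        math.log(gw / pw),   # log-scale change in width
        math.log(gh / ph),   # log-scale change in height
    )


def apply_deltas(proposal, deltas):
    """Invert the encoding: turn predicted deltas back into a box."""
    px, py, pw, ph = proposal
    tx, ty, tw, th = deltas
    return (px + pw * tx, py + ph * ty, pw * math.exp(tw), ph * math.exp(th))


proposal = (50.0, 50.0, 20.0, 40.0)
ground_truth = (55.0, 60.0, 40.0, 40.0)
targets = regression_targets(proposal, ground_truth)
refined = apply_deltas(proposal, targets)
print(targets, refined)
```

Because the offsets are relative to the proposal's own width and height, the same targets describe the same correction for a small box and for a large one.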

To further optimize detection, R-CNNs apply Non-Maximum Suppression (NMS):
- Remove proposals with confidence scores below a threshold (e.g., 0.5).
- Select the highest-probability region among candidates for each object.
- Discard overlapping regions with an IoU (Intersection over Union) above 0.5 to eliminate duplicate detections, where IoU is defined as:
\text{IoU} = \frac{\text{Area of Overlap}}{\text{Area of Union}}
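The steps above can be sketched in a few lines of Python. This is a minimal version for a single class, assuming boxes are given as `(x1, y1, x2, y2)` corner coordinates; in practice NMS is usually run separately for each class, and the default thresholds here simply mirror the example values of 0.5 mentioned above.

```python
def iou(a, b):
    """Intersection over Union of two boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, score_thresh=0.5, iou_thresh=0.5):
    """Return indices of the boxes kept after non-maximum suppression."""
    # 1. Drop low-confidence boxes, then visit the rest from best to worst.
    order = sorted(
        (i for i, s in enumerate(scores) if s >= score_thresh),
        key=lambda i: scores[i],
        reverse=True,
    )
    keep = []
    for i in order:
        # 2-3. Keep a box only if it does not overlap a better box too much.
        if all(iou(boxes[i], boxes[k]) <= iou_thresh for k in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30), (0, 0, 5, 5)]
scores = [0.9, 0.8, 0.7, 0.3]
kept = nms(boxes, scores)
print(kept)
```

Here the second box overlaps the first with an IoU of 81/119 (about 0.68) and is suppressed, while the fourth box is removed earlier by the score threshold.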
By combining region proposals, selective search, CNN-based feature extraction, SVM classification, and bounding box refinement, R-CNN achieves high accuracy in object detection, making it suitable for various applications.
After that, we can obtain output by plotting these bounding boxes on the input image and labeling objects that are present in bounding boxes.
Results of R-CNN Model
R-CNN achieves a Mean Average Precision (mAP) of 53.7% on the VOC 2010 dataset. On the 200-class ILSVRC 2013 object detection dataset, it achieves an mAP of 31.4%, a large improvement over the previous best of 24.3%. However, the architecture is very slow to train and takes about 49 seconds to produce test results for a single image from the VOC 2007 dataset.
Evolution of R-CNN: Fast R-CNN and Mask R-CNN
Following the introduction of R-CNN, several variations emerged to address its limitations:
1. Fast R-CNN
Fast R-CNN optimizes the R-CNN architecture by sharing computations across proposals. Key improvements include:
- Single-Pass Feature Extraction: Instead of running the CNN on each region proposal independently, Fast R-CNN passes the entire image through the CNN once to generate a shared feature map. Features for each region proposal are then pooled from this map with an RoI pooling layer.
- Softmax Classifier: Fast R-CNN replaces the SVM with a softmax classifier, allowing for end-to-end training of the network.
- Improved Bounding Box Regression: Fast R-CNN enhances the bounding box regression process, leading to better localization accuracy.
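In Fast R-CNN, per-proposal features are read from the shared feature map by a step called RoI pooling, which max-pools each region into a fixed-size grid. The sketch below is a simplified single-channel version: it assumes the region is already expressed in feature-map cells with exclusive end coordinates and covers at least one cell in each direction, and the function names are illustrative.

```python
def ceil_div(a, b):
    """Integer ceiling division for non-negative a and positive b."""
    return -(-a // b)


def roi_max_pool(feature_map, roi, out_size=2):
    """Max-pool the cells inside roi = (x1, y1, x2, y2) into out_size x out_size."""
    x1, y1, x2, y2 = roi
    h, w = y2 - y1, x2 - x1
    pooled = []
    for i in range(out_size):
        # Bin i covers rows [floor(i*h/n), ceil((i+1)*h/n)), so it is never empty.
        r_start = y1 + (i * h) // out_size
        r_end = y1 + ceil_div((i + 1) * h, out_size)
        row = []
        for j in range(out_size):
            c_start = x1 + (j * w) // out_size
            c_end = x1 + ceil_div((j + 1) * w, out_size)
            row.append(
                max(
                    feature_map[r][c]
                    for r in range(r_start, r_end)
                    for c in range(c_start, c_end)
                )
            )
        pooled.append(row)
    return pooled


toy_map = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
print(roi_max_pool(toy_map, (0, 0, 4, 4)))  # [[6, 8], [14, 16]]
print(roi_max_pool(toy_map, (1, 1, 4, 4)))  # [[11, 12], [15, 16]]
```

Regions of different sizes produce outputs of the same shape, so the CNN only has to run once per image instead of once per proposal.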
2. Faster R-CNN
Faster R-CNN further advances the R-CNN framework by incorporating a Region Proposal Network (RPN). Key features include:
- Region Proposal Network: The RPN generates high-quality region proposals directly from the feature maps produced by the CNN, eliminating the need for selective search.
- Shared Convolutional Features: Both the RPN and the detection network share the convolutional features, significantly reducing computation time.
- Improved Speed: Faster R-CNN achieves near real-time processing speeds of around 0.1 seconds per image while maintaining high detection accuracy.
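The RPN makes its predictions relative to a fixed set of reference boxes, called anchors, placed at every feature-map location. The sketch below generates the anchors for a single location, assuming the three scales (128, 256, 512 pixels) and three aspect ratios (1:2, 1:1, 2:1) described in the Faster R-CNN paper. The function name and the `(x1, y1, x2, y2)` output format are choices made for this example.

```python
import math


def make_anchors(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Anchor boxes (x1, y1, x2, y2) centered at (cx, cy).

    Each ratio is height / width, and every anchor keeps an area of
    roughly scale * scale.
    """
    anchors = []
    for scale in scales:
        for ratio in ratios:
            w = scale / math.sqrt(ratio)
            h = scale * math.sqrt(ratio)
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors


anchors = make_anchors(300, 300)
print(len(anchors))
print(anchors[1])  # the square anchor at the smallest scale
```

The RPN then predicts, for every anchor, an objectness score and a box correction, which replaces the hand-crafted selective search step.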
3. Mask R-CNN
Building upon Faster R-CNN, Mask R-CNN was introduced to extend the model to perform instance segmentation. Key features include:
- Segmentation Masks: In addition to bounding boxes, Mask R-CNN predicts a segmentation mask for each detected object, providing pixel-level accuracy.
- Feature Pyramid Networks (FPN): Mask R-CNN incorporates FPNs to improve performance on objects at different scales, enhancing detection accuracy for small objects.
- RoIAlign: This technique replaces RoIPooling to address misalignment issues, ensuring better feature extraction for each region of interest.
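The core of RoIAlign is reading the feature map at fractional coordinates with bilinear interpolation instead of rounding to the nearest cell. The sketch below shows only that sampling step on a single-channel map, assuming the coordinates lie inside the map. A full RoIAlign layer samples several such points per output bin and aggregates them, which is omitted here.

```python
import math


def bilinear_sample(feature_map, x, y):
    """Interpolate feature_map[row][col] at a fractional position (x, y)."""
    height, width = len(feature_map), len(feature_map[0])
    x0 = min(max(int(math.floor(x)), 0), width - 1)
    y0 = min(max(int(math.floor(y)), 0), height - 1)
    x1 = min(x0 + 1, width - 1)
    y1 = min(y0 + 1, height - 1)
    dx, dy = x - x0, y - y0
    # Blend horizontally along the top and bottom rows, then vertically.
    top = feature_map[y0][x0] * (1 - dx) + feature_map[y0][x1] * dx
    bottom = feature_map[y1][x0] * (1 - dx) + feature_map[y1][x1] * dx
    return top * (1 - dy) + bottom * dy


toy_map = [[0.0, 10.0], [20.0, 30.0]]
print(bilinear_sample(toy_map, 0.5, 0.5))   # 15.0
print(bilinear_sample(toy_map, 0.25, 0.0))  # 2.5
```

Rounding the second position to the nearest cell would return 0.0 instead of 2.5; small errors of this kind are what RoIAlign avoids, and they matter most for pixel-level masks.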
4. Cascade R-CNN
Cascade R-CNN implements a multi-stage object detection framework to improve detection performance. Key aspects include:
- Multi-Stage Detection: Cascade R-CNN employs a series of detectors operating at different stages, progressively refining the proposals and improving localization accuracy.
- Improved Recall and Precision: By addressing the trade-off between recall and precision at each stage, the model enhances overall detection performance, especially on challenging datasets.
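The staged refinement can be illustrated with a toy example. It assumes the increasing IoU thresholds of 0.5, 0.6, and 0.7 used in the Cascade R-CNN paper, and it replaces each stage's learned regressor with a made-up `toy_refine` function that simply moves a box halfway toward the ground truth. Only the control flow is meant to be representative.

```python
def iou(a, b):
    """Intersection over Union of two boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def toy_refine(box, target, step=0.5):
    """Stand-in for a learned regressor: move each coordinate part-way to target."""
    return tuple(b + step * (t - b) for b, t in zip(box, target))


def cascade(box, target, thresholds=(0.5, 0.6, 0.7)):
    """Record (threshold, IoU, passes) per stage, refining the box in between."""
    history = []
    for threshold in thresholds:
        quality = iou(box, target)
        history.append((threshold, round(quality, 3), quality >= threshold))
        box = toy_refine(box, target)  # the next stage sees the refined box
    return history


history = cascade(box=(2, 0, 12, 10), target=(0, 0, 10, 10))
for stage in history:
    print(stage)
```

Because each stage receives better-localized boxes than the one before, it can be trained and evaluated against a stricter overlap threshold without running out of positive examples.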
Applications of R-CNN
- Autonomous Vehicles: R-CNN can detect and classify various objects on the road, such as pedestrians, other vehicles, and traffic signs, contributing to safer navigation.
- Surveillance Systems: In security applications, R-CNN-based detectors can help identify suspicious activities by detecting and classifying individuals and objects in video frames; real-time use typically relies on the faster variants described above.
- Medical Imaging: R-CNN is used in medical applications to identify anomalies in medical scans, assisting in early diagnosis and treatment.
- Augmented Reality: R-CNN can enable object recognition in augmented reality applications, enhancing user experiences by overlaying digital information on the real world.
Challenges of R-CNN
R-CNN faces several challenges in its implementation:
- Rigid Selective Search Algorithm: The selective search algorithm is inflexible and does not involve any learning. This rigidity can result in poor region proposal generation for object detection.
- Time-Consuming Training: With approximately 2,000 candidate proposals, training the network becomes time-intensive. Additionally, multiple components need to be trained separately, including the CNN architecture, SVM model, and bounding box regressor. This multi-step training process slows down implementation.
- Inefficiency for Real-Time Applications: R-CNN is not suitable for real-time applications, as it takes around 50 seconds to process a single image at test time.
- Increased Storage Requirements: Storing the extracted features for all region proposals significantly increases the disk space needed during the training phase.
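A rough back-of-the-envelope calculation shows how quickly the stored features add up. It uses the figures from this article (about 2,000 proposals per image, 4096 values per feature vector) and assumes each value is stored as a 4-byte float; the dataset size of 1,000 images is a hypothetical number chosen for illustration.

```python
def feature_cache_bytes(num_images, proposals_per_image=2000,
                        feature_dim=4096, bytes_per_value=4):
    """Bytes needed to store one feature vector per proposal.

    bytes_per_value=4 assumes float32 storage, which is an assumption
    made for this estimate.
    """
    return num_images * proposals_per_image * feature_dim * bytes_per_value


per_image = feature_cache_bytes(1)
print(f"{per_image / 2**20:.2f} MiB per image")  # 31.25 MiB per image

hypothetical_dataset = feature_cache_bytes(1000)
print(f"{hypothetical_dataset / 1e9:.1f} GB for 1,000 images")
```

Under these assumptions every image costs about 31 MiB of cached features, so storage grows into tens or hundreds of gigabytes for datasets with thousands of images.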