Indian Institute of Information Technology, Allahabad

Computer Vision and Biometrics Lab (CVBL)

Visual Recognition

Jan-May 2022 Semester

Previous Offerings

Visual Recognition 2021

Course Information

Objective of the course: Visual recognition has become part of everyday life, with applications in self-driving cars, satellite monitoring, surveillance, and video analytics, particularly scene understanding, crowd behaviour analysis, and action recognition. It eases human lives by acquiring, processing, analyzing, and understanding digital images, and by extracting high-dimensional data from the real world in order to produce numerical or symbolic information. Visual recognition encompasses image classification, localization, and detection. This course will help students understand the new tools, techniques, and methods that are influencing the visual recognition field.

Outcome of the course: At the end of this course, students will be able to apply the concepts to solve real problems in recognition. They will be able to use computational visual recognition for problems ranging from extracting features and classifying images to detecting and outlining objects and activities in an image or video, using machine learning and deep learning concepts. Students will also be able to invent new methods in visual recognition for various applications.

Class meets
Wednesday: 11.00 AM - 01.00 PM, Thursday: 07.00 - 09.00 PM, Friday: 03.00 - 05.00 PM

Course Ethics
  • Students are strictly advised to avoid unethical practices in the course, including in review tests and practice components.
  • The project component will be done in teams. Teams will be formed, and projects allotted, by the course instructors.
  • Students are not allowed to claim existing solutions available in the public domain as their own work in this course.
  • If you have already done a project similar to the one allotted to you, in any other course or with any other faculty member, you should inform us immediately, as projects similar to ones you have done previously are not allowed in this course.
  • It is best to try to solve problems on your own, since problem solving is an important component of the course.
  • You are not allowed to do or continue the same project in any other course or with any other faculty member.
  • You are allowed to discuss class material, problems, and general solution strategies with your classmates. However, when it comes to formulating or writing solutions, you must work and implement by yourself.
  • You may use free and publicly available sources, such as books, journal and conference publications, and web pages, as research material for your answers. (You will not lose marks for using external sources.) This does not mean that you may claim these existing resources as your own work.
  • You may not use any paid service, and you must clearly and explicitly cite all outside sources and materials that you made use of.
  • Students are not allowed to post the code, report, or any other course project material in the public domain, or share it with anyone else, without written permission from the course instructors.
  • We consider the use of uncited external sources as portraying someone else's work as your own, and as such it is a violation of the Institute's policies on academic dishonesty.
  • Such instances will be dealt with harshly and will typically result in a failing course grade.


Date | Topic | Resources
L01: Jan 10, 2022 | Course Introduction | Slide, Recorded Lecture
L02: Jan 13, 2022 | Local Features: What, Why and How | Slide, Recorded Lecture
L03: Jan 19, 2022 | Corner Detection | Slide, Recorded Lecture
L04: Jan 21, 2022 | Harris Detector and Invariance Property | Slide, Recorded Lecture
L05: Jan 29, 2022 | Blob and Region Detection | Slide, Recorded Lecture
L06: Feb 02, 2022 | Region Descriptors | Slide, Recorded Lecture
L07: Feb 09, 2022 | Local Descriptors | Slide, Recorded Lecture
L08: Feb 09, 2022 | Image Categorization | Slide, Recorded Lecture
L09: Feb 16, 2022 | Image Classifiers | Slide, Recorded Lecture
L10: Feb 18, 2022 | Neural Networks | Slide, Recorded Lecture
L11: March 02, 2022 | Convolutional Neural Networks | Slide, Recorded Lecture
L12: March 04, 2022 | CNN Training 1 | Slide, Recorded Lecture
L13: March 09, 2022 | CNN Training 2 | Slide, Recorded Lecture
L14: March 16, 2022 | CNN Architectures 1 | Slide, Recorded Lecture
L15: March 23, 2022 | CNN Architectures 2 | Slide, Recorded Lecture 1, Recorded Lecture 2
L16: March 30, 2022 | Object Detection | Slide, Recorded Lecture
L17: April 06, 2022 | Adversarial Attack | Slide, Recorded Lecture
L18: April 13, 2022 | Generative Models | Slide, Recorded Lecture

Computational Projects Added to Teaching Laboratories

Project ID | Team | Project Title | Abstract
VR22-P01 | IIT2019004 Naina Kumari, IIT2019006 Asha Jyothi Donga, IIT2019017 Shruti Nanda, IIT2019023 Utkarsh Gangwar | Face Recognition using Face Super-resolution | Some research has already been done to improve face recognition and detection in low-resolution environments; however, there is a lack of research on augmenting existing face recognition and identification systems with image optimizations to check the performance of face detection under the constraint of low-resolution images. In this paper, we propose a methodology that uses the Python library “face recognition”, which provides the pretrained HoG (Histogram of Oriented Gradients) model used for face detection. For image enhancement, we use three super-resolution techniques: nearest neighbour scaling, bicubic scaling, and the image super-resolution algorithm EDSR x4. We use SVM (Support Vector Machine) and KNN (K Nearest Neighbours) as our two models. Experimental results on the LFW (for training) and LFW3D (for testing) datasets demonstrate that our method achieves satisfying results, with accuracy improving as we move from the nearest neighbour super-resolution technique to the EDSR technique.
VR22-P02 | IIT2019025 Ritesh Raj, IIT2019027 Vidushi Pathak, IIT2019036 Jyotika Bhatti, IIT2019045 Amit Singh | Face Sketch Recognition using Sketch-to-Face Synthesis | This work addresses the problem of generating a face from a given sketch image. This task is challenging for several reasons, one of which is the large gap between a sketch and an actual face. During our research, we faced the issue of insufficient training pairs to train the model; we tackled this by applying data augmentation to the available images. We discuss the applications of converting sketches to faces and their recognition. This conversion involves learning the mappings between these two different image domains. Generative Adversarial Network models such as Pix2Pix, CycleGAN, DCGAN and BicycleGAN are used to achieve this. Our methodology includes Composition-aided GAN, DCGAN, CUHK, CUFS.
VR22-P03 | IIT2019077 Gade Srinivas Priyatham Reddy, IIT2019098 Abhinav, IIT2019112 Payili Vangmayi, IIT2019118 Shikhar Gupta | Person Synthesis under Different Clothing Style | With the rapid growth of social media and fashion awareness in society, there is a strong need for virtual try-on systems, where a person can view their appearance in different clothes and choose accordingly. In this project, we use an existing model, the Adaptive Content Generating and Preserving Network (ACGPN), which involves three stages of try-on generation. All experiments are conducted on the VITON dataset, consisting of 14,221 training pairs and 2,032 testing pairs. Results are measured using quantitative metrics such as SSIM, IS and FID scores over different difficulty levels of the dataset. Results show better performance of ACGPN over existing methods.
VR22-P04 | IIT2019119 Prakash Toppo, IIT2019121 Gurmeet Singh, IIT2019125 Aakash Bishnoi, IIT2019129 Sanyam Agarwal | Person Recognition from Aerial View | There are many real-life challenges such as finding objects in large-scale images captured from an aerial view, where the objects are very difficult to find because they appear tiny in the images. In this experiment, the main focus is to detect people in aerial images. Many earlier methods exist, such as RCNN, R-FCN and the feature pyramid network (FPN). We use a Scale Selection Pyramid Network (SSPNet), which consists of three components for tiny person detection: Context Attention Module (CAM), Scale Enhancement Module (SEM), and Scale Selection Module (SSM). SSPNet is an extension of FPN, where detectors focus on specific areas instead of reading the whole image, and further layers improve results by sharing previous results between deep layers.
VR22-P05 | IIT2019131 Priyanshu Jain, IIT2019133 Azmeera Mounika, IIT2019137 Harsh Abhijit Thete, IIT2019140 Sagar Barman | Hyperspectral Image Classification | While the human eye can only perceive two-dimensional colour images consisting of three channels, hyperspectral images (HSI) cover a range of the spectrum not accessible to the human eye. Hyperspectral images are generated by hyperspectral cameras. They span a wide range of wavelengths, have high spectral and spatial resolutions, and comprise a wealth of information. In hyperspectral image classification (HSIC), each pixel is assigned a class label using the spatial and spectral information in the image. Convolutional Neural Networks (CNN) have been extensively studied for hyperspectral image classification. In this paper, we employ a 3D followed by a 2D convolutional neural network model on the hyperspectral images to obtain the important spectral and spatial features. The data for this model is preprocessed using a dimensionality reduction technique. Different dimensionality reduction techniques were tested to determine their effect on this model’s performance on the Indian Pines dataset.
VR22-P06 | IIT2019141 Khushi Gupta, IIT2019145 Paras Agrawal, IIT2019155 Ritik Parmar, IIT2019158 Aryan Dhakad | Unsupervised Image Retrieval | Image retrieval has long been a fast-growing area of digital technology for researchers in computer science. It is a technique for retrieving digital images from a large database. Well-known organizations that use this technique include Google and Pinterest. In this conference paper, a content-based image retrieval system that uses a type of neural network known as an autoencoder is discussed, and a basic system is developed to understand it. The methodology is unsupervised: the system retrieves images without searching for their names, labels, or tags, using only their visual information. This approach to image retrieval is known as Content-Based Image Retrieval (CBIR).
VR22-P07 | IIT2019160 Tejas Dutta, IIT2019161 Aadharsh Roshan Nandhakumar, IIT2019162 Vishal Burman, IIT2019164 Saksham Sood | Self-supervised Image Retrieval | Self-supervised product quantization, a strategy proposed by Young Kyun Jang and Nam Ik Cho, is trained in a label-free and self-supervised manner using a cross quantized contrastive learning strategy. This method jointly learns codewords and descriptive features by contrasting two randomly augmented views of the image. In this work, we extend this method to include n views of the image and compare the results.
VR22-P08 | IIT2019166 Arun Kumar, IIT2019167 Ansh Verma, IIT2019173 Sankalp Rajendran, IIT2019177 Rohit Kumar Gupta | Network Pruning for Faster Face Recognition | Pruning is one of the techniques used to create lightweight neural network models that can be used on resource-constrained mobile devices while maintaining accuracy. Existing methods for pruning include fine-tuning, retraining after pruning, gradual pruning and weight rewinding. In this paper, we try to reproduce the results of Renda et al., who retrain the final unpruned weights using the original training schedule. A ResNet-50 architecture was trained on the CIFAR-10 dataset to achieve 88.5% accuracy, with less than 1% accuracy reduction after pruning and a 26% reduction in weights.
VR22-P09 | IIT2019179 Sharma Sahil, IIT2019180 Rajveer, IIT2019183 Devender Kumar, IIT2019184 Pratyush Pareek | Person Image Synthesis in Random Poses | In this work, we attempt to synthesize person images in novel poses, i.e., given a person’s image in a specific pose and a desired pose, we attempt to produce the image of that person in the desired pose without losing the identity of the person or the background. To do so, we use attention-based generative networks, trained on pairs of person images and their corresponding poses, that slice an image into several layers corresponding to distinct body parts and a background layer, transform the pose by moving the sections to their desired positions, refine the background, and then integrate the new foreground with the new background to generate the desired image.
VR22-P10 | IIT2019185 R Shwethaa, IIT2019186 Shah Udgam Birenbhai, IIT2019189 Nidhi Kamewar, IIT2019196 Priyanshu | Face Recognition under Mask | Recent research has made significant progress in the field of face recognition. With the help of large-scale training datasets that include distorted, blurred, rotated, and discolored images, facial recognition has improved to provide results with increased accuracy. However, these techniques do not perform well when the recognition system is presented with occluded faces. The COVID-19 pandemic and mask mandates in various countries present a huge challenge to existing face recognition systems. In this work, we propose a method to fine-tune the pre-trained VGGFace model and achieve high accuracy in recognizing masked faces.
VR22-P11 | IIT2019202 Jyoti Verma, IIT2019204 Mitta Lekhana Reddy, IIT2019208 Dhanush Vasa, IIT2019219 Gitika Yadav | Facial Micro-expression Recognition | Even humans have difficulty recognizing fake facial expressions. In the meantime, the field of computer vision is exploring how to recognize facial expressions in video. In contrast to true emotions, which are expressed through facial micro-expressions, facial expressions are a spontaneous reaction to something. Despite a few attempts at identifying micro-expressions, state-of-the-art methods are not able to identify them with high accuracy. Several CNN-based papers on identifying micro-expressions from still images were studied in the literature. In this paper, we propose a Micro Expression Recognition CNN model on the MicroExpressions dataset that analyzes the face of the person in question. The proposed CNN model has shown very promising results, as reported at the end of the paper.
VR22-P12 | IIT2019221 Divyansh Rai, IIT2019226 Mukul Mohmare, IIT2019229 Navneet Yogesh Bhole, IIT2019230 Eshan Vaid | Histopathological Colon Cancer Recognition | With the advent of new technologies, medical facilities and computer vision techniques have evolved considerably in the past decade. Providing medical assistance in the form of chat box and computer-aided diagnostics has become a major research topic for researchers in the field of artificial intelligence. Cancer in the colon region is rated as the third most dangerous form of cancer, and thus efficient detection of cancer is of high importance. Understanding a medical condition just from an image or report is a difficult task for most people with no academic background in medicine. This work features residual networks and a modified multi-featured capsule network that uses pooling layers. Improved results on this task are a step towards better computer-aided procedures and diagnosis.
VR22-P13 | IIT2019236 Noonsavath Sravana Samyukta, IIT2019240 Ayush Khandelwal, IEC2019019 Vishwaas Pratap Singh, IEC2019036 Harsh Ranjan | Identity Recognition using Palmprint | The need for unique, reliable, convenient, stable and more secure personal features has prompted increasing interest in the development of biometric-based identification systems, which are based on who we are. Fingerprints are currently used in daily life but have a small region of interest and tend to change considerably in day-to-day life for various reasons. Hence we prefer palm-print based recognition methods for better results. In this paper, we build a face recognition model that uses Siamese Networks with a Triplet Loss function to produce a distance value that indicates whether two images are the same or different.
VR22-P14 | IEC2019053 Chandan Ahire, IEC2019061 Prabhnoor Singh, IEC2019070 Priyansha Gupta, IEC2019071 Anurag Sharma | Identity Recognition using Knuckleprint | Security is a major concern in many things that happen around us. Be it a bank transaction or an email account verification, secure systems play an important role. Hand-based biometrics perform really well when it comes to verifying a person's identity. The fingerprint has been the major candidate for this purpose in past years and has given very promising results. In our research, we want to move the spotlight to a lesser-known identity recognition candidate, the finger knuckle print. The finger knuckle has very rich and intricate patterns, and these patterns have broader edges than those of the fingerprint, which makes it easy for even a low-resolution sensor to detect them. The finger knuckle also does not require physical contact to identify the person, which makes it a safer alternative during a pandemic. Although the finger knuckle does contain as much information as a fingerprint, it can be used alongside the fingerprint to strengthen the security system. To achieve this goal, we use transfer learning with a pretrained EfficientNetV2B0 model to detect the hidden features; these hidden features are then used by the classification layer to classify the images. The dataset we use is the IIT Delhi Finger Knuckle Database 1.0, which has a total of 790 images collected from 158 subjects.
VR22-P15 | IEC2019074 Ravi Agrawal, IEC2019075 Deepak Gupta, IEC2019079 Sachin Kanyal, IEC2019086 Udhav Rana, IIT2017062 Kaustubh Chetan Parmar | Transformer based COVID19 Recognition from X-Ray | Coronavirus disease is a worldwide pandemic, and detecting it is a momentous task for medical professionals today because of its fast mutations. Current methods of examining chest X-rays and CT scans require significant expertise and are time-consuming, which takes up the valuable time of medical professionals when people’s lives are at stake. This study attempts to assist this process by achieving state-of-the-art performance in classifying chest X-rays by fine-tuning a Vision Transformer (ViT). The proposed approach uses a pretrained model (google/vit-base-patch16-224) which is trained on ImageNet-21k (14 million images, 21,843 classes). The database contains around 21k images, including 3616 COVID, 6012 lung-opacity, 1345 viral pneumonia and 10192 normal chest X-ray images. This would be especially helpful in this epidemic, because the illness load and the necessity for preventative measures are in conflict with available resources.
VR22-P16 | MIT2021046 Koppula Krishna Sai, MIT2021059 Anwesh Panda, MIT2021079 Saurav Sagar, MIT2021082 Dhote Anurag Radhesham | Pose Invariant Face Recognition | In recent years, face recognition has become a significant problem due to its versatile applications in computer vision, and it has seen tremendous progress since the inception of deep learning methods. However, face recognition under different poses remains a challenge, owing to features present in hidden feature maps that render the representation pose-variant. Pose invariant face recognition refers to the ability of a system to identify a facial representation despite pose variations. Compared to frontal face recognition, pose invariant face recognition remains largely unsolved. State-of-the-art facial recognition techniques perform poorly when pose changes are introduced to an image. In this project, we attempt to address this problem using deep learning techniques that exploit the hidden features, present in the hidden layers of neural networks, that lead to pose variations.
VR22-P17 | RSI2022003 Suvramalya Basak | Action Recognition | Action recognition is a widely studied task in computer vision due to its potential for use in many applications such as intelligent surveillance, virtual reality, and robotics. Temporal information is vital for video action recognition, as the temporal domain contains several cues which, when leveraged, can lead to much better performance. Another challenge is the early detection of an action. Making a prediction from only the first few frames of a video is especially difficult, as not enough information is available. As a solution to this task, a teacher-student approach has been taken. Knowledge from the teacher network is used by the student network to learn to predict actions early. The teacher model is an action recognition model trained on all frames of the video. In this work, the ACTION-net model has been used as the teacher model. This model uses an embedded multi-path excitation module inside a ResNet-50 model. It is much less computationally expensive than 3D-CNNs and two-stream architectures, as it uses 2D-CNNs with the added module to capture multi-type information. The student is also an ACTION-net model, trained using a distillation loss from the pre-trained teacher ACTION-net model. The results are tested on a subset of the UCF50 dataset. The task is to recognize 50 different action classes.
VR22-P18 | RSI2021003 Neeraj Baghel | Image Super-resolution | Image super-resolution aims to synthesize a high-resolution image from a low-resolution image. It is an active area of research for overcoming resolution limitations in several applications such as low-resolution object recognition and medical image enhancement. In recent years, image super-resolution has witnessed huge progress using deep learning methods. Generative adversarial network (GAN) based methods have been the state of the art for image super-resolution, utilizing convolutional neural network (CNN) based generator and discriminator networks. However, CNNs are not able to exploit global information very effectively, in contrast to transformers, which are a recent breakthrough in deep learning that exploits the self-attention mechanism. Motivated by the success of transformers in language and vision applications, we propose a transformer-based GAN model for image super-resolution.
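
The VR22-P13 abstract above mentions a triplet loss that yields a distance value indicating whether two images are the same or different. As a rough illustration of that idea only (not the team's implementation), the sketch below computes a triplet loss on plain Python lists standing in for embedding vectors. The function names, the margin of 0.2, the threshold of 0.5, and the use of unsquared Euclidean distance are all assumptions made for this example.

```python
import math


def euclidean_distance(a, b):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss on embedding vectors.

    The loss is zero once the anchor is closer to the positive than to
    the negative by at least `margin` (an assumed value here).
    """
    d_pos = euclidean_distance(anchor, positive)
    d_neg = euclidean_distance(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


def same_identity(emb1, emb2, threshold=0.5):
    """Declare two embeddings the same identity if they are close enough.

    `threshold` is a hypothetical value; in practice it would be tuned
    on validation data.
    """
    return euclidean_distance(emb1, emb2) < threshold
```

In a Siamese setup, the same network would produce all three embeddings, and the decision function would be applied to pairs of embeddings at test time.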


Grading
  • C1 (30%): 10% Written + 20% Practice
  • C2 (30%): 10% Written + 20% Practice
  • C3 (40%): 20% Written + 20% Practice


Prerequisites
  • Computer Programming
  • Data Structures and Algorithms
  • Machine Learning
  • Image and Video Processing
  • Ability to deal with abstract mathematical concepts


The content (text, images, and graphics) used in this slide is adapted from many sources for academic purposes. Broadly, the sources have been given due credit. However, some original primary sources may have been missed. The authors of this material do not claim any copyright over such material.