Indian Institute of Information Technology, Allahabad

Computer Vision and Biometrics Lab (CVBL)

Visual Recognition

July-Dec 2023 Semester


Previous Offerings


Course Information

Objective of the course: Visual recognition has become part of our daily lives, with applications in self-driving cars, satellite monitoring, surveillance, and video analytics, particularly in scene understanding, crowd behaviour analysis, and action recognition. It eases human lives by acquiring, processing, analyzing, and understanding digital images, and by extracting high-dimensional data from the real world to produce numerical or symbolic information. Visual recognition encompasses image classification, localization, and detection. This course will help students understand the new tools, techniques, and methods that are shaping the visual recognition field.

Outcome of the course: At the end of this course, students will be able to apply the concepts to solve real problems in recognition. They will be able to use computational visual recognition for problems ranging from extracting features and classifying images to detecting and outlining objects and activities in an image or video using machine learning and deep learning concepts. Students will also be able to devise new visual recognition methods for various applications.



Class meets
Thursday: 09.00 AM - 11.00 AM, Thursday: 03.00 PM - 05.00 PM

Course Ethics
  • Students are strictly advised to avoid unethical practices in the course, including in the review tests and practice components.
  • The project component will be done in teams. Teams will be formed by the course instructors, and project allotment will also be done by the course instructors.
  • Students are not allowed to claim existing solutions available in the public domain as their own work in this course.
  • If the project allotted to you is similar to one you have already done in another course or with another faculty member, you must inform us immediately; projects that duplicate work you have done previously are not allowed in this course.
  • It is best to try to solve problems on your own, since problem solving is an important component of the course.
  • You are not allowed to do or continue the same project in any other course or with any other faculty member.
  • You are allowed to discuss class material, problems, and general solution strategies with your classmates. However, when it comes to formulating or writing solutions, you must work and implement on your own.
  • You may use free and publicly available sources, such as books, journal and conference publications, and web pages, as research material for your answers. (You will not lose marks for using external sources.) This does not mean that you may claim these existing resources as your own work.
  • You may not use any paid service, and you must clearly and explicitly cite all outside sources and materials that you made use of.
  • Students are not allowed to post the code, report, or any other material of the course project in the public domain, or to share it with anyone else, without written permission from the course instructors.
  • We consider the use of uncited external sources as portraying someone else's work as your own; as such, it is a violation of the Institute's policies on academic dishonesty.
  • Such instances will be dealt with harshly and will typically result in a failing course grade.

Schedule

L01: Course Introduction (Slide)
L02: Local Features: What, Why and How (Slide)
L03: Corner Detection (Slide)
L04: Harris Detector and Invariance Property (Slide) (see the sketch after this schedule)
L05: Blob and Region Detection (Slide)
L06: Region Descriptors (Slide)
L07: Local Descriptors (Slide)
L08: Image Categorization (Slide)
L09: Image Classifiers (Slide)
L10: Neural Networks (Slide)
L11: Convolutional Neural Networks (Slide)
L12: CNN Training 1 (Slide)
L13: CNN Training 2 (Slide)
L14: CNN Architectures 1 (Slide)
L15: CNN Architectures 2 (Slide)
L16: Object Detection (Slide)
L17: Semantic Segmentation (Slide)
L18: Adversarial Attack (Slide)
L19: Generative Models (Slide)
L20: Transformer Models (Slide)
L21: Video Recognition (Slide)
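
To make the local-feature lectures (L03-L04 above) concrete, here is a minimal Harris corner detection sketch using OpenCV. It is an illustration only, not course-provided code; the image path and the response threshold are placeholder assumptions.

    # Minimal Harris corner detection sketch (OpenCV).
    # Assumptions: OpenCV installed (pip install opencv-python);
    # 'campus.jpg' is a placeholder image path.
    import cv2
    import numpy as np

    img = cv2.imread("campus.jpg")            # BGR image
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY).astype(np.float32)

    # blockSize=2: neighbourhood size; ksize=3: Sobel aperture; k=0.04: Harris constant
    response = cv2.cornerHarris(gray, blockSize=2, ksize=3, k=0.04)

    # Mark responses above 1% of the maximum as corners (the threshold is a free choice)
    corners = response > 0.01 * response.max()
    img[corners] = (0, 0, 255)                # paint detected corner pixels red

    cv2.imwrite("corners.jpg", img)
    print(f"Detected {int(corners.sum())} corner pixels")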

Computational Projects Added to Teaching Laboratories

Project ID Team Project Title Abstract
VLR23-P01 IIT2020011 ANKIT KUMAR Image Super-resolution SwinIR gives good results for the task of image super-resolution. In this report we explain the architecture of SwinIR and provide a comparison of the performance of different techniques.
VLR23-P02 IIB2020008 SAMRIDDHI V WALIA, IIB2020014 MOHAN LAL AGARWALA, IIB2020502 ANIRUDDH SHARMA, IIT2020166 SHANTANU CHAUDHARY Human Counting in Crowded Scenario using DETR In this report, we present a human detection and counting system developed using YOLOv3, a state-of-the-art deep learning algorithm for real-time object detection. The primary objective of this system is to provide efficient and accurate human detection in various surveillance scenarios, ranging from retail space monitoring to crowd management in public transportation systems. The importance of this system is underscored by its potential applications in public safety and health, particularly in contexts like monitoring crowd sizes for disease control purposes. The YOLOv3 algorithm is chosen for its balance between speed and accuracy, making it suitable for real-time application scenarios. Our system demonstrates its capability to effectively detect and count humans in diverse and dynamic environments, highlighting its potential as a versatile tool in surveillance and monitoring applications. (See the illustrative sketch after this table.)
VLR23-P03 MML2022001 RUPESH G, MML2022004 RAJ AHAMED SHAIK, MML2022016 ASHUTOSH VERMA Single Image Dehazing Image dehazing is the process of generating clear, haze-free images from hazy photographs. Although convolutional neural networks are commonly used for this task, image dehazing has yet to benefit from the recent breakthroughs that vision Transformers have brought to high-level vision problems. This paper's authors investigate the use of the Swin Transformer for image dehazing and propose DehazeFormer, which comprises changes to the normalisation layer, activation function, and spatial information aggregation approach. Several variants of DehazeFormer were trained on different datasets to demonstrate its efficiency. The large model outperformed all previous state-of-the-art approaches on the SOTS indoor set, whereas the small model outperformed FFA-Net with a substantially smaller number of parameters and lower computational cost. The approach's efficiency on severely non-homogeneous haze was further evaluated using a large realistic remote sensing dehazing dataset acquired by the researchers.
VLR23-P04 IIT2020018 BOTTE SHREYA, IIT2020040 KATAM BALA PRASANNA BABU, IIT2020199 VELPULA VAMSHI, IIT2020217 VELAGANA NAGENDRA, IIT2020255 DONTHOJU RAGHAVA Cross Day-Night Image Classification Image classification under cross day and night scenarios is a challenging problem in computer vision. The challenge of training a model on daytime photos from six distinct classes and assessing its performance on nighttime images from the same classes is covered in detail in this article. In addition to reviewing pertinent literature, describing the dataset, outlining the approach, and presenting experimental findings, we also explore the problems that this endeavour presents. The purpose of the study is to clarify any potential obstacles and solutions to this issue.
VLR23-P05 IIT2020173 ANISH JAIN, IIT2020181 JINIYA SINGAL, IIT2020182 DABERAO AKSHAY GAJANAN, IIT2020185 PATEL SAURABH, IIT2020188 SOLANKI TANMAY MOHANBHAI Hand Gesture Recognition using Deep CNN Hand gesture recognition is a critical component that offers a natural and intuitive means of communication with machines. In this paper, we present a novel approach to automated hand gesture recognition utilizing a deep convolutional neural network model. The model is designed to address challenges such as variations in hand poses, complex backgrounds, and lighting conditions, so that it works in real-world applications.
VLR23-P06 IIT2020031 RAUNAK KRISHAN JAISWAL, IIT2020033 ADITYA BISWAKARMA, IIT2020055 SAURABH KUMAR, IIT2020106 NEEL PATEL, IIT2020243 AKULA ABHIRAM Facial Micro-Expression Recognition using Deep Learning Techniques In computer vision, micro-expression (ME) detection refers to the process of detecting micro facial expressions in still images and videos. Our work presents a novel CNN-based method, built on ME datasets, to measure real emotions. MEs are very brief and hard to notice even for humans, and they reveal hidden emotions of the inner mind, which makes them a challenging and promising area of research. Our model aims at improving accuracy by overcoming the limitations of the present methods. As final-year BTech students in India, we hope our research contributes to understanding hidden emotions, paving the way for future investigations across various applications.
VLR23-P07 IIT2020005 PUSHKAL MADAAN, IIT2020006 RITEJ DHAMALA, IIT2020008 AVISHKAR SINGH, IIT2020077 ANUSHKA AJIT DANDAWATE, IIT2020252 KAVITA Self-Supervised Image Retrieval In this paper we propose a self-supervised image retrieval system that can work effectively and efficiently on large, unlabeled datasets, specifically the UC Merced satellite image dataset. The system leverages a pre-trained ResNet50 model, which helps our main Siamese network learn efficiently from the semantic similarities of our unlabeled data. In addition, we use KNN to retrieve the images most similar to a queried test image and measure its top-5 accuracy. The study also looks into how deep neural architectures might improve self-supervised retrieval systems. Design decisions that might impact the effectiveness of self-supervised models are examined, including architectural options, model complexity, and transferability between datasets. (See the illustrative sketch after this table.)
VLR23-P08 MML2022002 HARSH, MML2022011 UMESH MAURYA Tiny Face Detection This project proposes an innovative approach for the detection of small faces in photographs by leveraging the power of Generative Adversarial Networks (GANs). Recognizing small faces in real-world images has proven to be a challenging task for existing face identification methods. To address this challenge, a two-stage methodology is introduced, in which a GAN is first employed to generate high-resolution images of small faces. The GAN is trained on an extensive dataset of facial photos and learns to generate high-quality images of small faces while considering both the input image and the desired face size. These artificially generated small-face images serve as a valuable resource for augmenting the training data of a face detection model, which is trained separately on a distinct set of images. By incorporating the generated images into the training dataset, the face detection model becomes more adept at accurately identifying small faces in photographs, enhancing its overall performance.
VLR23-P09 IIB2020030 MANISH KUMAR, IIT2020021 HARSHITA VYAS, IIT2020037 SAKSHI, IIT2020095 AMBIKESH ARMAN, IIT2020134 SHAH KRISHNA DINESHKUMAR Image Denoising using Image-to-Image Translation Image denoising, a crucial feature in today's visual technology, involves the elimination of unwanted noise from images. Although modern cameras capture high-resolution pictures, obtaining noise-free images remains a challenge, which necessitates pre-processing or post-processing techniques to diminish noise without compromising image quality. Our approach leverages autoencoders for image denoising. Autoencoders have the ability to learn from the provided data themselves, generating a model based on the data rather than on predefined filters. Additionally, they aim to reproduce the input at the output, which helps preserve image quality. Though computational time remains a potential concern, the benefits of employing autoencoders in denoising applications, from medical systems to smartphone image enhancement, are substantial. This paper explores the use of autoencoders, a deep learning technique employing downsampling and upsampling, as a solution to the denoising problem. Keywords: image, noise, autoencoder, denoise. (See the illustrative sketch after this table.)
VLR23-P10 IIB2020036 MIRIYALA POOJITHA, IIT2020144 PRANAV RAJ, IIT2020151 SHIVAM KATIYAR, IIT2020163 SARTHAK DALMIA, IIT2020205 ADITYA RAJ Drowsiness Detection using Faces Drowsiness has become the focus of researchers' attention in recent years, as it is the cause of many traffic accidents, and detecting it can also help determine the need for rest. Drivers experience fatigue when they drive for long periods of time, which affects their driving ability and can lead to deaths and injuries in car accidents. Fatigue can be caused by long driving, discomfort, headache, alcohol, drugs, and so on. This research can play an important role in the lives of drivers and could save their lives. In this article, we introduce an Android application that can detect sleep, activity, and blink count. The app sounds an alarm when the driver falls asleep and could thereby save the driver's life. The app provides five pieces of data: state percentage, sleep, blink count, yawn count, and the number of frames captured by the app's camera. It retrieves the driver's login information, reports these measurements, and sounds a warning if the driver is drowsy.
VLR23-P11 IIT2020227 MOHD WASIF, IIT2020242 MOHD SARFARAZ, IIT2020247 SANJAY RAM, IIT2020254 CHAUDHARI YOGIRAJ PRAKASH, IIT2020259 ANKIT KUMAR Photo ID Retrieval from Arbitrary Face Query The "Photo ID Retrieval from Arbitrary Face Query" project aims to develop a sophisticated face recognition system capable of identifying individuals based on their facial features. The project uses a CNN model for feature extraction and evaluates the system's performance on the LFW (Labeled Faces in the Wild) dataset, a widely recognized benchmark for face recognition. This report provides a comprehensive overview of the system, including data collection, preprocessing, feature extraction, database creation, the face query process, visualization, experimental results, and a thorough discussion.
VLR23-P12 IIB2020016 ANURAG HARSH, IIB2020018 ABHISHEK KUMAR, IIB2020024 VAIDIK SHARMA, IIB2020027 AMAN UTKARSH, IIT2020140 AYUSHI Image Caption Generation We present a deep learning model for image caption generation. The model takes images as input and generates captions for them. We have made use of transfer learning, word embeddings, and custom data generators for building this model. We have evaluated the relevance of the generated captions using the BLEU score, computing both individual and cumulative BLEU scores. We have used Python, Keras, and TensorFlow for the development of this model. (See the illustrative sketch after this table.)
VLR23-P13 IIT2020052 SANJEET, IIT2020053 SAMEER AHMED, IIT2020082 HARSH GARG, IIT2020218 JITU RAJAK, IIT2020244 RAHUL Selfie vs Non-selfie Classification In the era of ubiquitous smartphone usage and social media platforms, the line between personal and non-personal images has blurred significantly. Selfies, which are self-portraits typically taken with a smartphone camera, have become a ubiquitous form of self-expression. However, automatically distinguishing between selfie and non-selfie images presents an interesting and challenging problem in computer vision. This semester project aims to address the problem by developing a robust image classification system that can accurately differentiate between selfie and non-selfie images.
VLR23-P14 MML2022009 MANISH KUMAR, MML2022013 BHAVESH KUMAR BOHARA, MML2022014 KAVATHIYA KHYATI HARESHBHAI Image Deraining Removing rain streaks from a single photograph is challenging because rainy images usually contain rain streaks of different densities, sizes, shapes, and directions. Most current deraining methods use a deep network that adheres to a broad "encoder-decoder" design, capturing low-level features in the early layers and high-level features in the deeper layers. The rain streaks that must be removed for deraining are quite small, so stressing global features is not always an effective strategy for the problem. Therefore, in this paper, we suggest using an overcomplete convolutional network architecture that emphasizes learning local structures by restricting the filters' receptive field. We combine it with U-Net to compute the derained image, so that the network concentrates more on low-level features without overlooking global structures. The proposed over-and-under complete deraining network (OUCD) is split into two branches: an undercomplete branch with larger receptive fields that focuses on global structures, and an overcomplete branch that focuses on local structures. Numerous experiments on synthetic and real-world datasets demonstrate that the proposed strategy performs better than the most recent state-of-the-art techniques.
VLR23-P15 IIT2020158 S ANURAG REDDY, IIT2020164 SAVALA DEEPIKA, IIT2020213 ANKADALA JEEVAN, IIT2020250 PULUKURI JAGADEESH, IIT2020266 NENAVATH ABHIRAM NAIK Homography Matrix Computation between Images using Deep Learning We introduce a deep convolutional neural network designed to estimate the relative homography between two images. Our feed-forward network has ten layers; it takes two stacked grayscale images as input and produces an 8-degree-of-freedom homography that maps pixels from the first image to the second. We offer two network architectures for HomographyNet: a regression network, which directly computes the real-valued homography parameters, and a classification network, which produces a distribution over quantized homographies. We employ a 4-point homography parameterization, which involves projecting the four corners of one image onto the other. Our networks are trained end-to-end on distorted MS-COCO images. Our method functions without requiring separate stages for local feature detection and transformation estimation. We compare our deep models with a conventional homography estimator based on ORB features and show the situations in which HomographyNet performs better than the conventional method. We further highlight the versatility of a deep learning approach by describing a range of applications driven by deep homography estimation. (See the illustrative sketch after this table.)
VLR23-P16 IIT2020044 PRIYA DEVI, IIT2020060 PERISETLA SRI SATWIK, IIT2020065 DASA AKSHITHA, IIT2020196 KALYANI BHUSHAN PHARKANDEKAR, IIT2020208 MARPINA SRUJANA Face Recognition from Partial Faces Partial face recognition, a crucial branch of facial recognition technology, identifies people from partially hidden or missing facial traits. This specialized topic has grown in popularity due to its potential uses in security, surveillance, and human-computer interaction. People frequently present their faces under a variety of circumstances, such as partial occlusion by masks or accessories, or poor lighting; hence the capacity to identify and verify persons using limited facial information is crucial in practice. This article examines the difficulties, approaches, and developments in partial face recognition, providing insight into how it is evolving in the context of biometric identification and surveillance systems.
VLR23-P17 MHC2022001 AMIT ROY, MHC2022011 BHARGAV BURMAN, MHC2022013 HARSHIT GUPTA, MML2022003 DIPANKAR KARMAKAR Plant Disease Classification The global population growth has resulted in a scarcity of essential resources such as raw materials and food supply. The agricultural industry has emerged as the primary source for addressing this issue; however, it faces significant challenges due to the presence of pests and various crop diseases. Plant diseases pose a significant challenge to global agriculture, leading to substantial crop losses, and recognising diseases in plants is complex and difficult owing to a dearth of specialised expertise. The utilisation of deep learning-based models enables the diagnosis of plant diseases through the analysis of leaf photographs. The primary challenges that remain to be addressed in these algorithms include the necessity for larger training sets, considerations related to computational complexity, the issue of overfitting, and other associated difficulties. This study centres on a novel machine learning model, derived from convolutional neural networks (CNN), with the aim of enhancing its effectiveness, and additionally provides a concise summary of the existing published solutions in this domain. In order to increase the size of the training set without the need for additional photographs, several augmentation techniques such as shift, shear, scaling, zooming, and flipping are employed; these approaches produce additional samples, hence enlarging the training set. The CNN model has been trained on a publicly accessible dataset called PlantVillage to accurately detect and classify Early Blight and Late Blight diseases in potato leaves. (See the illustrative sketch after this table.)
VLR23-P18 MHC2022005 DASAROJU JAGANNADHACHARI, MML2022005 PRAGATI, MML2022007 SAYANTAN CHAKRABORTY, MRM2022006 BEHERA JYOTHIKRISHNA Image Inpainting Using GAN Image inpainting is an important topic of research in the field of image processing. The prime goal of image inpainting is to recover missing details in an image, demosaic the image, and so on. In this paper we discuss the progress of the image inpainting project using deep learning models, namely the Generative Adversarial Network. The project has been implemented in the Google Colaboratory environment.
VLR23-P19 IIT2020025 MANPREET SINGH, IIT2020032 KARTIK GUPTA, IIT2020219 Tanu Shree Suthar, IIT2020221 TUSHAR AGGARWAL Visual Grounding using CNNs Computer vision algorithms are typically trained to predict a limited number of object categories, limiting their generality and applicability. Learning directly from raw text about images is a promising alternative that takes advantage of a much larger source of supervision. On a large dataset of millions of image and text pairs acquired from the internet, the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn state-of-the-art image representations from scratch. After pre-training, natural language is used to reference the learned visual concepts or describe new ones, enabling zero-shot transfer of the model to downstream tasks. The model applies readily to most tasks and is frequently competitive with a fully supervised baseline without requiring dataset-specific training. (See the illustrative sketch after this table.)
VLR23-P20 IIT2020009 AASHISH AGRAWAL, IIT2020010 RAJ CHHARI, IIT2020183 LOKESH MEHTA, IIT2020209 AADITYA RATHOD, IIT2020505 AKSHAT GHARIYA Viewpoint Invariant Scene Recognition of IIITA Campus using Deep Learning In this paper, we use deep learning approaches to handle the challenge of viewpoint-invariant scene recognition on the campus of the Indian Institute of Information Technology Allahabad (IIITA). The principal aim is to create a computer system that can identify different sights and landmarks on the IIITA campus while remaining robust to changes in viewpoint. We investigate the creation and use of a Convolutional Neural Network (CNN) model specifically designed for image classification. Our main goal is to use this model to categorize photos from the "Campus Images Dataset," a set of ten different categories covering the campus: admin, adminback, audi, cafeteria, cc2, cc3, cc3back, library, mandir, and rm.
VLR23-P21 IIB2020021 GAGAN BANSAL, MML2022006 MOHD FAIZ ANSARI, MML2022008 RAKSHIT SANDILYA, MML2022010 NIKHIL RAJPUT, MML2022012 HIMANSHU MITTAL Thermal to Visible Image Translation This report details the implementation and assessment of a pix2pix GAN for thermal-to-visible image translation. The project uses deep learning algorithms to produce high-quality visible images from thermal photos. An overview of the methodology, data collection, and pre-processing procedures is provided in the report. The model was assessed using the SSIM and PSNR metrics. The outcomes demonstrate that the suggested method successfully converts thermal images into realistic, high-quality visible images. This method may be applied in a variety of fields, including surveillance, search and rescue, and medical imaging. Overall, the report shows how well pix2pix GAN works for image translation tasks and offers suggestions for further study in this field. (See the illustrative sketch after this table.)
VLR23-P22 IIT2020154 SHIVEK PAMNANI, IIT2020160 ANUSHKA ARUN KALWALE, IIT2020179 KARUS MANISHA, IIT2020189 ROUNAK DEV, IIT2020190 MALYALA MEGHAMSH Clothing Outfit Rating using CNNs The fashion industry is changing, and it’s all because of the internet. The way we shop and see what everyone else is wearing has shifted to online platforms. Now, we need a system that tells us how good our outfit is. With this digital transformation came a need for automated clothing outfit rating systems. The one presented in this paper uses Convolutional Neural Networks (CNNs). Using deep learning and computer vision techniques, our system analyzes and evaluates clothing outfits based on many visual features. Some of those include color combinations, clothing styles, and overall aesthetics. To make it work, we took a pre-trained CNN architecture and fine-tuned it with a large dataset of labeled clothing outfits. The methodology works like this: the system first takes each piece of clothing in an outfit to extract feature representations from them. Then they’re combined to give the entire outfit an overall rating. We tested the performance using countless outfits with different styles, colors, and more. The results showed that our approach was highly effective at providing accurate ratings that have meaning behind them.
VLR23-P23 MRM2022002 AKASH TYAGI, MRM2022003 ANKIT RAJ RAVI, MRM2022004 ADITYA, MRM2022005 HIMANSHU MISHRA Impact of Different Activation Functions on ViT Model The advent of Vision Transformer (ViT) models has heralded a novel approach to handling image data, veering from traditional Convolutional Neural Networks (CNNs) towards transformer architectures. A significant factor influencing a ViT model's performance and training dynamics is the choice of activation function, which induces the requisite non-linearity for complex pattern recognition. This study embarks on an in-depth examination of various activation functions to discern their impact on the ViT model's effectiveness, training dynamics, and computational efficiency across multiple datasets. The aim is to furnish a nuanced understanding of how activation functions affect the learning, generalization, and robustness of ViT models, and to provide empirical guidelines for their optimal selection in different computer vision applications. Our findings elucidate the critical role of activation functions, offering valuable insights for the enhanced tuning and optimization of ViT models in computer vision tasks. (See the illustrative sketch after this table.)
VLR23-P24 IIT2020007 SHUBHAM KUMAR BHOKTA, IIT2020022 RAHUL MAHTO, IIT2020024 SHASHIKANT THAKUR, IIT2020043 ROHIT CHOWDHURY, IIT2020220 MOHIT KUMAR Identification of Artificially Generated Images In our proposed approach, we use deep convolutional neural networks, in particular the ResNet architecture, to distinguish between real and fake images. We concentrate on unique patterns and features found at the pixel level and in the structural properties of generated images. (See the illustrative sketch after this table.)
VLR23-P25 IIT2020067 ADITYA SINGH, IIT2020070 EKAGRA SINHA, IIT2020089 DEVESH KUMAR PARTE, IIT2020101 LUKESH NITIN PATIL, IIT2020105 JAMBHULE SAHAS DEVIDAS Student Counting in Classroom In this paper we propose a model to count the number of students in a classroom-like environment. This count can be used to estimate crowd density and may be helpful in making certain decisions or predictions.
VLR23-P26 RSI2022502 AJAY KUMAR YADAV Analysis of Robustness in Deep Learning Models
VLR23-P27 RSI2023001 AKASH VERMA Efficient ViT Models for Small-scale Datasets
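
Illustrative code sketches for selected projects. The snippets below are minimal, hedged sketches added for illustration; they are not the teams' project code, and all file names, thresholds, and sizes in them are placeholder assumptions.

VLR23-P02 (human counting): a sketch of detection-based person counting. The project used YOLOv3; purely for illustration, this sketch loads a COCO-pretrained YOLOv5 model through torch.hub (the ultralytics/yolov5 hub entry point) and counts detections of the COCO "person" class.

    # Count people in an image with a COCO-pretrained detector (sketch).
    # Assumptions: torch and the ultralytics/yolov5 hub entry point are available;
    # 'crowd.jpg' is a placeholder path; the project itself used YOLOv3.
    import torch

    model = torch.hub.load("ultralytics/yolov5", "yolov5s", pretrained=True)
    model.conf = 0.4                      # confidence threshold (a free choice)

    results = model("crowd.jpg")          # run inference on the image
    det = results.pandas().xyxy[0]        # one row per detection
    people = det[det["name"] == "person"]
    print(f"Estimated person count: {len(people)}")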
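
VLR23-P07 (self-supervised image retrieval): a sketch of the retrieval step only: images are embedded with a pre-trained ResNet50 (classification head removed) and the nearest neighbours of a query are returned. The gallery and query paths are placeholders, and the project's Siamese fine-tuning is omitted.

    # Embed images with a pre-trained ResNet50 and retrieve nearest neighbours (sketch).
    # Assumptions: torchvision and scikit-learn installed; the paths are placeholders.
    import numpy as np
    import torch
    import torchvision.models as models
    import torchvision.transforms as T
    from PIL import Image
    from sklearn.neighbors import NearestNeighbors

    backbone = models.resnet50(pretrained=True)
    backbone.fc = torch.nn.Identity()     # drop the classifier: 2048-d embeddings
    backbone.eval()

    tf = T.Compose([T.Resize((224, 224)), T.ToTensor(),
                    T.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])])

    def embed(path):
        x = tf(Image.open(path).convert("RGB")).unsqueeze(0)
        with torch.no_grad():
            return backbone(x).squeeze(0).numpy()

    gallery = ["img_000.jpg", "img_001.jpg", "img_002.jpg"]   # placeholder paths
    feats = np.stack([embed(p) for p in gallery])

    knn = NearestNeighbors(n_neighbors=min(5, len(gallery)), metric="cosine").fit(feats)
    _, idx = knn.kneighbors(embed("query.jpg")[None])
    print("Top matches:", [gallery[i] for i in idx[0]])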
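
VLR23-P09 (image denoising): a minimal convolutional autoencoder sketch in PyTorch, trained to map noisy images back to clean ones. The downsampling/upsampling structure mirrors what the abstract describes; the layer sizes and noise level are illustrative assumptions.

    # Minimal denoising convolutional autoencoder (sketch, PyTorch).
    # Assumption: clean_batch is a placeholder for real grayscale images in [0, 1].
    import torch
    import torch.nn as nn

    class DenoisingAE(nn.Module):
        def __init__(self):
            super().__init__()
            self.encoder = nn.Sequential(              # downsampling path
                nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU())
            self.decoder = nn.Sequential(              # upsampling path
                nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),
                nn.ConvTranspose2d(16, 1, 4, stride=2, padding=1), nn.Sigmoid())

        def forward(self, x):
            return self.decoder(self.encoder(x))

    model = DenoisingAE()
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.MSELoss()

    clean_batch = torch.rand(8, 1, 64, 64)             # placeholder for real data
    noisy_batch = (clean_batch + 0.1 * torch.randn_like(clean_batch)).clamp(0, 1)

    pred = model(noisy_batch)                          # reconstruct the clean image
    loss = loss_fn(pred, clean_batch)
    opt.zero_grad(); loss.backward(); opt.step()
    print(f"reconstruction loss: {loss.item():.4f}")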
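
VLR23-P12 (image captioning): a sketch of the evaluation step the abstract mentions, computing individual and cumulative BLEU scores with NLTK. The reference and candidate captions are made-up examples.

    # Individual and cumulative BLEU scores with NLTK (sketch).
    # Assumption: nltk installed; the captions below are made-up examples.
    from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

    reference = [["a", "dog", "runs", "on", "the", "grass"]]   # list of references
    candidate = ["a", "dog", "is", "running", "on", "grass"]
    smooth = SmoothingFunction().method1

    # Individual n-gram scores: all weight on a single n-gram order
    bleu1 = sentence_bleu(reference, candidate, weights=(1, 0, 0, 0),
                          smoothing_function=smooth)
    bleu2 = sentence_bleu(reference, candidate, weights=(0, 1, 0, 0),
                          smoothing_function=smooth)

    # Cumulative BLEU-4: geometric mean of 1- to 4-gram precisions
    bleu4 = sentence_bleu(reference, candidate, weights=(0.25, 0.25, 0.25, 0.25),
                          smoothing_function=smooth)
    print(f"BLEU-1 {bleu1:.3f}  BLEU-2 {bleu2:.3f}  cumulative BLEU-4 {bleu4:.3f}")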
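
VLR23-P15 (homography estimation): a sketch of the 4-point parameterization the abstract describes: the network predicts displacements of four corners, and the corresponding 3x3 homography is recovered with OpenCV. The displacement values here are made-up numbers standing in for network output.

    # Recover a 3x3 homography from the 4-point parameterization (sketch).
    # Assumption: the 8 corner displacements would come from the network;
    # here they are made-up values.
    import numpy as np
    import cv2

    h, w = 240, 320                                   # patch size (illustrative)
    corners = np.float32([[0, 0], [w, 0], [w, h], [0, h]])

    # Network output: (dx, dy) for each corner, i.e. 8 degrees of freedom
    pred_offsets = np.float32([[4, -3], [-6, 2], [5, 5], [-2, -4]])
    warped_corners = corners + pred_offsets

    H = cv2.getPerspectiveTransform(corners, warped_corners)
    print("Estimated homography:\n", H)

    # Sanity check: map the first corner through H
    p = np.array([0.0, 0.0, 1.0])
    q = H @ p
    print("corner (0,0) maps to", q[:2] / q[2])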
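
VLR23-P17 (plant disease classification): a sketch of the augmentation pipeline the abstract lists (shift, shear, zoom/scale, flip) using Keras' ImageDataGenerator. The directory path and image size are placeholder assumptions.

    # Shift/shear/zoom/flip augmentation with Keras (sketch).
    # Assumptions: TensorFlow installed; 'plantvillage/train' is a placeholder
    # directory laid out as one sub-folder per class.
    from tensorflow.keras.preprocessing.image import ImageDataGenerator

    augmenter = ImageDataGenerator(
        rescale=1.0 / 255,          # scale pixel values to [0, 1]
        width_shift_range=0.1,      # horizontal shift
        height_shift_range=0.1,     # vertical shift
        shear_range=0.2,            # shear
        zoom_range=0.2,             # zoom / scaling
        horizontal_flip=True)       # flipping

    train_gen = augmenter.flow_from_directory(
        "plantvillage/train", target_size=(224, 224),
        batch_size=32, class_mode="categorical")

    x_batch, y_batch = next(train_gen)   # each epoch sees freshly augmented samples
    print(x_batch.shape, y_batch.shape)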
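
VLR23-P19 (caption-image pre-training): a sketch of the symmetric contrastive objective behind "predict which caption goes with which image": matching image and text embeddings on the diagonal of a similarity matrix are pulled together with cross-entropy in both directions. The embeddings here are random placeholders for real encoder outputs.

    # Symmetric image-text contrastive loss (CLIP-style), sketch in PyTorch.
    # Assumption: img_emb / txt_emb stand in for the outputs of real encoders.
    import torch
    import torch.nn.functional as F

    batch, dim = 8, 512
    img_emb = F.normalize(torch.randn(batch, dim), dim=-1)   # placeholder features
    txt_emb = F.normalize(torch.randn(batch, dim), dim=-1)

    temperature = 0.07
    logits = img_emb @ txt_emb.t() / temperature   # pairwise cosine similarities

    # The i-th image matches the i-th caption: targets are the diagonal
    targets = torch.arange(batch)
    loss_i2t = F.cross_entropy(logits, targets)        # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)    # text -> image direction
    loss = (loss_i2t + loss_t2i) / 2
    print(f"contrastive loss: {loss.item():.3f}")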
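
VLR23-P21 (thermal-to-visible translation): a sketch of the evaluation the abstract mentions, computing PSNR and SSIM between a generated image and its ground truth with scikit-image. The two arrays are random placeholders for a pix2pix output and its target.

    # PSNR and SSIM between a generated and a ground-truth image (sketch).
    # Assumption: scikit-image installed; the arrays below are placeholders.
    import numpy as np
    from skimage.metrics import peak_signal_noise_ratio, structural_similarity

    rng = np.random.default_rng(0)
    target = rng.random((256, 256, 3))                 # placeholder ground truth
    generated = np.clip(target + 0.05 * rng.standard_normal(target.shape), 0, 1)

    psnr = peak_signal_noise_ratio(target, generated, data_range=1.0)
    ssim = structural_similarity(target, generated, channel_axis=-1, data_range=1.0)
    print(f"PSNR: {psnr:.2f} dB   SSIM: {ssim:.4f}")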
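
VLR23-P23 (activation functions in ViT): a sketch of the knob such a study varies: a transformer MLP block with a pluggable activation. Swapping nn.GELU for nn.ReLU or nn.SiLU changes only this non-linearity; everything else is a standard pre-norm block, and all sizes are illustrative.

    # Transformer MLP block with a configurable activation (sketch, PyTorch).
    # Assumption: sizes are illustrative; the activation class is the variable
    # under study.
    import torch
    import torch.nn as nn

    class MLPBlock(nn.Module):
        def __init__(self, dim=192, hidden=768, activation=nn.GELU):
            super().__init__()
            self.norm = nn.LayerNorm(dim)
            self.fc1 = nn.Linear(dim, hidden)
            self.act = activation()            # GELU / ReLU / SiLU, etc.
            self.fc2 = nn.Linear(hidden, dim)

        def forward(self, x):                  # pre-norm residual MLP
            return x + self.fc2(self.act(self.fc1(self.norm(x))))

    tokens = torch.randn(1, 197, 192)          # (batch, patches + CLS, dim)
    for act in (nn.GELU, nn.ReLU, nn.SiLU):
        out = MLPBlock(activation=act)(tokens)
        print(act.__name__, tuple(out.shape))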
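
VLR23-P24 (real vs. generated images): a sketch of the classification setup: a pretrained ResNet-18 (standing in for the project's ResNet) is given a fresh 2-way real/fake head, with one illustrative training step on placeholder data.

    # Fine-tune a pretrained ResNet for 2-class real/fake classification (sketch).
    # Assumption: the batches below are random placeholders for real training data.
    import torch
    import torch.nn as nn
    import torchvision.models as models

    model = models.resnet18(pretrained=True)
    model.fc = nn.Linear(model.fc.in_features, 2)      # real vs. fake head

    opt = torch.optim.Adam(model.parameters(), lr=1e-4)
    loss_fn = nn.CrossEntropyLoss()

    images = torch.randn(8, 3, 224, 224)               # placeholder image batch
    labels = torch.randint(0, 2, (8,))                 # 0 = real, 1 = fake

    model.train()
    logits = model(images)
    loss = loss_fn(logits, labels)
    opt.zero_grad(); loss.backward(); opt.step()
    print(f"step loss: {loss.item():.3f}")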

Grading

Prerequisites

Books

Disclaimer

The content (text, images, and graphics) used in these slides is adapted from many sources for academic purposes. Broadly, the sources have been given due credit appropriately. However, there is a chance that some original primary sources were missed. The authors of this material do not claim any copyright over such material.