Indian Institute of Information Technology, Allahabad

Computer Vision and Biometrics Lab (CVBL)

Visual Recognition

July-Dec 2023 Semester


Previous Offerings


Course Information

Objective of the course: Visual recognition has become part of our daily lives, with applications in self-driving cars, satellite monitoring, surveillance, and video analytics, particularly in scene understanding, crowd behaviour analysis, and action recognition. It eases human lives by acquiring, processing, analyzing, and understanding digital images, and by extracting high-dimensional data from the real world to produce numerical or symbolic information. Visual recognition encompasses image classification, localization, and detection. This course will help students understand the new tools, techniques, and methods that are shaping the visual recognition field.

Outcome of the course: At the end of this course, students will be able to apply the concepts to solve real problems in recognition. They will be able to use computational visual recognition for problems ranging from extracting features and classifying images to detecting and outlining objects and activities in an image or video using machine learning and deep learning concepts. Students will also be able to devise new visual recognition methods for various applications.



Class meets
Thursday: 09.00 AM - 11.00 AM, Thursday: 03.00 PM - 05.00 PM

Course Ethics
  • Students are strictly advised to avoid unethical practices in the course, including in the review tests and practice components.
  • The project component will be done in teams. Teams will be formed by the course instructors, and project allotment will also be done by the course instructors.
  • Students are not allowed to claim existing solutions available in the public domain as their own work in this course.
  • If the project allotted to you is similar to one you have already done in another course or with another faculty member, you must inform us immediately; projects that duplicate work you have done previously are not allowed in this course.
  • It is best to try to solve problems on your own, since problem solving is an important component of the course.
  • You are not allowed to do or continue the same project in any other course or with any other faculty member.
  • You are allowed to discuss class material, problems, and general solution strategies with your classmates. However, when it comes to formulating or writing solutions, you must work and implement on your own.
  • You may use free and publicly available sources, such as books, journal and conference publications, and web pages, as research material for your answers. (You will not lose marks for using external sources.) This does not mean that you may claim these existing resources as your own work.
  • You may not use any paid service, and you must clearly and explicitly cite all outside sources and materials that you made use of.
  • Students are not allowed to post the code, report, or any other material of the course project in the public domain, or to share it with anyone else, without written permission from the course instructors.
  • We consider the use of uncited external sources as portraying someone else's work as your own; as such, it is a violation of the Institute's policies on academic dishonesty.
  • Such instances will be dealt with harshly and will typically result in a failing course grade.

Schedule

L01: Course Introduction (Slide)
L02: Local Features: What, Why and How (Slide)
L03: Corner Detection (Slide)
L04: Harris Detector and Invariance Property (Slide) (see the sketch after this schedule)
L05: Blob and Region Detection (Slide)
L06: Region Descriptors (Slide)
L07: Local Descriptors (Slide)
L08: Image Categorization (Slide)
L09: Image Classifiers (Slide)
L10: Neural Networks (Slide)
L11: Convolutional Neural Networks (Slide)
L12: CNN Training 1 (Slide)
L13: CNN Training 2 (Slide)
L14: CNN Architectures 1 (Slide)
L15: CNN Architectures 2 (Slide)
L16: Object Detection (Slide)
L17: Semantic Segmentation (Slide)
L18: Adversarial Attack (Slide)
L19: Generative Models (Slide)
L20: Transformer Models (Slide)
L21: Video Recognition (Slide)
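
To make the local-feature lectures (L03-L04 above) concrete, here is a minimal Harris corner detection sketch using OpenCV. It is an illustration only, not course-provided code; the image path and the response threshold are placeholder assumptions.

    # Minimal Harris corner detection sketch (OpenCV).
    # Assumptions: OpenCV installed (pip install opencv-python);
    # 'campus.jpg' is a placeholder image path.
    import cv2
    import numpy as np

    img = cv2.imread("campus.jpg")            # BGR image
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY).astype(np.float32)

    # blockSize=2: neighbourhood size; ksize=3: Sobel aperture; k=0.04: Harris constant
    response = cv2.cornerHarris(gray, blockSize=2, ksize=3, k=0.04)

    # Mark responses above 1% of the maximum as corners (the threshold is a free choice)
    corners = response > 0.01 * response.max()
    img[corners] = (0, 0, 255)                # paint detected corner pixels red

    cv2.imwrite("corners.jpg", img)
    print(f"Detected {int(corners.sum())} corner pixels")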

Computational Projects Added to Teaching Laboratories

Project ID Team Project Title Abstract
VLR23-P01 IIT2020011 ANKIT KUMAR Image Super-resolution SwinIR gives good results for the task of image super-resolution. In this report we explain the architecture of SwinIR and provide a comparison of the performance of different techniques.
VLR23-P02 IIB2020008 SAMRIDDHI V WALIA, IIB2020014 MOHAN LAL AGARWALA, IIB2020502 ANIRUDDH SHARMA, IIT2020166 SHANTANU CHAUDHARY Human Counting in Crowded Scenario using DETR In this report, we present a human detection and counting system developed using YOLOv3, a state-of-the-art deep learning algorithm for real-time object detection. The primary objective of this system is to provide efficient and accurate human detection in various surveillance scenarios, ranging from retail space monitoring to crowd management in public transportation systems. The importance of this system is underscored by its potential applications in public safety and health, particularly in contexts like monitoring crowd sizes for disease control purposes. The YOLOv3 algorithm is chosen for its balance between speed and accuracy, making it suitable for real-time application scenarios. Our system demonstrates its capability to effectively detect and count humans in diverse and dynamic environments, highlighting its potential as a versatile tool in surveillance and monitoring applications. (See the illustrative sketch after this table.)
VLR23-P03 MML2022001 RUPESH G, MML2022004 RAJ AHAMED SHAIK, MML2022016 ASHUTOSH VERMA Single Image Dehazing Image dehazing is the process of generating clear, haze-free images from hazy photographs. Although convolutional neural networks are commonly used for this task, image dehazing has yet to benefit from the recent breakthroughs that vision Transformers have brought to high-level vision problems. This paper's authors investigate the use of the Swin Transformer for image dehazing and propose DehazeFormer, which comprises changes to the normalisation layer, activation function, and spatial information aggregation approach. Several variants of DehazeFormer were trained on different datasets to demonstrate its efficiency. The large model outperformed all previous state-of-the-art approaches on the SOTS indoor set, whereas the small model outperformed FFA-Net with a substantially smaller number of parameters and lower computational cost. The approach's efficiency on severely non-homogeneous haze was further evaluated using a large realistic remote sensing dehazing dataset acquired by the researchers.
VLR23-P04 IIT2020018 BOTTE SHREYA, IIT2020040 KATAM BALA PRASANNA BABU, IIT2020199 VELPULA VAMSHI, IIT2020217 VELAGANA NAGENDRA, IIT2020255 DONTHOJU RAGHAVA Cross Day-Night Image Classification Image classification under cross day and night scenarios is a challenging problem in computer vision. The challenge of training a model on daytime photos from six distinct classes and assessing its performance on nighttime images from the same classes is covered in detail in this article. In addition to reviewing pertinent literature, describing the dataset, outlining the approach, and presenting experimental findings, we also explore the problems that this endeavour presents. The purpose of the study is to clarify any potential obstacles and solutions to this issue.
VLR23-P05 IIT2020173 ANISH JAIN, IIT2020181 JINIYA SINGAL, IIT2020182 DABERAO AKSHAY GAJANAN, IIT2020185 PATEL SAURABH, IIT2020188 SOLANKI TANMAY MOHANBHAI Hand Gesture Recognition using Deep CNN Hand gesture recognition is a critical component that offers a natural and intuitive means of communication with machines. In this paper, we present a novel approach to automated hand gesture recognition utilizing a deep convolutional neural network model. The model is designed to address challenges such as variations in hand poses, complex backgrounds, and lighting conditions, so that it works in real-world applications.
VLR23-P06 IIT2020031 RAUNAK KRISHAN JAISWAL, IIT2020033 ADITYA BISWAKARMA, IIT2020055 SAURABH KUMAR, IIT2020106 NEEL PATEL, IIT2020243 AKULA ABHIRAM Facial Micro-Expression Recognition using Deep Learning Techniques In computer vision, micro-expression (ME) detection refers to the process of detecting micro facial expressions in still images and videos. Our work presents a novel CNN-based method, built on ME datasets, to measure real emotions. MEs are very brief and hard to notice even for humans, and they reveal hidden emotions of the inner mind, which makes them a challenging and promising area of research. Our model aims at improving accuracy by overcoming the limitations of the present methods. As final-year BTech students in India, we hope our research contributes to understanding hidden emotions, paving the way for future investigations across various applications.
VLR23-P07 IIT2020005 PUSHKAL MADAAN, IIT2020006 RITEJ DHAMALA, IIT2020008 AVISHKAR SINGH, IIT2020077 ANUSHKA AJIT DANDAWATE, IIT2020252 KAVITA Self-Supervised Image Retrieval In this paper we propose a self-supervised image retrieval system that can work effectively and efficiently on large, unlabeled datasets, specifically the UC Merced satellite image dataset. The system leverages a pre-trained ResNet50 model, which helps our main Siamese network learn efficiently from the semantic similarities of our unlabeled data. In addition, we use KNN to retrieve the images most similar to a queried test image and measure its top-5 accuracy. The study also looks into how deep neural architectures might improve self-supervised retrieval systems. Design decisions that might impact the effectiveness of self-supervised models are examined, including architectural options, model complexity, and transferability between datasets. (See the illustrative sketch after this table.)
VLR23-P08 MML2022002 HARSH, MML2022011 UMESH MAURYA Tiny Face Detection This project proposes an innovative approach for the detection of small faces in photographs by leveraging the power of Generative Adversarial Networks (GANs). Recognizing small faces in real-world images has proven to be a challenging task for existing face identification methods. To address this challenge, a two-stage methodology is introduced, in which a GAN is first employed to generate high-resolution images of small faces. The GAN is trained on an extensive dataset of facial photos and learns to generate high-quality images of small faces while considering both the input image and the desired face size. These artificially generated small-face images serve as a valuable resource for augmenting the training data of a face detection model, which is trained separately on a distinct set of images. By incorporating the generated images into the training dataset, the face detection model becomes more adept at accurately identifying small faces in photographs, enhancing its overall performance.
VLR23-P09 IIB2020030 MANISH KUMAR, IIT2020021 HARSHITA VYAS, IIT2020037 SAKSHI, IIT2020095 AMBIKESH ARMAN, IIT2020134 SHAH KRISHNA DINESHKUMAR Image Denoising using Image-to-Image Translation Image denoising, a crucial feature in today's visual technology, involves the elimination of unwanted noise from images. Although modern cameras capture high-resolution pictures, obtaining noise-free images remains a challenge, which necessitates pre-processing or post-processing techniques to diminish noise without compromising image quality. Our approach leverages autoencoders for image denoising. Autoencoders have the ability to learn from the provided data themselves, generating a model based on the data rather than on predefined filters. Additionally, they aim to reproduce the input at the output, which helps preserve image quality. Though computational time remains a potential concern, the benefits of employing autoencoders in denoising applications, from medical systems to smartphone image enhancement, are substantial. This paper explores the use of autoencoders, a deep learning technique employing downsampling and upsampling, as a solution to the denoising problem. Keywords: image, noise, autoencoder, denoise. (See the illustrative sketch after this table.)
VLR23-P10 IIB2020036 MIRIYALA POOJITHA, IIT2020144 PRANAV RAJ, IIT2020151 SHIVAM KATIYAR, IIT2020163 SARTHAK DALMIA, IIT2020205 ADITYA RAJ Drowsiness Detection using Faces Drowsiness has become the focus of researchers' attention in recent years, as it is the cause of many traffic accidents, and detecting it can also help determine the need for rest. Drivers experience fatigue when they drive for long periods of time, which affects their driving ability and can lead to deaths and injuries in car accidents. Fatigue can be caused by long driving, discomfort, headache, alcohol, drugs, and so on. This research can play an important role in the lives of drivers and could save their lives. In this article, we introduce an Android application that can detect sleep, activity, and blink count. The app sounds an alarm when the driver falls asleep and could thereby save the driver's life. The app provides five pieces of data: state percentage, sleep, blink count, yawn count, and the number of frames captured by the app's camera. It retrieves the driver's login information, reports these measurements, and sounds a warning if the driver is drowsy.
VLR23-P11 IIT2020227 MOHD WASIF, IIT2020242 MOHD SARFARAZ, IIT2020247 SANJAY RAM, IIT2020254 CHAUDHARI YOGIRAJ PRAKASH, IIT2020259 ANKIT KUMAR Photo ID Retrieval from Arbitrary Face Query The "Photo ID Retrieval from Arbitrary Face Query" project aims to develop a sophisticated face recognition system capable of identifying individuals based on their facial features. The project uses a CNN model for feature extraction and evaluates the system's performance on the LFW (Labeled Faces in the Wild) dataset, a widely recognized benchmark for face recognition. This report provides a comprehensive overview of the system, including data collection, preprocessing, feature extraction, database creation, the face query process, visualization, experimental results, and a thorough discussion.
VLR23-P12 IIB2020016 ANURAG HARSH, IIB2020018 ABHISHEK KUMAR, IIB2020024 VAIDIK SHARMA, IIB2020027 AMAN UTKARSH, IIT2020140 AYUSHI Image Caption Generation We present a deep learning model for image caption generation. The model takes images as input and generates captions for them. We have made use of transfer learning, word embeddings, and custom data generators for building this model. We have evaluated the relevance of the generated captions using the BLEU score, computing both individual and cumulative BLEU scores. We have used Python, Keras, and TensorFlow for the development of this model. (See the illustrative sketch after this table.)
VLR23-P13 IIT2020052 SANJEET, IIT2020053 SAMEER AHMED, IIT2020082 HARSH GARG, IIT2020218 JITU RAJAK, IIT2020244 RAHUL Selfie vs Non-selfie Classification In the era of ubiquitous smartphone usage and social media platforms, the line between personal and non-personal images has blurred significantly. Selfies, which are self-portraits typically taken with a smartphone camera, have become a ubiquitous form of self-expression. However, automatically distinguishing between selfie and non-selfie images presents an interesting and challenging problem in computer vision. This semester project aims to address the problem by developing a robust image classification system that can accurately differentiate between selfie and non-selfie images.
VLR23-P14 MML2022009 MANISH KUMAR, MML2022013 BHAVESH KUMAR BOHARA, MML2022014 KAVATHIYA KHYATI HARESHBHAI Image Deraining Removing rain streaks from a single photograph is challenging because rainy images usually contain rain streaks of different densities, sizes, shapes, and directions. Most current deraining methods use a deep network that adheres to a broad "encoder-decoder" design, capturing low-level features in the early layers and high-level features in the deeper layers. The rain streaks that must be removed for deraining are quite small, so stressing global features is not always an effective strategy for the problem. Therefore, in this paper, we suggest using an overcomplete convolutional network architecture that emphasizes learning local structures by restricting the filters' receptive field. We combine it with U-Net to compute the derained image, so that the network concentrates more on low-level features without overlooking global structures. The proposed over-and-under complete deraining network (OUCD) is split into two branches: an undercomplete branch with larger receptive fields that focuses on global structures, and an overcomplete branch that focuses on local structures. Numerous experiments on synthetic and real-world datasets demonstrate that the proposed strategy performs better than the most recent state-of-the-art techniques.
VLR23-P15 IIT2020158 S ANURAG REDDY, IIT2020164 SAVALA DEEPIKA, IIT2020213 ANKADALA JEEVAN, IIT2020250 PULUKURI JAGADEESH, IIT2020266 NENAVATH ABHIRAM NAIK Homography Matrix Computation between Images using Deep Learning We introduce a deep convolutional neural network designed to estimate the relative homography between two images. Our feed-forward network has ten layers; it takes two stacked grayscale images as input and produces an 8-degree-of-freedom homography that maps pixels from the first image to the second. We offer two network architectures for HomographyNet: a regression network, which directly computes the real-valued homography parameters, and a classification network, which produces a distribution over quantized homographies. We employ a 4-point homography parameterization, which involves projecting the four corners of one image onto the other. Our networks are trained end-to-end on distorted MS-COCO images. Our method functions without requiring separate stages for local feature detection and transformation estimation. We compare our deep models with a conventional homography estimator based on ORB features and show the situations in which HomographyNet performs better than the conventional method. We further highlight the versatility of a deep learning approach by describing a range of applications driven by deep homography estimation. (See the illustrative sketch after this table.)
VLR23-P16 IIT2020044 PRIYA DEVI, IIT2020060 PERISETLA SRI SATWIK, IIT2020065 DASA AKSHITHA, IIT2020196 KALYANI BHUSHAN PHARKANDEKAR, IIT2020208 MARPINA SRUJANA Face Recognition from Partial Faces Partial face recognition, a crucial branch of facial recognition technology, identifies people from partially hidden or missing facial traits. This specialized topic has grown in popularity due to its potential uses in security, surveillance, and human-computer interaction. People frequently present their faces under a variety of circumstances, such as partial occlusion by masks or accessories, or poor lighting; hence the capacity to identify and verify persons using limited facial information is crucial in practice. This article examines the difficulties, approaches, and developments in partial face recognition, providing insight into how it is evolving in the context of biometric identification and surveillance systems.
VLR23-P17 MHC2022001 AMIT ROY, MHC2022011 BHARGAV BURMAN, MHC2022013 HARSHIT GUPTA, MML2022003 DIPANKAR KARMAKAR Plant Disease Classification The global population growth has resulted in a scarcity of essential resources such as raw materials and food supply. The agricultural industry has emerged as the primary source for addressing this issue; however, it faces significant challenges due to the presence of pests and various crop diseases. Plant diseases pose a significant challenge to global agriculture, leading to substantial crop losses, and recognising diseases in plants is complex and difficult owing to a dearth of specialised expertise. The utilisation of deep learning-based models enables the diagnosis of plant diseases through the analysis of leaf photographs. The primary challenges that remain to be addressed in these algorithms include the necessity for larger training sets, considerations related to computational complexity, the issue of overfitting, and other associated difficulties. This study centres on a novel machine learning model, derived from convolutional neural networks (CNN), with the aim of enhancing its effectiveness, and additionally provides a concise summary of the existing published solutions in this domain. In order to increase the size of the training set without the need for additional photographs, several augmentation techniques such as shift, shear, scaling, zooming, and flipping are employed; these approaches produce additional samples, hence enlarging the training set. The CNN model has been trained on a publicly accessible dataset called PlantVillage to accurately detect and classify Early Blight and Late Blight diseases in potato leaves. (See the illustrative sketch after this table.)
VLR23-P18 MHC2022005 DASAROJU JAGANNADHACHARI, MML2022005 PRAGATI, MML2022007 SAYANTAN CHAKRABORTY, MRM2022006 BEHERA JYOTHIKRISHNA Image Inpainting Using GAN Image inpainting is an important topic of research in the field of image processing. The prime goal of image inpainting is to recover missing details in an image, demosaic the image, and so on. In this paper we discuss the progress of the image inpainting project using deep learning models, namely the Generative Adversarial Network. The project has been implemented in the Google Colaboratory environment.
VLR23-P19 IIT2020025 MANPREET SINGH, IIT2020032 KARTIK GUPTA, IIT2020219 Tanu Shree Suthar, IIT2020221 TUSHAR AGGARWAL Visual Grounding using CNNs Computer vision algorithms are typically trained to predict a limited number of object categories, limiting their generality and applicability. Learning directly from raw text about images is a promising alternative that takes advantage of a much larger source of supervision. On a large dataset of millions of image and text pairs acquired from the internet, the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn state-of-the-art image representations from scratch. After pre-training, natural language is used to reference the learned visual concepts or describe new ones, enabling zero-shot transfer of the model to downstream tasks. The model applies readily to most tasks and is frequently competitive with a fully supervised baseline without requiring dataset-specific training. (See the illustrative sketch after this table.)
VLR23-P20 IIT2020009 AASHISH AGRAWAL, IIT2020010 RAJ CHHARI, IIT2020183 LOKESH MEHTA, IIT2020209 AADITYA RATHOD, IIT2020505 AKSHAT GHARIYA Viewpoint Invariant Scene Recognition of IIITA Campus using Deep Learning In this paper, we use deep learning approaches to handle the challenge of viewpoint-invariant scene recognition on the campus of the Indian Institute of Information Technology Allahabad (IIITA). The principal aim is to create a computer system that can identify different sights and landmarks on the IIITA campus while remaining robust to changes in viewpoint. We investigate the creation and use of a Convolutional Neural Network (CNN) model specifically designed for image classification. Our main goal is to use this model to categorize photos from the "Campus Images Dataset," a set of ten different categories covering the campus: admin, adminback, audi, cafeteria, cc2, cc3, cc3back, library, mandir, and rm.
VLR23-P21 IIB2020021 GAGAN BANSAL, MML2022006 MOHD FAIZ ANSARI, MML2022008 RAKSHIT SANDILYA, MML2022010 NIKHIL RAJPUT, MML2022012 HIMANSHU MITTAL Thermal to Visible Image Translation This report details the implementation and assessment of a pix2pix GAN for thermal-to-visible image translation. The project uses deep learning algorithms to produce high-quality visible images from thermal photos. An overview of the methodology, data collection, and pre-processing procedures is provided in the report. The model was assessed using the SSIM and PSNR metrics. The outcomes demonstrate that the suggested method successfully converts thermal images into realistic, high-quality visible images. This method may be applied in a variety of fields, including surveillance, search and rescue, and medical imaging. Overall, the report shows how well pix2pix GAN works for image translation tasks and offers suggestions for further study in this field. (See the illustrative sketch after this table.)
VLR23-P22 IIT2020154 SHIVEK PAMNANI, IIT2020160 ANUSHKA ARUN KALWALE, IIT2020179 KARUS MANISHA, IIT2020189 ROUNAK DEV, IIT2020190 MALYALA MEGHAMSH Clothing Outfit Rating using CNNs The fashion industry is changing, and it’s all because of the internet. The way we shop and see what everyone else is wearing has shifted to online platforms. Now, we need a system that tells us how good our outfit is. With this digital transformation came a need for automated clothing outfit rating systems. The one presented in this paper uses Convolutional Neural Networks (CNNs). Using deep learning and computer vision techniques, our system analyzes and evaluates clothing outfits based on many visual features. Some of those include color combinations, clothing styles, and overall aesthetics. To make it work, we took a pre-trained CNN architecture and fine-tuned it with a large dataset of labeled clothing outfits. The methodology works like this: the system first takes each piece of clothing in an outfit to extract feature representations from them. Then they’re combined to give the entire outfit an overall rating. We tested the performance using countless outfits with different styles, colors, and more. The results showed that our approach was highly effective at providing accurate ratings that have meaning behind them.
VLR23-P23 MRM2022002 AKASH TYAGI, MRM2022003 ANKIT RAJ RAVI, MRM2022004 ADITYA, MRM2022005 HIMANSHU MISHRA Impact of Different Activation Functions on ViT Model The advent of Vision Transformer (ViT) models has heralded a novel approach to handling image data, veering from traditional Convolutional Neural Networks (CNNs) towards transformer architectures. A significant factor influencing a ViT model's performance and training dynamics is the choice of activation function, which induces the requisite non-linearity for complex pattern recognition. This study embarks on an in-depth examination of various activation functions to discern their impact on the ViT model's effectiveness, training dynamics, and computational efficiency across multiple datasets. The aim is to furnish a nuanced understanding of how activation functions affect the learning, generalization, and robustness of ViT models, and to provide empirical guidelines for their optimal selection in different computer vision applications. Our findings elucidate the critical role of activation functions, offering valuable insights for the enhanced tuning and optimization of ViT models in computer vision tasks. (See the illustrative sketch after this table.)
VLR23-P24 IIT2020007 SHUBHAM KUMAR BHOKTA, IIT2020022 RAHUL MAHTO, IIT2020024 SHASHIKANT THAKUR, IIT2020043 ROHIT CHOWDHURY, IIT2020220 MOHIT KUMAR Identification of Artificially Generated Images In our proposed approach, we use deep convolutional neural networks, in particular the ResNet architecture, to distinguish between real and fake images. We concentrate on unique patterns and features found at the pixel level and in the structural properties of generated images. (See the illustrative sketch after this table.)
VLR23-P25 IIT2020067 ADITYA SINGH, IIT2020070 EKAGRA SINHA, IIT2020089 DEVESH KUMAR PARTE, IIT2020101 LUKESH NITIN PATIL, IIT2020105 JAMBHULE SAHAS DEVIDAS Student Counting in Classroom In this paper we propose a model to count the number of students in a classroom-like environment. This count can be used to estimate crowd density and may be helpful in making certain decisions or predictions.
VLR23-P26 RSI2022502 AJAY KUMAR YADAV Analysis of Robustness in Deep Learning Models
VLR23-P27 RSI2023001 AKASH VERMA Efficient ViT Models for Small-scale Datasets
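
Illustrative code sketches for selected projects. The snippets below are minimal, hedged sketches added for illustration; they are not the teams' project code, and all file names, thresholds, and sizes in them are placeholder assumptions.

VLR23-P02 (human counting): a sketch of detection-based person counting. The project used YOLOv3; purely for illustration, this sketch loads a COCO-pretrained YOLOv5 model through torch.hub (the ultralytics/yolov5 hub entry point) and counts detections of the COCO "person" class.

    # Count people in an image with a COCO-pretrained detector (sketch).
    # Assumptions: torch and the ultralytics/yolov5 hub entry point are available;
    # 'crowd.jpg' is a placeholder path; the project itself used YOLOv3.
    import torch

    model = torch.hub.load("ultralytics/yolov5", "yolov5s", pretrained=True)
    model.conf = 0.4                      # confidence threshold (a free choice)

    results = model("crowd.jpg")          # run inference on the image
    det = results.pandas().xyxy[0]        # one row per detection
    people = det[det["name"] == "person"]
    print(f"Estimated person count: {len(people)}")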
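
VLR23-P07 (self-supervised image retrieval): a sketch of the retrieval step only: images are embedded with a pre-trained ResNet50 (classification head removed) and the nearest neighbours of a query are returned. The gallery and query paths are placeholders, and the project's Siamese fine-tuning is omitted.

    # Embed images with a pre-trained ResNet50 and retrieve nearest neighbours (sketch).
    # Assumptions: torchvision and scikit-learn installed; the paths are placeholders.
    import numpy as np
    import torch
    import torchvision.models as models
    import torchvision.transforms as T
    from PIL import Image
    from sklearn.neighbors import NearestNeighbors

    backbone = models.resnet50(pretrained=True)
    backbone.fc = torch.nn.Identity()     # drop the classifier: 2048-d embeddings
    backbone.eval()

    tf = T.Compose([T.Resize((224, 224)), T.ToTensor(),
                    T.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])])

    def embed(path):
        x = tf(Image.open(path).convert("RGB")).unsqueeze(0)
        with torch.no_grad():
            return backbone(x).squeeze(0).numpy()

    gallery = ["img_000.jpg", "img_001.jpg", "img_002.jpg"]   # placeholder paths
    feats = np.stack([embed(p) for p in gallery])

    knn = NearestNeighbors(n_neighbors=min(5, len(gallery)), metric="cosine").fit(feats)
    _, idx = knn.kneighbors(embed("query.jpg")[None])
    print("Top matches:", [gallery[i] for i in idx[0]])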
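
VLR23-P09 (image denoising): a minimal convolutional autoencoder sketch in PyTorch, trained to map noisy images back to clean ones. The downsampling/upsampling structure mirrors what the abstract describes; the layer sizes and noise level are illustrative assumptions.

    # Minimal denoising convolutional autoencoder (sketch, PyTorch).
    # Assumption: clean_batch is a placeholder for real grayscale images in [0, 1].
    import torch
    import torch.nn as nn

    class DenoisingAE(nn.Module):
        def __init__(self):
            super().__init__()
            self.encoder = nn.Sequential(              # downsampling path
                nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU())
            self.decoder = nn.Sequential(              # upsampling path
                nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),
                nn.ConvTranspose2d(16, 1, 4, stride=2, padding=1), nn.Sigmoid())

        def forward(self, x):
            return self.decoder(self.encoder(x))

    model = DenoisingAE()
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.MSELoss()

    clean_batch = torch.rand(8, 1, 64, 64)             # placeholder for real data
    noisy_batch = (clean_batch + 0.1 * torch.randn_like(clean_batch)).clamp(0, 1)

    pred = model(noisy_batch)                          # reconstruct the clean image
    loss = loss_fn(pred, clean_batch)
    opt.zero_grad(); loss.backward(); opt.step()
    print(f"reconstruction loss: {loss.item():.4f}")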
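
VLR23-P12 (image captioning): a sketch of the evaluation step the abstract mentions, computing individual and cumulative BLEU scores with NLTK. The reference and candidate captions are made-up examples.

    # Individual and cumulative BLEU scores with NLTK (sketch).
    # Assumption: nltk installed; the captions below are made-up examples.
    from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

    reference = [["a", "dog", "runs", "on", "the", "grass"]]   # list of references
    candidate = ["a", "dog", "is", "running", "on", "grass"]
    smooth = SmoothingFunction().method1

    # Individual n-gram scores: all weight on a single n-gram order
    bleu1 = sentence_bleu(reference, candidate, weights=(1, 0, 0, 0),
                          smoothing_function=smooth)
    bleu2 = sentence_bleu(reference, candidate, weights=(0, 1, 0, 0),
                          smoothing_function=smooth)

    # Cumulative BLEU-4: geometric mean of 1- to 4-gram precisions
    bleu4 = sentence_bleu(reference, candidate, weights=(0.25, 0.25, 0.25, 0.25),
                          smoothing_function=smooth)
    print(f"BLEU-1 {bleu1:.3f}  BLEU-2 {bleu2:.3f}  cumulative BLEU-4 {bleu4:.3f}")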
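
VLR23-P15 (homography estimation): a sketch of the 4-point parameterization the abstract describes: the network predicts displacements of four corners, and the corresponding 3x3 homography is recovered with OpenCV. The displacement values here are made-up numbers standing in for network output.

    # Recover a 3x3 homography from the 4-point parameterization (sketch).
    # Assumption: the 8 corner displacements would come from the network;
    # here they are made-up values.
    import numpy as np
    import cv2

    h, w = 240, 320                                   # patch size (illustrative)
    corners = np.float32([[0, 0], [w, 0], [w, h], [0, h]])

    # Network output: (dx, dy) for each corner, i.e. 8 degrees of freedom
    pred_offsets = np.float32([[4, -3], [-6, 2], [5, 5], [-2, -4]])
    warped_corners = corners + pred_offsets

    H = cv2.getPerspectiveTransform(corners, warped_corners)
    print("Estimated homography:\n", H)

    # Sanity check: map the first corner through H
    p = np.array([0.0, 0.0, 1.0])
    q = H @ p
    print("corner (0,0) maps to", q[:2] / q[2])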
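
VLR23-P17 (plant disease classification): a sketch of the augmentation pipeline the abstract lists (shift, shear, zoom/scale, flip) using Keras' ImageDataGenerator. The directory path and image size are placeholder assumptions.

    # Shift/shear/zoom/flip augmentation with Keras (sketch).
    # Assumptions: TensorFlow installed; 'plantvillage/train' is a placeholder
    # directory laid out as one sub-folder per class.
    from tensorflow.keras.preprocessing.image import ImageDataGenerator

    augmenter = ImageDataGenerator(
        rescale=1.0 / 255,          # scale pixel values to [0, 1]
        width_shift_range=0.1,      # horizontal shift
        height_shift_range=0.1,     # vertical shift
        shear_range=0.2,            # shear
        zoom_range=0.2,             # zoom / scaling
        horizontal_flip=True)       # flipping

    train_gen = augmenter.flow_from_directory(
        "plantvillage/train", target_size=(224, 224),
        batch_size=32, class_mode="categorical")

    x_batch, y_batch = next(train_gen)   # each epoch sees freshly augmented samples
    print(x_batch.shape, y_batch.shape)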
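
VLR23-P19 (caption-image pre-training): a sketch of the symmetric contrastive objective behind "predict which caption goes with which image": matching image and text embeddings on the diagonal of a similarity matrix are pulled together with cross-entropy in both directions. The embeddings here are random placeholders for real encoder outputs.

    # Symmetric image-text contrastive loss (CLIP-style), sketch in PyTorch.
    # Assumption: img_emb / txt_emb stand in for the outputs of real encoders.
    import torch
    import torch.nn.functional as F

    batch, dim = 8, 512
    img_emb = F.normalize(torch.randn(batch, dim), dim=-1)   # placeholder features
    txt_emb = F.normalize(torch.randn(batch, dim), dim=-1)

    temperature = 0.07
    logits = img_emb @ txt_emb.t() / temperature   # pairwise cosine similarities

    # The i-th image matches the i-th caption: targets are the diagonal
    targets = torch.arange(batch)
    loss_i2t = F.cross_entropy(logits, targets)        # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)    # text -> image direction
    loss = (loss_i2t + loss_t2i) / 2
    print(f"contrastive loss: {loss.item():.3f}")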
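
VLR23-P21 (thermal-to-visible translation): a sketch of the evaluation the abstract mentions, computing PSNR and SSIM between a generated image and its ground truth with scikit-image. The two arrays are random placeholders for a pix2pix output and its target.

    # PSNR and SSIM between a generated and a ground-truth image (sketch).
    # Assumption: scikit-image installed; the arrays below are placeholders.
    import numpy as np
    from skimage.metrics import peak_signal_noise_ratio, structural_similarity

    rng = np.random.default_rng(0)
    target = rng.random((256, 256, 3))                 # placeholder ground truth
    generated = np.clip(target + 0.05 * rng.standard_normal(target.shape), 0, 1)

    psnr = peak_signal_noise_ratio(target, generated, data_range=1.0)
    ssim = structural_similarity(target, generated, channel_axis=-1, data_range=1.0)
    print(f"PSNR: {psnr:.2f} dB   SSIM: {ssim:.4f}")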
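
VLR23-P23 (activation functions in ViT): a sketch of the knob such a study varies: a transformer MLP block with a pluggable activation. Swapping nn.GELU for nn.ReLU or nn.SiLU changes only this non-linearity; everything else is a standard pre-norm block, and all sizes are illustrative.

    # Transformer MLP block with a configurable activation (sketch, PyTorch).
    # Assumption: sizes are illustrative; the activation class is the variable
    # under study.
    import torch
    import torch.nn as nn

    class MLPBlock(nn.Module):
        def __init__(self, dim=192, hidden=768, activation=nn.GELU):
            super().__init__()
            self.norm = nn.LayerNorm(dim)
            self.fc1 = nn.Linear(dim, hidden)
            self.act = activation()            # GELU / ReLU / SiLU, etc.
            self.fc2 = nn.Linear(hidden, dim)

        def forward(self, x):                  # pre-norm residual MLP
            return x + self.fc2(self.act(self.fc1(self.norm(x))))

    tokens = torch.randn(1, 197, 192)          # (batch, patches + CLS, dim)
    for act in (nn.GELU, nn.ReLU, nn.SiLU):
        out = MLPBlock(activation=act)(tokens)
        print(act.__name__, tuple(out.shape))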
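
VLR23-P24 (real vs. generated images): a sketch of the classification setup: a pretrained ResNet-18 (standing in for the project's ResNet) is given a fresh 2-way real/fake head, with one illustrative training step on placeholder data.

    # Fine-tune a pretrained ResNet for 2-class real/fake classification (sketch).
    # Assumption: the batches below are random placeholders for real training data.
    import torch
    import torch.nn as nn
    import torchvision.models as models

    model = models.resnet18(pretrained=True)
    model.fc = nn.Linear(model.fc.in_features, 2)      # real vs. fake head

    opt = torch.optim.Adam(model.parameters(), lr=1e-4)
    loss_fn = nn.CrossEntropyLoss()

    images = torch.randn(8, 3, 224, 224)               # placeholder image batch
    labels = torch.randint(0, 2, (8,))                 # 0 = real, 1 = fake

    model.train()
    logits = model(images)
    loss = loss_fn(logits, labels)
    opt.zero_grad(); loss.backward(); opt.step()
    print(f"step loss: {loss.item():.3f}")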

Grading

Prerequisites

Books

Disclaimer

The content (text, images, and graphics) used in these slides is adapted from many sources for academic purposes. Broadly, the sources have been given due credit appropriately. However, there is a chance that some original primary sources were missed. The authors of this material do not claim any copyright over such material.