Indian Institute of Information Technology, Allahabad

Computer Vision and Biometrics Lab (CVBL)

Visual Recognition

July - Dec 2021

Course Information

Objective of the course: Visual recognition has become part of our lives, with applications in self-driving cars, satellite monitoring, surveillance, and video analytics, particularly in scene understanding, crowd behaviour analysis, and action recognition. It eases human lives by acquiring, processing, analyzing, and understanding digital images, and by extracting high-dimensional data from the real world to produce numerical or symbolic information. Visual recognition encompasses image classification, localization, and detection. This course will help students understand the new tools, techniques, and methods that are shaping the field of visual recognition.

Outcome of the course: At the end of this course, students will be able to apply the concepts to solve real problems in recognition. They will be able to use computational visual recognition for problems ranging from extracting features and classifying images to detecting and outlining objects and activities in an image or video, using machine learning and deep learning concepts. Students will also be able to devise new visual recognition methods for various applications.
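As a small taste of the "classifying images" part of the pipeline described above, the sketch below trains a linear classifier with a softmax output by gradient descent. The data are synthetic 4-D feature vectors standing in for extracted image features; everything here is illustrative, not course code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two synthetic classes of 4-D "feature vectors", separated by their means.
X = np.vstack([rng.normal(-1.0, 0.5, (50, 4)),
               rng.normal(+1.0, 0.5, (50, 4))])
y = np.array([0] * 50 + [1] * 50)

W = np.zeros((4, 2))          # weights: features x classes
b = np.zeros(2)               # per-class biases

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)   # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

for _ in range(200):                       # plain gradient descent
    p = softmax(X @ W + b)                 # class probabilities
    p[np.arange(len(y)), y] -= 1           # gradient of cross-entropy w.r.t. scores
    W -= 0.1 * (X.T @ p) / len(y)
    b -= 0.1 * p.mean(axis=0)

pred = (X @ W + b).argmax(axis=1)
accuracy = (pred == y).mean()
print(f"training accuracy: {accuracy:.2f}")
```

The same update rule extends directly to more classes and to real image features; lectures L09-11 cover these classifiers in detail.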

Class meets
Monday: 04.00 - 06.00 pm; Friday: 10.00 am - 12.00 pm and 04.00 - 06.00 pm; Remote

Schedule - Lectures

Date Topic Optional Reading
L01: July 30: 04.00 PM - 05.00 PM Introduction Lecture
Slide, Recorded Lecture
L02: July 30: 05.00 PM - 06.00 PM Local Features: What, Why and How
Slide, Recorded Lecture
L03: August 06: 10.00 AM - 11.00 AM Corner Detection
Slide, Recorded Lecture
L04: August 06: 11.00 AM - 12.00 PM Harris Detector and Invariance Property
Slide, Recorded Lecture
L05: August 09: 04.00 PM - 05.00 PM Blob Detection: Harris-Laplacian (LoG), SIFT (DoG), Affine Invariant Detection
Slide, Recorded Lecture
L06: August 09: 05.00 PM - 06.00 PM Feature Description: SIFT and SURF
Slide, Recorded Lecture
L07: August 13: 10.00 AM - 11.00 AM Feature Description: LBP and HOG
Slide, Recorded Lecture
L08: August 27: 10.00 AM - 11.00 AM Image Categorization and Bag of Visual Words
Slide, Recorded Lecture
L09-11: August 27: 11.00 AM - 12.00 PM & 4.00 PM - 6.00 PM Classifiers for Image Categorization: KNN, Linear Classifier, SVM, Softmax
Slide, Recorded Lecture 1 Recorded Lecture 2
L12-13: August 30: 04.00 PM - 06.00 PM Neural Networks
Slide, Recorded Lecture
L14-15: September 03: 10.00 AM - 12.00 PM Convolutional Neural Networks (CNNs)
Slide, Recorded Lecture
L16-17: September 06: 04.00 PM - 06.00 PM Training Aspects of CNN: Activation Functions, Data Split, Data Preprocessing and Weight Initialization
Slide, Recorded Lecture
L18-19: September 10: 04.00 PM - 06.00 PM Training Aspects of CNN: Optimization, Learning Rate, Regularization, Dropout, Batch Normalization, Data Augmentation and Transfer Learning
Slide, Recorded Lecture
L20-21: September 24: 04.00 PM - 06.00 PM CNN Architectures - Plain Models: LeNet, AlexNet, VGG, NiN
Slide, Recorded Lecture1, Recorded Lecture2
L22-23: October 01: 04.00 PM - 06.00 PM CNN Architectures - DAG Models: GoogleNet, ResNet, DenseNet, etc.
Slide, Recorded Lecture1, Recorded Lecture2
L24-25: October 08: 10.00 AM - 12.00 PM CNN Architectures for Object Detection - R-CNN, Fast R-CNN, Faster R-CNN, YOLO, etc.
Slide, Recorded Lecture
L26: October 23: 10.00 AM - 11.00 AM Special Lecture on Person Recognition: A Biometric Approach by Dr. Satish Kumar Singh
Lecture Slide
L27: October 23: 11.00 AM - 12.00 PM Special Lecture on Multimodal Biometrics: A Reliable Way by Dr. Satish Kumar Singh
Lecture Slide
L28: October 23: 03.00 PM - 04.00 PM Special Lecture on DL Architectures for Recognition by Dr. Satish Kumar Singh
Lecture Slide, Recorded Video
L29: October 24: 10.00 AM - 11.00 AM Special Lecture on Hand Shape Coding: Multimodal Biometric by Dr. Satish Kumar Singh
Lecture Slide, Recorded Video
L30: October 24: 10.00 AM - 11.00 AM Special Lecture on Face Recognition under Surveillance by Dr. Satish Kumar Singh
Lecture Slide
L31: October 26: 06.00 PM - 07.00 PM Special Lecture on Biometric Security by Prof. Pritee Khanna (IIITDM Jabalpur)
Recorded Video
L32: October 26: 07.00 PM - 08.00 PM Special Lecture on DeepFakes by Dr. Kiran Raja (NTNU Norway)
Recorded Video
L33: October 26: 08.00 PM - 09.00 PM Special Lecture on Face Anti-spoofing by Dr. Shiv Ram Dubey
Lecture Slide, Recorded Video
L34: October 27: 08.00 PM - 09.00 PM Special Lecture on Facial Micro-expression Recognition by Dr. Shiv Ram Dubey
Lecture Slide, Recorded Video

Schedule - Tutorials and Labs

Date Topic Optional Reading
TL01-02: July 30: 10.00 AM - 12.00 PM Introduction to Python
Recorded Video
TL03-04: August 02: 04.00 PM - 06.00 PM Introduction to Python
Recorded Video
TL05-06: August 07: 10.00 AM - 12.00 PM Introduction to Python
Recorded Video
TL07: August 13: 11.00 AM - 12.00 PM Project Discussions
TL08-09: August 13: 04.00 PM - 06.00 PM Project Discussions
TL10-11: September 03: 04.00 PM - 06.00 PM Project Work
TL12-13: September 10: 10.00 AM - 12.00 PM CRP Assessment 1
TL14-15: October 04: 04.00 PM - 06.00 PM Project Discussions
TL16-17: October 08: 04.00 PM - 06.00 PM Project Discussions
TL18-19: October 18: 04.00 PM - 06.00 PM CRP Assessment 2

Computational Projects Added to Teaching Laboratories

Project ID Team Project Title Abstract
VR21_P01 Chinmay Tayade (IIT2018138), Inayat Baig (IIT2018165), Madhu (IIT2018068), Gurutej (IIT2018193) Number Plate Detection and Identification of Vehicles Automatic License Plate Recognition (ALPR) has been a frequent topic of research due to its many practical applications. However, many current solutions are still not robust in real-world situations and commonly depend on many constraints. Since every vehicle carries a unique number plate, the vehicle and its owner's details can be identified from it. We therefore propose a model based on YOLO (a deep learning based object detection architecture) and OCR, which detects vehicles and their number plates and retrieves the other details of the vehicle.
VR21_P02 Vikash (IIT2018110), Hitesh Kumar (IIT2018160), Shubham S (IIT2018200), M J Akhil Naik (IIT2018143), Nilang (IIT2018147) Moving Object Detection and Tracking with ISR Object detection and tracking are critical steps in computer vision algorithms. Robust object detection is challenging due to variations in scenes; another major challenge is tracking the object under occlusion. In this method, moving objects are detected using the TensorFlow object detection API, and a CNN-based object detection algorithm is used for robust detection, taking the location of the detected object as input. The proposed method is able to detect and track objects under different illumination and occlusion conditions in MPEG video.
VR21_P03 Raushan Raj (IIT2018031), Bindu (IIT2018105), Ayushi Gupta (IIT2018118), Sanjana (IIT2018120) Masked Face Recognition using Neural Networks In this technological era, artificial intelligence has become the new powerhouse of data analysis. With the advent of various machine learning and computer vision algorithms, their application in data analysis has become a general trend. However, the application of deep neural networks to analyzing masked face data, and the performance of these models, has yet to be explored to a great extent. In this work we propose a model trained such that, given an image as input, it recognizes the person and prints their name. Our proposed models achieve fairly high precision with a low cross-entropy loss.
VR21_P04 Atul Kumar (IIT2018030), Aman Raj Patwa (IIT2018038), Anshul Ahirwar (IIT2018099) Table Detection and Content Extraction from PDF Document Images Detecting and recognizing objects in unstructured environments is a difficult task in computer vision research. Table detection in document images is challenging because tables are diverse in size and complexity. This work provides an effective way not only to detect tables but also to extract their content by applying OCR to the detected text.
VR21_P05 Kisalaya Kishore (IIT2018079), Milan Bhuva (IIT2018176), Mohammed Aadil (IIT2018179), Ankit Rauniyar (IIT2018202) Interactive Indoor Scene Description to Aid in Navigation for Visually Impaired Individuals using Deep Learning This work introduces a methodology to help visually impaired people avoid obstacles in an indoor environment. We use the DepthNet-MiDaS large model to obtain the depth map of the scene, and in parallel use sparse optical flow to predict the paths of objects of interest. This is done to recognise objects that might cross paths with the user and pose a potential danger.
VR21_P06 Naukesh Goyal (IIT2018092), N.Lokesh Naik (IIT2018104), Nikhil Kumar (IIT2018152), Vishal Muwal (IIT2018153) Rating Content Based on Real-Time User Expressions We build a custom facial emotion detection system with a validation accuracy of 68.32% on the FER2013 dataset, along with a proof-of-concept web app that captures facial emotion data over time, with dynamic granularity, while the user watches content. This yields better input data for advanced content recommendation algorithms. A few state-of-the-art models reach accuracies as high as 71-72 percent, but our model is considerably faster and, having fewer parameters, can be used in real time on mobile devices.
VR21_P07 Hrutvik Kailas Nagrale (IIT2018088), Aastha Kumari (IIT2018091), Ravi Kumar Sharma (IIT2018108), Ratan Kumar Mandal (IIT2018136) Real Time Indian Sign Language Recognition Sign language is one of the oldest and most natural forms of communication, but since most people do not know sign language and interpreters are very difficult to find for day-to-day conversations, we have come up with a real-time method for finger-spelling-based Indian sign language recognition. In our method, the hand image is first passed through a median blur filter, the Canny edge detection algorithm is applied to the filtered image, and SURF feature extraction is performed on the result; a vocabulary of visual words is then obtained by clustering, and an SVM classifier is trained on histograms computed over these visual words. The method provides 99% accuracy for the 26 letters of the alphabet and the digits 1 to 9.
VR21_P08 Jaya Meena (IIT2018029), Suryasen Singh (IIT2018069), Rahul Yadav (IIT2018071), Vineet Kumar (IIT2018096) Explicit Image Detection This work explores the task of classifying images as explicit or non-explicit. Approaches to binary image classification range from simple CNN architectures to more sophisticated models such as VGG and ResNet. After several trials, we settled on the ResNet architecture for our classification task, as it had the highest accuracy among the models tried; the dense and output layers were fine-tuned by repeated testing of the designed architecture. We propose an approach for detecting images considered explicit or Not Safe for Work and for preventing the consumption of such content. The proposed deep learning model, based on a residual network, returns a numerical score measuring the explicitness of the input media; this score is compared to a defined threshold to categorize the content as explicit or non-explicit.
VR21_P09 Rithik Seth (IIT2018032), Hardik Kumawat (IIT2018034), Aman Joshi (IIT2018042), Milind Khatri (IIT2018082) Vision Assistant for Visually Impaired Individuals Artificial Intelligence has been touted as the next big thing, capable of altering the current landscape of the technological domain. Through Artificial Intelligence and Machine Learning, pioneering work has been done in vision and object detection. In this work, we analyze a Vision Assistant application for guiding visually impaired individuals. With recent breakthroughs in computer vision and supervised learning, the problem has been simplified to the point where new models are easier to build and implement on top of existing ones. Several object detection models now provide tracking and detection with high accuracy, and they have been widely used to automate detection tasks in different areas. Newer approaches such as YOLO (You Only Look Once), SSD (Single Shot Detector), and R-CNNs have proved consistent and accurate for real-time object detection. We briefly review these techniques to find a good base model for implementing our 'Vision Assistant'.
VR21_P10 Rahul Reddy Muppidi (IIT2018103), G Shashank (IIT2018106), A Prathyush (IIT2018124), A Rahul Naidu (IIT2018192) Building Detection from Aerial Images In this project, we extract buildings from high-resolution aerial images. This has many real-life applications, such as government decision making, civil defense operations, policing, and Geographic Information Systems, and quickly identifying buildings in disaster areas plays a major role in disaster assessment. However, building extraction from very high resolution imagery remains a great challenge, and deep learning methods have decreased its complexity while greatly increasing accuracy. We propose a framework based on Mask R-CNN, combining CNNs with traditional digital image processing and edge detection, and apply it to building extraction from satellite images. Mask R-CNN improves detection accuracy, and the reduced complexity also greatly reduces computational time.
VR21_P11 Ashwani Rai (IIT2018006), Sanjay Swami (IIT2018014), Sunidhi Kashyap (IIT2018016), Abhishek Mishra (IIT2018026) Deep CNN Model for Smoke Detection in Normal and Foggy Environment Smoke detection is very important, especially in a foggy environment. Our problem statement is to design a model that can address the increase in fire accidents in smart cities.
VR21_P12 Kshitij K. Gautam (IIT2018037), Rahul Thalor (IIT2018070), Divyansh Bhorvanshi (IIT2018072), Sourabh Thakur (IIT2018101) Hand Gesture Recognition: Contactless ATM During a pandemic, many people fear going out and are scared of touching anything in public places, but some facilities, like ATMs, cannot be used without touching them. We therefore propose a contactless ATM based on hand gesture recognition, so that people can avoid contact at ATMs too.
VR21_P13 Jishan Singh (IIT2018111), Prabal Tikeriha (IIT2018140), Fahad Ali (IIT2018148), Bhavya Jain (IIT2018151) Disguised Face Age Estimation Age estimation models have improved dramatically over time, but they are challenged by human disguises such as masks, beards, and mustaches. In this work, we develop a solution to estimate the age of a disguised person.
VR21_P14 Sagar Kumar (IIT2018154), Kartik Nema (IIT2018156), Bhupendra (IIT2018163), Prakhar Srivastava (IIT2018172) Text to Image Synthesis In this work we discuss a solution to the problem of text-to-image synthesis using GANs (Generative Adversarial Networks). We start with a brief introduction to the field, its usefulness, and the challenges involved; we then discuss GANs in depth, followed by our proposed methodology.
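Several of the projects above (e.g. VR21_P07) follow the classic bag-of-visual-words recipe covered in lecture L08: extract local descriptors, cluster them into a vocabulary, and represent each image as a histogram of visual words. The sketch below illustrates that recipe on synthetic 8-D descriptors standing in for SURF/SIFT output; the tiny k-means is illustrative, not the projects' actual code.

```python
import numpy as np

rng = np.random.default_rng(1)
K = 3                                      # vocabulary size (visual words)

# Synthetic local descriptors pooled from a set of training images.
descriptors = rng.normal(size=(300, 8))

# Plain k-means to build the visual vocabulary.
centroids = descriptors[rng.choice(len(descriptors), K, replace=False)]
for _ in range(20):
    dist = np.linalg.norm(descriptors[:, None] - centroids[None], axis=2)
    labels = dist.argmin(axis=1)           # nearest centroid per descriptor
    centroids = np.array([descriptors[labels == k].mean(axis=0)
                          if (labels == k).any() else centroids[k]
                          for k in range(K)])

def bovw_histogram(image_descriptors):
    """Encode one image's descriptors as a normalized visual-word histogram."""
    dist = np.linalg.norm(image_descriptors[:, None] - centroids[None], axis=2)
    hist = np.bincount(dist.argmin(axis=1), minlength=K).astype(float)
    return hist / hist.sum()

hist = bovw_histogram(rng.normal(size=(40, 8)))
print(hist)
```

In a full pipeline, these histograms would then be fed to a classifier such as an SVM, as in VR21_P07.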


Evaluation

• C1 (30%): 10% Written + 20% Practice
• C2 (30%): 10% Written + 20% Practice
• C3 (40%): 20% Written + 20% Practice


Prerequisites

• Computer Programming
• Data Structures and Algorithms
• Machine Learning
• Image and Video Processing
• Ability to deal with abstract mathematical concepts


The content (text, images, and graphics) used in these slides has been adapted from many sources for academic purposes. Broadly, the sources have been given due credit appropriately; however, some original primary sources may have been missed. The authors of this material do not claim any copyright over such material.