Indian Institute of Information Technology, Allahabad
Computer Vision and Biometrics Lab (CVBL)
Visual Recognition
July - Dec 2021
Course Information
Objective of the course: Visual recognition has become part of our daily lives, with applications in self-driving cars, satellite monitoring, surveillance and video analytics, particularly in scene understanding, crowd behaviour analysis and action recognition. It eases human effort by acquiring, processing, analyzing and understanding digital images and extracting high-dimensional data from the real world to produce numerical or symbolic information. Visual recognition encompasses image classification, localization and detection. This course will help students understand the new tools, techniques and methods that are shaping the field.
Outcome of the course: At the end of this course, students will be able to apply these concepts to solve real problems in recognition. They will be able to use computational visual recognition for tasks ranging from extracting features and classifying images to detecting and outlining objects and activities in an image or video, using machine learning and deep learning concepts. Students will also be able to devise new visual recognition methods for various applications.
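As a small taste of what this looks like in practice, the sketch below classifies a single image with an ImageNet-pretrained network in PyTorch. It is an illustrative example only, not official course material; the file name example.jpg is a placeholder and a recent torchvision version is assumed.

```python
import torch
from torchvision import models, transforms
from PIL import Image

# Standard ImageNet preprocessing: resize, crop, convert to tensor, normalize
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

# ImageNet-pretrained ResNet-18 in inference mode
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
model.eval()

img = preprocess(Image.open("example.jpg")).unsqueeze(0)  # add a batch dimension
with torch.no_grad():
    probs = model(img).softmax(dim=1)
print(probs.argmax(dim=1).item())  # index of the predicted ImageNet class
```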
- Class meets
- Monday: 04.00 - 06.00 pm, Friday: 10.00 - 12.00 pm and 04.00 - 06.00 pm; Remote
Schedule - Lectures
Date | Topic | Optional Reading |
L01: July 30: 04.00 PM - 05.00 PM | Introduction Lecture (Slide, Recorded Lecture) | |
L02: July 30: 05.00 PM - 06.00 PM | Local Features: What, Why and How (Slide, Recorded Lecture) | |
L03: August 06: 10.00 AM - 11.00 AM | Corner Detection (Slide, Recorded Lecture) | |
L04: August 06: 11.00 AM - 12.00 PM | Harris Detector and Invariance Property (Slide, Recorded Lecture) | |
L05: August 09: 04.00 PM - 05.00 PM | Blob Detection: Harris-Laplacian (LoG), SIFT (DoG), Affine Invariant Detection (Slide, Recorded Lecture) | |
L06: August 09: 05.00 PM - 06.00 PM | Feature Description: SIFT and SURF (Slide, Recorded Lecture) | |
L07: August 13: 10.00 AM - 11.00 AM | Feature Description: LBP and HOG (Slide, Recorded Lecture) | |
L08: August 27: 10.00 AM - 11.00 AM | Image Categorization and Bag of Visual Words (Slide, Recorded Lecture) | |
L09-11: August 27: 11.00 AM - 12.00 PM & 04.00 PM - 06.00 PM | Classifiers for Image Categorization: KNN, Linear Classifier, SVM, Softmax (Slide, Recorded Lecture 1, Recorded Lecture 2) | |
L12-13: August 30: 04.00 PM - 06.00 PM | Neural Networks (Slide, Recorded Lecture) | |
L14-15: September 03: 10.00 AM - 12.00 PM | Convolutional Neural Networks (CNNs) (Slide, Recorded Lecture) | |
L16-17: September 06: 04.00 PM - 06.00 PM | Training Aspects of CNNs: Activation Functions, Data Split, Data Preprocessing and Weight Initialization (Slide, Recorded Lecture) | |
L18-19: September 10: 04.00 PM - 06.00 PM | Training Aspects of CNNs: Optimization, Learning Rate, Regularization, Dropout, Batch Normalization, Data Augmentation and Transfer Learning (Slide, Recorded Lecture) | |
L20-21: September 24: 04.00 PM - 06.00 PM | CNN Architectures - Plain Models: LeNet, AlexNet, VGG, NiN (Slide, Recorded Lecture 1, Recorded Lecture 2) | |
L22-23: October 01: 04.00 PM - 06.00 PM | CNN Architectures - DAG Models: GoogLeNet, ResNet, DenseNet, etc. (Slide, Recorded Lecture 1, Recorded Lecture 2) | |
L24-25: October 08: 10.00 AM - 12.00 PM | CNN Architectures for Object Detection: R-CNN, Fast R-CNN, Faster R-CNN, YOLO, etc. (Slide, Recorded Lecture) | |
L26: October 23: 10.00 AM - 11.00 AM | Special Lecture on Person Recognition: A Biometric Approach by Dr. Satish Kumar Singh (Lecture Slide) | |
L27: October 23: 11.00 AM - 12.00 PM | Special Lecture on Multimodal Biometrics: A Reliable Way by Dr. Satish Kumar Singh (Lecture Slide) | |
L28: October 23: 03.00 PM - 04.00 PM | Special Lecture on DL Architectures for Recognition by Dr. Satish Kumar Singh (Lecture Slide, Recorded Video) | |
L29: October 24: 10.00 AM - 11.00 AM | Special Lecture on Hand Shape Coding Multimodal Biometric by Dr. Satish Kumar Singh (Lecture Slide, Recorded Video) | |
L30: October 24: 10.00 AM - 11.00 AM | Special Lecture on Face Recognition under Surveillance by Dr. Satish Kumar Singh (Lecture Slide) | |
L31: October 26: 06.00 PM - 07.00 PM | Special Lecture on Biometric Security by Prof. Pritee Khanna (IIITDM Jabalpur) (Recorded Video) | |
L32: October 26: 07.00 PM - 08.00 PM | Special Lecture on DeepFakes by Dr. Kiran Raja (NTNU Norway) (Recorded Video) | |
L33: October 26: 08.00 PM - 09.00 PM | Special Lecture on Face Anti-spoofing by Dr. Shiv Ram Dubey (Lecture Slide, Recorded Video) | |
L34: October 27: 08.00 PM - 09.00 PM | Special Lecture on Facial Micro-expression Recognition by Dr. Shiv Ram Dubey (Lecture Slide, Recorded Video) | |
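As a quick illustration of the local-feature lectures above (L02-L07), the sketch below detects SIFT keypoints in two images and matches their descriptors with Lowe's ratio test using OpenCV. The image paths are placeholders and the snippet is illustrative only, not course material.

```python
import cv2

# Load two grayscale images to match; the paths are placeholders
img1 = cv2.imread("scene1.jpg", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("scene2.jpg", cv2.IMREAD_GRAYSCALE)

# Detect SIFT keypoints and compute 128-D descriptors for each image
sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img1, None)
kp2, des2 = sift.detectAndCompute(img2, None)

# Brute-force matching with Lowe's ratio test to keep only distinctive matches
bf = cv2.BFMatcher(cv2.NORM_L2)
matches = bf.knnMatch(des1, des2, k=2)
good = [m for m, n in matches if m.distance < 0.75 * n.distance]
print(f"{len(good)} good matches out of {len(matches)}")
```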
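Likewise, the CNN training topics (L12-L19) can be prototyped in a few lines of PyTorch. The sketch below defines a tiny convolutional network with batch normalization and dropout and runs a single SGD step on a random batch that stands in for a real DataLoader; the architecture and hyperparameters are illustrative assumptions only.

```python
import torch
import torch.nn as nn

# A tiny CNN with batch normalization and dropout, for illustration only
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.BatchNorm2d(16), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Dropout(0.5),
    nn.Linear(32 * 8 * 8, 10),  # assumes 32x32 inputs and 10 target classes
)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

# One training step on a random batch (a real DataLoader would go here)
images = torch.randn(16, 3, 32, 32)
labels = torch.randint(0, 10, (16,))
optimizer.zero_grad()
loss = criterion(model(images), labels)
loss.backward()
optimizer.step()
print(loss.item())
```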
Schedule - Tutorials and Labs
Date | Topic | Optional Reading |
TL01-02: July 30: 10.00 AM - 12.00 PM | Introduction to Python (Recorded Video) | |
TL03-04: August 02: 04.00 PM - 06.00 PM | Introduction to Python (Recorded Video) | |
TL05-06: August 07: 10.00 AM - 12.00 PM | Introduction to Python (Recorded Video) | |
TL07: August 13: 11.00 AM - 12.00 PM | Project Discussions | |
TL08-09: August 13: 04.00 PM - 06.00 PM | Project Discussions | |
TL10-11: September 03: 04.00 PM - 06.00 PM | Project Work | |
TL12-13: September 10: 10.00 AM - 12.00 PM | CRP Assessment 1 | |
TL14-15: October 04: 04.00 PM - 06.00 PM | Project Discussions | |
TL16-17: October 08: 04.00 PM - 06.00 PM | Project Discussions | |
TL18-19: October 18: 04.00 PM - 06.00 PM | CRP Assessment 2 | |
Computational Projects Added to Teaching Laboratories
Project ID | Team | Project Title | Abstract |
VR21_P01 | Chinmay Tayade (IIT2018138), Inayat Baig (IIT2018165), Madhu (IIT2018068), Gurutej (IIT2018193) | Number Plate Detection and Identification of Vehicles | Automatic License Plate Recognition (ALPR) has been a frequent topic of research due to its many practical applications. However, many current solutions are still not robust in real-world situations and commonly depend on many constraints. Every vehicle carries its own unique number plate, so the plate can be used to identify the vehicle and its owner's details. We therefore propose a model based on YOLO (a deep-learning object detection architecture) and OCR that detects vehicles and their number plates and retrieves the other details of the vehicle (a simplified sketch of this pipeline follows the table). |
VR21_P02 | Vikash (IIT2018110), Hitesh Kumar (IIT2018160), Shubham S (IIT2018200), M J Akhil Naik (IIT2018143), Nilang (IIT2018147) | Moving Object Detection and Tracking with ISR | Object detection and tracking are critical steps of computer vision algorithms. Robust object detection is challenging due to variations within scenes, and another major challenge is tracking the object under occlusion. In this method, moving objects are detected using the TensorFlow Object Detection API, and a CNN-based object detection algorithm is used for robust detection, taking the location of the detected object as input. The proposed method is able to detect and track objects under different illumination and occlusion conditions from MPEG video. |
VR21_P03 | Raushan Raj (IIT2018031), Bindu (IIT2018105), Ayushi Gupta (IIT2018118), Sanjana (IIT2018120) | Masked Face Recognition using Neural Networks | In this technological era, artificial intelligence has become the new powerhouse of data analysis. With the advent of various machine learning and computer vision algorithms, their application to data analysis has become a general trend. However, applying deep neural networks to the analysis of masked face data, and studying the performance of these models, has yet to be explored to a great extent. In this work we propose a model trained such that, given an input image, it recognizes the masked face and outputs the name of the person. Our proposed models achieve fairly high precision with low cross-entropy loss. |
VR21_P04 | Atul Kumar (IIT2018030), Aman Raj Patwa (IIT2018038), Anshul Ahirwar (IIT2018099) | Table Detection and Content Extraction from PDF Document Images | Detecting and recognizing objects in unstructured environments is a difficult task in computer vision research. Table detection in document images is challenging because tables are diverse in size and complexity. This work provides an effective way not only to detect tables but also to extract the required content by applying OCR to the detected regions. |
VR21_P05 | Kisalaya Kishore (IIT2018079), Milan Bhuva (IIT2018176), Mohammed Aadil (IIT2018179), Ankit Rauniyar (IIT2018202) | Interactive Indoor Scene Description to Aid in Navigation for Visually Impaired Individuals using Deep Learning | This work introduces a methodology to help visually impaired people avoid obstacles in an indoor environment. We use the DepthNet-MiDaS large model to obtain the depth map of the scene, and in parallel use sparse optical flow to predict the paths of objects of interest. This is done to recognize objects that might cross the user's path and pose a potential danger (a minimal depth-estimation sketch follows the table). |
VR21_P06 | Naukesh Goyal (IIT2018092), N.Lokesh Naik (IIT2018104), Nikhil Kumar (IIT2018152), Vishal Muwal (IIT2018153) | Rating Content Based on Real-Time User Expressions | We build a custom facial emotion detection system with a validation accuracy of 68.32% on the FER2013 dataset, along with a proof-of-concept web app that captures facial emotion data while the user watches content, in a temporal manner with dynamic granularity, so that better input data can be generated for advanced content recommendation algorithms. A few state-of-the-art models reach accuracies as high as 71-72%, but our model is considerably faster and, with fewer parameters, can be used in real time on mobile devices. |
VR21_P07 | Hrutvik Kailas Nagrale (IIT2018088), Aastha Kumari (IIT2018091), Ravi Kumar Sharma (IIT2018108), Ratan Kumar Mandal (IIT2018136) | Real Time Indian Sign Language Recognition | Sign language is one of the oldest and most natural forms of communication, but most people do not know sign language and interpreters are very difficult to find for day-to-day conversations, so we have developed a real-time method for fingerspelling-based Indian Sign Language recognition. In our method, the hand image is first passed through a median blur filter and the Canny edge detector; SURF features are then extracted from the result, a vocabulary of visual words is obtained by clustering, and an SVM classifier is trained on the histograms computed over these visual words (a simplified sketch of this pipeline follows the table). The method achieves 99% accuracy on the 26 letters of the alphabet and the digits 1 to 9. |
VR21_P08 | Jaya Meena (IIT2018029), Suryasen Singh (IIT2018069), Rahul Yadav (IIT2018071), Vineet Kumar (IIT2018096) | Explicit Image Detection | This work explores the task of classifying images as explicit or non-explicit. Approaches to binary image classification range from simple CNN architectures to more sophisticated models such as VGG and ResNet; after several trials, we settled on a ResNet architecture, as it gave the highest accuracy among the models we tried. The dense and output layers used for fine-tuning were chosen by iterative testing of the designed architecture. We propose an approach for detecting images considered explicit or Not Safe for Work and preventing the consumption of such content. The proposed deep learning model is based on a residual network and returns a numerical value that measures the explicitness of the input media; this value is compared against a predefined threshold to categorize the content as explicit or non-explicit (a rough sketch follows the table). |
VR21_P09 | Rithik Seth (IIT2018032), Hardik Kumawat (IIT2018034), Aman Joshi (IIT2018042), Milind Khatri (IIT2018082) | Vision Assistant for Visually Impaired Individuals | Artificial Intelligence has been touted as the next big thing, capable of altering the current landscape of the technological domain. Through Artificial Intelligence and Machine Learning, pioneering work has been done in vision and object detection. In this work, we analyze a Vision Assistant application for guiding visually impaired individuals. With recent breakthroughs in computer vision and supervised learning models, the problem at hand has been reduced significantly, to the point where new models are easier to build and implement on top of existing ones. Several object detection models now provide object tracking and detection with high accuracy, and these techniques have been widely used to automate detection tasks in different areas. Newer detection approaches such as YOLO (You Only Look Once), SSD (Single Shot Detector) and R-CNNs have proved consistent and fairly accurate for real-time object detection. We briefly review these techniques in order to find a good base model for implementing our ‘Vision Assistant’. |
VR21_P10 | Rahul Reddy Muppidi (IIT2018103), G Shashank (IIT2018106), A Prathyush (IIT2018124), A Rahul Naidu (IIT2018192) | Building Detection from Aerial Images | In this project, we extract buildings from high-resolution aerial images. This has many real-life applications, such as government decision making, civil defense operations, policing and Geographic Information Systems, and quickly identifying buildings in disaster areas plays a major role in disaster assessment. However, building extraction from very high resolution imagery remains a complex and challenging task; deep learning methods have reduced this complexity and greatly increased accuracy. We propose a framework based on Mask R-CNN, a CNN combined with edge detection, applied to building extraction from satellite images. Our method consists of three parts and combines traditional digital image processing methods with CNNs. Mask R-CNN improves detection accuracy, and the reduced complexity also cuts the computational time by a great extent. |
VR21_P11 | Ashwani Rai (IIT2018006), Sanjay Swami (IIT2018014), Sunidhi Kashyap (IIT2018016), Abhishek Mishra (IIT2018026) | Deep CNN Model for Smoke Detection in Normal And Foggy Environment | Smoke detection is very important, especially in a foggy environment. Our problem statement is therefore to design a model that can help deal with the increase in fire accidents in smart cities. |
VR21_P12 | Kshitij K. Gautam(IIT2018037), Rahul Thalor (IIT2018070), Divyansh Bhorvanshi (IIT2018072), Sourabh Thakur (IIT2018101) | Hand Gesture Recognition: Contactless ATM | During a pandemic, many people fear going out and are wary of touching anything in public places, yet some facilities, such as ATMs, cannot be used without touch. We therefore propose a hand-gesture-based contactless ATM interface so that people can avoid contact at ATMs as well. |
VR21_P13 | Jishan Singh (IIT2018111), Prabal Tikeriha (IIT2018140), Fahad Ali (IIT2018148), Bhavya Jain (IIT2018151) | Disguised Face Age Estimation | Age estimation models have improved dramatically over time, but they are still challenged by human disguises such as masks, beards and moustaches. In this work, we develop a solution for estimating the age of a disguised person. |
VR21_P14 | Sagar Kumar (IIT2018154), Kartik Nema (IIT2018156), Bhupendra (IIT2018163), Prakhar Srivastava (IIT2018172) | Text to Image Synthesis | In this work we discuss a solution to the problem of text-to-image synthesis using GANs (Generative Adversarial Networks). We begin with a brief introduction to this field, its usefulness and the challenges involved, then discuss GANs in depth, followed by our proposed methodology. |
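As a rough illustration of the detection-plus-OCR pipeline described in VR21_P01, the following sketch crops plate regions found by a YOLO detector and reads them with Tesseract. The ultralytics and pytesseract packages, the weights file `plates.pt` and the image path are assumptions made for illustration, not the team's actual implementation.

```python
import cv2
import pytesseract            # Tesseract OCR wrapper (assumed installed)
from ultralytics import YOLO  # YOLOv8-style detector used as a stand-in

detector = YOLO("plates.pt")  # hypothetical weights fine-tuned on number plates
frame = cv2.imread("car.jpg")

# Crop each detected plate region and run OCR on a grayscale version of the crop
for box in detector(frame)[0].boxes.xyxy:
    x1, y1, x2, y2 = map(int, box.tolist())
    plate = cv2.cvtColor(frame[y1:y2, x1:x2], cv2.COLOR_BGR2GRAY)
    text = pytesseract.image_to_string(plate, config="--psm 7")  # single-line mode
    print(text.strip())
```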
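The indoor navigation aid of VR21_P05 builds on a monocular depth map from MiDaS. A minimal sketch of obtaining such a map via torch.hub is shown below; the small MiDaS variant and the image path are placeholders, and the optical-flow tracking stage of the project is omitted.

```python
import cv2
import torch

# Load a MiDaS depth model and its matching input transform from torch.hub
midas = torch.hub.load("intel-isl/MiDaS", "MiDaS_small")
midas.eval()
transform = torch.hub.load("intel-isl/MiDaS", "transforms").small_transform

img = cv2.cvtColor(cv2.imread("indoor.jpg"), cv2.COLOR_BGR2RGB)
with torch.no_grad():
    depth = midas(transform(img)).squeeze().cpu().numpy()  # relative inverse depth

# Larger values are relatively closer; nearby regions could trigger an obstacle warning
print(depth.shape, depth.max())
```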
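The fingerspelling recognizer of VR21_P07 follows a classic bag-of-visual-words recipe (filtering, edge detection, local features, clustering, SVM). The sketch below is a simplified, hypothetical version of that pipeline: it substitutes ORB for the patented SURF and assumes that `train_paths` and `train_labels` lists already exist.

```python
import cv2
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

orb = cv2.ORB_create()  # ORB as a freely available stand-in for SURF

def descriptors(path):
    """Median-blur and edge-detect the hand image, then extract local descriptors."""
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    edges = cv2.Canny(cv2.medianBlur(gray, 5), 100, 200)
    _, des = orb.detectAndCompute(edges, None)
    return des.astype(np.float64)

# Build a visual vocabulary by clustering all training descriptors
all_des = np.vstack([descriptors(p) for p in train_paths])  # train_paths: assumed image list
vocab = KMeans(n_clusters=100, n_init=10).fit(all_des)

def bovw_histogram(path):
    """Histogram of visual-word occurrences, normalized to sum to 1."""
    words = vocab.predict(descriptors(path))
    hist = np.bincount(words, minlength=100).astype(np.float64)
    return hist / hist.sum()

# Train an SVM on the bag-of-visual-words histograms
X = np.array([bovw_histogram(p) for p in train_paths])
clf = SVC(kernel="rbf").fit(X, train_labels)  # train_labels: assumed sign labels
```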
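For the explicit-content detector of VR21_P08, the described design (a residual network producing an explicitness score that is compared against a threshold) could look roughly like the sketch below; the backbone, head and threshold value are assumptions rather than the team's exact model.

```python
import torch
import torch.nn as nn
from torchvision import models

# Hypothetical binary classifier: pretrained ResNet backbone with a single
# sigmoid output interpreted as an "explicitness" score in [0, 1]
model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
model.fc = nn.Sequential(nn.Linear(model.fc.in_features, 1), nn.Sigmoid())
model.eval()

THRESHOLD = 0.5  # assumed decision threshold

def is_explicit(image_tensor):
    """image_tensor: a preprocessed (1, 3, 224, 224) batch; returns a bool."""
    with torch.no_grad():
        score = model(image_tensor).item()
    return score >= THRESHOLD

print(is_explicit(torch.randn(1, 3, 224, 224)))  # random tensor stands in for a real image
```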
Grading
- C1 (30%): 10% Written + 20% Practice
- C2 (30%): 10% Written + 20% Practice
- C3 (40%): 20% Written + 20% Practice
Prerequisites
- Computer Programming
- Data Structures and Algorithms
- Machine Learning
- Image and Video Processing
- Ability to deal with abstract mathematical concepts
Books
- Computer Vision: Algorithms and Applications, Richard Szeliski, Springer
- Deep Learning, Ian Goodfellow, Yoshua Bengio, and Aaron Courville, MIT Press
Related Classes / Online Resources
Disclaimer
The content (text, images, and graphics) used in these slides has been adapted from many sources for academic purposes. Broadly, the sources have been given due credit appropriately. However, there is a chance that some original primary sources have been missed. The authors of this material do not claim any copyright over such material.