Gurkirt Singh

Computer Vision Laboratory

Sternwartstrasse 7, ETH Zentrum

CH - 8092 Zürich, Switzerland

Office: ETF C114

About Me

See my CV here

I am a postdoctoral researcher with Prof. Luc Van Gool in the Computer Vision Lab at ETH Zurich. I received my Doctor of Philosophy (PhD) from the Artificial Intelligence and Vision Group at Oxford Brookes University in 2019, where I was advised by Dr. Fabio Cuzzolin. My PhD research focused on spatio-temporal action detection and prediction in realistic videos.

Earlier, I was a research engineer for two years in the imaging and computer vision group at Siemens research India, directed by Amit Kale. In 2013, I graduated from the master's in informatics (MOSIG) program at Institut National Polytechnique de Grenoble-INPG (School ENSIMAG) with a specialization in Graphics, Vision and Robotics (GVR). I completed my master's thesis under the supervision of Dr. Georgios Evangelidis and Dr. Radu Horaud at INRIA, Grenoble. I received a Bachelor of Technology degree in Electronics and Instrumentation Engineering from VIT University, Vellore, during which I had the chance to do an internship at the University of Edinburgh under the supervision of Dr. Bob Fisher.


Jan 2022: The ROAD paper has been accepted for publication in TPAMI

Oct 2021: The ROAD challenge hosted with ICCV 2021 was a success and is still available for submissions

Feb 2021: ROAD dataset is now online

June 2020: I am co-organising the SARAS-ESAD challenge at MIDL 2020 in Montreal, Canada

Feb 2020: I joined the Computer Vision Lab at ETH Zurich


Charades Challenge, 2017: Action Recognition, Rank: 2/10, Temporal Action Segmentation, Rank: 3/6.

ActivityNet Challenge, 2017: Untrimmed Video Classification, Rank: 3/29.

ActivityNet Challenge, 2016: Untrimmed Video Classification, Rank: 10/24, Activity detection, Rank: 2/6.

ChaLearn Looking at People Challenge, 2014: Gesture detection, Rank: 7/17.

ChaLearn Looking at People Challenge, 2013: Gesture detection, Rank: 17/54.


ROAD: The ROad event Awareness Dataset for Autonomous Driving
We introduce the ROad event Awareness Dataset (ROAD) for Autonomous Driving, to our knowledge the first of its kind. ROAD is designed to test an autonomous vehicle's ability to detect road events, defined as triplets composed of an active agent, the action(s) it performs and the corresponding scene locations. ROAD comprises videos originally from the Oxford RobotCar Dataset, annotated with bounding boxes showing the location in the image plane of each road event. We benchmark various detection tasks, proposing as a baseline a new incremental algorithm for online road event awareness termed 3D-RetinaNet. We also report the performance on the ROAD tasks of SlowFast and YOLOv5 detectors, as well as that of the winners of the ICCV 2021 ROAD challenge.
Gurkirt Singh, Stephen Akrigg, ......., & Fabio Cuzzolin
End-to-End Video Captioning
We propose to optimise both encoder and decoder simultaneously in an end-to-end fashion. In a two-stage training setting, we first initialise our architecture using pre-trained encoders and decoders; then, the entire network is trained end-to-end in a fine-tuning stage to learn the most relevant features for video caption generation. In our experiments, we use GoogLeNet and Inception-ResNetv2 as encoders and an original Soft-Attention (SA-) LSTM as a decoder. Analogously to gains observed in other computer vision problems, we show that end-to-end training significantly improves over the traditional, disjoint training process.
Silvio Olivastri, Gurkirt Singh, Fabio Cuzzolin.
Predicting Action Tubes
We present a method to predict an entire ‘action tube’ in a trimmed video just by observing a smaller subset of the video. We propose a Tube Prediction network (TPnet) which jointly predicts the past, present and future bounding boxes along with their action classification scores. At test time, TPnet is used in a (temporal) sliding window setting, and its predictions are put into a tube estimation framework to construct/predict the video-long action tubes, not only for the observed part of the video but also for the unobserved part.
Gurkirt Singh, Suman Saha, Fabio Cuzzolin.
AHB - ECCVW 2018
TraMNet - Transition Matrix Network for Efficient Action Tube Proposals
Current state-of-the-art methods solve spatio-temporal action localisation by extending 2D anchors to 3D-cuboid proposals on stacks of frames, to generate sets of temporally connected bounding boxes called action micro-tubes. However, considering every possible transition of n anchor proposals across f frames gives a proposal hypothesis search space of O(n^f). To avoid this problem we introduce a Transition-Matrix-based Network (TraMNet) which relies on computing transition probabilities between anchor proposals while maximising their overlap with ground truth bounding boxes across frames, and enforcing sparsity via a transition threshold. As the resulting transition matrix is sparse and stochastic, this reduces the proposal hypothesis search space from O(n^f) to the cardinality of the thresholded matrix.
Gurkirt Singh, Suman Saha, Fabio Cuzzolin.
ACCV 2018
Online Real-time Multiple Spatiotemporal Action Localisation and Prediction
We present a method for multiple spatiotemporal action localisation, classification, and early prediction based on a single deep learning framework, which is able to work under online and real-time constraints.
Gurkirt Singh, Suman Saha, Michael Sapienza, Philip Torr, Fabio Cuzzolin.
ICCV 2017
AMTnet: Action-Micro-Tube regression by end-to-end trainable deep architecture.
Dominant approaches provide sub-optimal solutions to the action detection problem, as they rely on seeking frame-level detections and constructing tubes from them. In this paper we radically depart from current practice, and take a first step towards the design and implementation of a deep network architecture able to classify and regress video-level micro-tubes.
Suman Saha, Gurkirt Singh, Fabio Cuzzolin.
ICCV 2017
Deep Learning for Detecting Multiple Space-Time Action Tubes in Videos
In this work we propose a new approach to the spatio-temporal localisation (detection) and classification of multiple concurrent actions within temporally untrimmed videos. We demonstrate the performance of our algorithm on the challenging UCF101, J-HMDB-21 and LIRIS-HARL datasets, achieving new state-of-the-art results across the board and significantly lower detection latency at test time.
Suman Saha, Gurkirt Singh, Michael Sapienza, Philip Torr, Fabio Cuzzolin.
BMVC 2016
Untrimmed Video Classification for Activity Detection: submission to ActivityNet Challenge
In this work we propose a simple, yet effective, method for the temporal detection of activities in temporally untrimmed videos with the help of untrimmed classification. This method secured 2nd place in the activity detection task at the ActivityNet Challenge 2016 [Results].
Gurkirt Singh and Fabio Cuzzolin.
CVPR 2016 ActivityNet workshop, 2nd place in detection task.
Continuous gesture recognition from articulated poses
This paper addresses the problem of continuous gesture recognition from articulated poses. Unlike the common isolated recognition scenario, the gesture boundaries are here unknown, and one has to solve two problems: segmentation and recognition. This is cast into a labeling framework, namely every site (frame) must be assigned a label (gesture ID). The inherent constraint for a piece-wise constant labeling is satisfied by solving a global optimization problem with a smoothness term. This method secured 7th place in the gesture detection task of the ChaLearn LaP Challenge using only skeleton data.
Georgios Evangelidis, Gurkirt Singh, Radu Patrice Horaud.
ECCV 2014 workshop
Skeletal Quads: Human action recognition using joint quadruples
In this context, we propose a local skeleton descriptor that encodes the relative position of joint quadruples. Such a coding implies a similarity normalisation transform that leads to a compact (6D or 5D) view-invariant skeletal feature, referred to as skeletal quad. In the references below, we use this descriptor in conjunction with a Fisher kernel in order to encode gesture or action (sub)sequences. The short length of the descriptor compensates for the large inherent dimensionality associated with Fisher vectors.
Georgios Evangelidis, Gurkirt Singh, Radu Patrice Horaud.
ICPR 2014
Frame-wise representations of depth videos for action recognition
We present three types of depth data representation from depth frames, referred to as the single-reference representation, the multiple-reference representation and the Quad representation.
Gurkirt Singh
Master thesis, INRIA and Grenoble Institute of Technology, France, 2013
Supervisors: Dr. Radu Horaud and Dr. Georgios Evangelidis
Categorizing Abnormal behavior from an indoor overhead camera
We propose an approach that uses an overhead camera to detect abnormal activities with the help of trajectory classification.
Gurkirt Singh
Bachelor thesis, University of Edinburgh and VIT University, 2010
Supervisor: Dr. Bob Fisher


I made an attempt to compile recent works on action recognition in a more searchable format. Check it out on my older page.

My old research page has an interesting review of action recognition and prediction works.

Citation Graph: part of the submission from reading Group-1 at ICVSS 2016 (only works in Firefox)
