Gurkirt Singh

Computer Vision Laboratory

Sternwartstrasse 7, ETH Zentrum

CH - 8092 Zürich, Switzerland

Office: ETF C114

About Me

See my CV here

I am a postdoctoral researcher with Prof. Luc Van Gool in the Computer Vision Lab at ETH Zurich. I received a Doctor of Philosophy (PhD) in the Artificial Intelligence and Vision Group at Oxford Brookes University in 2019. I was advised by Dr. Fabio Cuzzolin. My PhD research was focused on spatio-temporal action detection and prediction in realistic videos.

Earlier, I was a research engineer for two years in the imaging and computer vision group at Siemens Research India, directed by Amit Kale. In 2013, I graduated from the master's in informatics (MOSIG) program at Institut National Polytechnique de Grenoble-INPG (School ENSIMAG) with a specialization in Graphics, Vision and Robotics (GVR). I completed my master's thesis under the supervision of Dr. Georgios Evangelidis and Dr. Radu Horaud at INRIA, Grenoble. I received a Bachelor of Technology degree in Electronics and Instrumentation Engineering from VIT University, Vellore, during which I had the chance to do an internship at the University of Edinburgh under the supervision of Dr. Bob Fisher.


June 2020: I am co-organising the SARAS-ESAD challenge at MIDL 2020 in Montreal, Canada

Feb 2020: I joined the Computer Vision Lab at ETH Zurich

Aug 2019: I was selected for the Doctoral Consortium at ICCV 2019

Aug 2019: I was selected as the best reviewer for ICCV 2019

June 2019: Presenting work on causal representations at the Oxford Robotics Research Group Seminars; see you there on the 24th

Sept 2018: Our paper on "Transition Matrix network" is accepted at ACCV Perth, 2018

Aug 2018: Our paper on "Predicting Action Tubes" is accepted at the AHB2018 workshop at ECCV, 2018

July 2018: Our paper on "Incremental Tube Construction for Human Action Detection" is accepted at BMVC, York, 2018

Dec 2017: A PyTorch implementation of our work on online real-time action detection is available on GitHub

Dec 2017: A PyTorch implementation of two-stream InceptionV3 trained for action recognition on the Kinetics dataset is available on GitHub

July 2017: My work at Disney Research Pittsburgh with Leonid Sigal and Andreas Lehrmann secured 2nd place in the Charades Challenge, second only to the DeepMind entry

July 2017: Two papers were accepted at ICCV 2017; both are available below.

June 2016: Our paper on "Deep Learning for Detecting Multiple Space-Time Action Tubes in Videos" is accepted at BMVC, York, 2016

June 2016: Our team secured 2nd place at the ActivityNet Challenge 2016 in the activity detection task [Results]. Our approach is described in an arXiv technical report.


Charades Challenge, 2017: Action Recognition, Rank: 2/10, Temporal Action Segmentation, Rank: 3/6.

ActivityNet Challenge, 2017: Untrimmed Video Classification, Rank: 3/29.

ActivityNet Challenge, 2016: Untrimmed Video Classification, Rank: 10/24, Activity Detection, Rank: 2/6.

ChaLearn Looking at People Challenge, 2014: Gesture Detection, Rank: 7/17.

ChaLearn Looking at People Challenge, 2013: Gesture Detection, Rank: 17/54.


Recurrent Convolutions for Causal 3D CNNs
We propose a novel Recurrent Convolutional Network (RCN), which relies on recurrence to capture the temporal context across frames at each network level. Our network decomposes 3D convolutions into (1) a 2D spatial convolution component, and (2) an additional hidden state 1 × 1 convolution, applied across time. The hidden state at any time t is assumed to depend on the hidden state at t − 1 and on the current output of the spatial convolution component. As a result, the proposed network: (i) produces causal outputs, (ii) provides flexible temporal reasoning, (iii) preserves temporal resolution.
Gurkirt Singh, Fabio Cuzzolin.
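The recurrence described in the RCN abstract above can be illustrated with a minimal NumPy sketch. It is not the released implementation: the function names `conv1x1` and `recurrent_conv_unit`, the zero initial hidden state, and the choice to combine the two terms by a sum followed by a ReLU are all assumptions made for illustration, and the per-frame spatial features are taken as given rather than computed by a 2D convolution.

```python
import numpy as np

def conv1x1(x, w):
    """1x1 convolution: mixes channels independently at every pixel.

    x: (C_in, H, W) feature map, w: (C_out, C_in) weights.
    """
    return np.einsum("oc,chw->ohw", w, x)

def recurrent_conv_unit(spatial_feats, w_hidden):
    """Causal recurrence over per-frame spatial features (hypothetical helper).

    spatial_feats: (T, C, H, W), assumed to be the outputs of a 2D spatial
                   convolution applied to each frame separately.
    w_hidden:      (C, C) weights of the hidden-state 1x1 convolution.
    Returns one hidden state per frame, shape (T, C, H, W).
    """
    h = np.zeros_like(spatial_feats[0])  # assumed zero initial state
    outputs = []
    for s_t in spatial_feats:
        # h_t depends only on h_{t-1} and the current spatial output.
        # The sum + ReLU combination is an assumption of this sketch.
        h = np.maximum(s_t + conv1x1(h, w_hidden), 0.0)
        outputs.append(h)
    return np.stack(outputs)

rng = np.random.default_rng(0)
feats = rng.standard_normal((6, 4, 5, 5))
w = 0.1 * rng.standard_normal((4, 4))
out = recurrent_conv_unit(feats, w)

# Perturb only the last two frames and re-run.
feats_future = feats.copy()
feats_future[4:] += 1.0
out_future = recurrent_conv_unit(feats_future, w)
```

In this sketch the output has one hidden state per input frame, so temporal resolution is preserved, and changing frames 4 and 5 leaves the outputs for frames 0 to 3 untouched, which is what causality means here.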
End-to-End Video Captioning
We propose to optimise both encoder and decoder simultaneously in an end-to-end fashion. In a two-stage training setting, we first initialise our architecture using pre-trained encoders and decoders; then, the entire network is trained end-to-end in a fine-tuning stage to learn the most relevant features for video caption generation. In our experiments, we use GoogLeNet and Inception-ResNetv2 as encoders and an original Soft-Attention (SA-) LSTM as a decoder. Analogously to gains observed in other computer vision problems, we show that end-to-end training significantly improves over the traditional, disjoint training process.
Silvio Olivastri, Gurkirt Singh, Fabio Cuzzolin.
Predicting Action Tubes
We present a method to predict an entire ‘action tube’ in a trimmed video just by observing a smaller subset of the video. We propose a Tube Prediction network (TPnet) which jointly predicts the past, present and future bounding boxes along with their action classification scores. At test time TPnet is used in a (temporal) sliding window setting, and its predictions are put into a tube estimation framework to construct/predict the video-long action tubes not only for the observed part of the video but also for the unobserved part.
Gurkirt Singh, Suman Saha, Fabio Cuzzolin.
AHB - ECCVW 2018
TraMNet - Transition Matrix Network for Efficient Action Tube Proposals
Current state-of-the-art methods solve spatio-temporal action localisation by extending 2D anchors to 3D-cuboid proposals on stacks of frames, to generate sets of temporally connected bounding boxes called action micro-tubes. To avoid this problem we introduce a Transition-Matrix-based Network (TraMNet) which relies on computing transition probabilities between anchor proposals while maximising their overlap with ground truth bounding boxes across frames, and enforcing sparsity via a transition threshold. As the resulting transition matrix is sparse and stochastic, this reduces the proposal hypothesis search space from O(n^f) to the cardinality of the thresholded matrix.
Gurkirt Singh, Suman Saha, Fabio Cuzzolin.
ACCV 2018
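To give a rough feel for how a thresholded transition matrix shrinks the proposal space, here is a small sketch for the two-frame case (f = 2). It is not the TraMNet code: the helpers `transition_matrix` and `sparse_proposals`, the toy anchor pairs, and the idea of estimating probabilities by counting which anchor a ground-truth box matches in consecutive frames are assumptions made for illustration.

```python
import numpy as np

def transition_matrix(anchor_pairs, n_anchors):
    """Row-stochastic matrix of anchor-to-anchor transition frequencies.

    anchor_pairs: iterable of (i, j), assumed to mean that a ground-truth box
                  best matched anchor i in one frame and anchor j in the next.
    """
    counts = np.zeros((n_anchors, n_anchors))
    for i, j in anchor_pairs:
        counts[i, j] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    # Rows with no observations stay all-zero instead of dividing by zero.
    return np.divide(counts, row_sums,
                     out=np.zeros_like(counts), where=row_sums > 0)

def sparse_proposals(probs, threshold):
    """Keep only anchor pairs whose transition probability exceeds threshold."""
    rows, cols = np.nonzero(probs > threshold)
    return list(zip(rows.tolist(), cols.tolist()))

# Toy data: 4 anchors, a handful of observed transitions.
pairs = [(0, 0), (0, 0), (0, 1), (1, 1), (2, 1), (2, 2), (2, 2), (2, 2)]
n = 4
probs = transition_matrix(pairs, n)
proposals = sparse_proposals(probs, threshold=0.3)
```

With these toy counts, 4 anchor pairs survive the threshold out of the n^2 = 16 pairs that exhaustive pairing would enumerate.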
Online Real-time Multiple Spatiotemporal Action Localisation and Prediction
We present a method for multiple spatiotemporal action localisation, classification, and early prediction based on a single deep learning framework, which is able to work under online and real-time constraints.
Gurkirt Singh, Suman Saha, Michael Sapienza, Philip Torr, Fabio Cuzzolin.
ICCV 2017
AMTnet: Action-Micro-Tube regression by end-to-end trainable deep architecture.
Dominant approaches provide sub-optimal solutions to the action detection problem, as they rely on seeking frame-level detections and constructing tubes from them. In this paper we radically depart from current practice, and take a first step towards the design and implementation of a deep network architecture able to classify and regress video-level micro-tubes.
Suman Saha, Gurkirt Singh, Fabio Cuzzolin.
ICCV 2017
Deep Learning for Detecting Multiple Space-Time Action Tubes in Videos
In this work we propose a new approach to the spatio-temporal localisation (detection) and classification of multiple concurrent actions within temporally untrimmed videos. We demonstrate the performance of our algorithm on the challenging UCF101, J-HMDB-21 and LIRIS-HARL datasets, achieving new state-of-the-art results across the board and significantly lower detection latency at test time.
Suman Saha, Gurkirt Singh, Michael Sapienza, Philip Torr, Fabio Cuzzolin.
BMVC 2016
Untrimmed Video Classification for Activity Detection: submission to ActivityNet Challenge
In this work we propose a simple, yet effective, method for the temporal detection of activities in temporally untrimmed videos with the help of untrimmed classification. This method secured 2nd place at the ActivityNet Challenge 2016 in the activity detection task [Results].
Gurkirt Singh and Fabio Cuzzolin.
CVPR 2016 ActivityNet workshop, 2nd place in detection task.
Continuous gesture recognition from articulated poses
This paper addresses the problem of continuous gesture recognition from articulated poses. Unlike the common isolated recognition scenario, the gesture boundaries are here unknown, and one has to solve two problems: segmentation and recognition. This is cast into a labeling framework, namely every site (frame) must be assigned a label (gesture ID). The inherent constraint for a piece-wise constant labeling is satisfied by solving a global optimization problem with a smoothness term. This method secured 7th place in the gesture detection task of the ChaLearn LaP Challenge using only skeleton data.
Georgios Evangelidis, Gurkirt Singh, Radu Patrice Horaud.
ECCV 2014 workshop
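The labeling formulation above (one gesture ID per frame, with a smoothness term that favours piece-wise constant output) can be illustrated with a small dynamic-programming sketch. This is not the solver used in the paper: the function name `smooth_labeling`, the constant penalty per label change, and the toy per-frame costs are assumptions made for illustration.

```python
import numpy as np

def smooth_labeling(unary_cost, switch_penalty):
    """Globally optimal frame labeling under a label-change penalty.

    unary_cost:     (T, K) array, cost of giving frame t the label k
                    (assumed to come from some per-frame classifier).
    switch_penalty: cost paid each time consecutive frames differ in label.
    Returns a list of T labels minimising the sum of unary costs plus penalties.
    """
    T, K = unary_cost.shape
    pairwise = switch_penalty * (1.0 - np.eye(K))  # 0 on diagonal, penalty off it
    cost = unary_cost[0].copy()
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        total = cost[:, None] + pairwise           # total[i, j]: previous i, current j
        back[t] = np.argmin(total, axis=0)         # best previous label for each j
        cost = total[back[t], np.arange(K)] + unary_cost[t]
    # Backtrack from the cheapest final label.
    labels = [int(np.argmin(cost))]
    for t in range(T - 1, 0, -1):
        labels.append(int(back[t][labels[-1]]))
    return labels[::-1]

# Toy costs: two gestures, with frame 1 noisily preferring label 1.
unary = np.array([[0.1, 0.9],
                  [0.6, 0.4],
                  [0.1, 0.9],
                  [0.9, 0.1],
                  [0.9, 0.1],
                  [0.8, 0.2]])
smoothed = smooth_labeling(unary, switch_penalty=0.5)
framewise = smooth_labeling(unary, switch_penalty=0.0)
```

With the penalty switched off, the labeling follows the noisy frame; with a penalty of 0.5 the noisy frame is absorbed and a single boundary remains, so segmentation and recognition are solved together.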
Skeletal Quads: Human action recognition using joint quadruples
In this context, we propose a local skeleton descriptor that encodes the relative position of joint quadruples. Such a coding implies a similarity normalisation transform that leads to a compact (6D or 5D) view-invariant skeletal feature, referred to as skeletal quad. In the references below, we use this descriptor in conjunction with a Fisher kernel in order to encode gesture or action (sub)sequences. The short length of the descriptor compensates for the large inherent dimensionality associated with Fisher vectors.
Georgios Evangelidis, Gurkirt Singh, Radu Patrice Horaud.
ICPR 2014
Frame-wise representations of depth videos for action recognition
We present three types of depth data representation from depth frames, referred to as the single-reference representation, the multiple-reference representation and the Quad representation.
Gurkirt Singh
Master thesis, INRIA and Grenoble Institute of Technology, France, 2013
Supervisors: Dr. Radu Horaud and Dr. Georgios Evangelidis
Categorizing Abnormal behavior from an indoor overhead camera
We propose an approach that uses an overhead camera to detect abnormal activities with the help of trajectory classification.
Gurkirt Singh
Bachelor thesis, University of Edinburgh and VIT University, 2010
Supervisor: Dr. Bob Fisher


I made an attempt to compile recent works on action recognition in a more searchable format. Check it out on my older page.

My old research page has an interesting review of action recognition and prediction works.

Citation Graph: part of the submission from Reading Group 1 at ICVSS 2016 (only works in Firefox)