Gurkirt Singh
Sternwartstrasse 7, ETH Zentrum
CH - 8092 Zürich, Switzerland
Office: ETF C114
News
June 2020: I am co-organising the SARAS-ESAD challenge at MIDL 2020 in Montreal, Canada
Feb 2020: I joined the Computer Vision Lab at ETH Zurich
Aug 2019: I was selected for the Doctoral Consortium at ICCV 2019
Aug 2019: I was selected as one of the best reviewers for ICCV 2019
June 2019: Presenting our work on causal representations at the Oxford Robotics Research Group seminar, see you there on the 24th
Sept 2018: Our paper on "Transition Matrix Network" is accepted at ACCV 2018, Perth
Aug 2018: Our paper on "Predicting Action Tubes" is accepted at the AHB 2018 workshop at ECCV 2018
July 2018: Our paper on "Incremental Tube Construction for Human Action Detection" is accepted at BMVC 2018
Dec 2017: A PyTorch implementation of our work on Online Real-time Action Detection is available on GitHub
Dec 2017: A PyTorch implementation of a two-stream InceptionV3 trained for action recognition on the Kinetics dataset is available on GitHub
July 2017: My work at Disney Research Pittsburgh with Leonid Sigal and Andreas Lehrmann secured 2nd place in the Charades Challenge, second only to the DeepMind entry
July 2017: Two papers were accepted at ICCV 2017, available below.
June 2016: Our paper on "Deep Learning for Detecting Multiple Space-Time Action Tubes in Videos" is accepted at BMVC 2016, York
June 2016: Our team secured 2nd place in the activity detection task at the ActivityNet Challenge 2016 [Results].
Our approach is described in an arXiv technical report.
Contests
Charades Challenge, 2017: Action Recognition, Rank: 2/10; Temporal Action Segmentation, Rank: 3/6.
ActivityNet Challenge, 2017: Untrimmed Video Classification, Rank: 3/29.
ActivityNet Challenge, 2016: Untrimmed Video Classification, Rank: 10/24; Activity Detection, Rank: 2/6.
ChaLearn Looking at People Challenge, 2014: Gesture Detection, Rank: 7/17.
ChaLearn Looking at People Challenge, 2013: Gesture Detection, Rank: 17/54.
Publications
Recurrent Convolutions for Causal 3D CNNs
We propose a novel Recurrent Convolutional Network (RCN), which relies on recurrence to capture the temporal context across frames at each network level. Our network decomposes 3D convolutions into (1) a 2D spatial convolution component and (2) an additional hidden-state 1 × 1 convolution applied across time. The hidden state at any time t is assumed to depend on the hidden state at t − 1 and on the current output of the spatial convolution component. As a result, the proposed network: (i) produces causal outputs, (ii) provides flexible temporal reasoning, and (iii) preserves temporal resolution. A minimal sketch of one such recurrent layer is given below.
Gurkirt Singh, Fabio Cuzzolin.
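Below is a minimal PyTorch sketch of one such recurrent layer; the 3x3 kernel size, ReLU activation, and skipped recurrence at the first frame are illustrative assumptions, not the authors' exact configuration.

import torch
import torch.nn as nn

class RCNLayer(nn.Module):
    def __init__(self, in_channels, out_channels):
        super().__init__()
        # (1) 2D spatial convolution applied to each frame independently
        self.spatial = nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1)
        # (2) 1x1 convolution applied to the hidden state across time
        self.hidden = nn.Conv2d(out_channels, out_channels, kernel_size=1)

    def forward(self, frames):
        # frames: (batch, time, channels, height, width)
        outputs, h = [], None
        for t in range(frames.size(1)):
            s = self.spatial(frames[:, t])  # spatial response of frame t
            # h_t depends only on h_{t-1} and the current spatial output,
            # so every output is causal
            h = torch.relu(s if h is None else s + self.hidden(h))
            outputs.append(h)
        # one output per input frame: temporal resolution is preserved
        return torch.stack(outputs, dim=1)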
End-to-End Video Captioning
We propose to optimise both encoder and decoder simultaneously in an end-to-end fashion. In a two-stage training setting, we first initialise our architecture using pre-trained encoders and decoders; the entire network is then trained end-to-end in a fine-tuning stage to learn the most relevant features for video caption generation. In our experiments, we use GoogLeNet and Inception-ResNet-v2 as encoders and an original Soft-Attention (SA-)LSTM as a decoder. Analogously to gains observed in other computer vision problems, we show that end-to-end training significantly improves over the traditional, disjoint training process. A sketch of the two-stage schedule is given below.
Silvio Olivastri, Gurkirt Singh, Fabio Cuzzolin.
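As a rough illustration of the two-stage schedule described above; the encoder/decoder modules, learning rates, and epoch counts here are placeholders, not the paper's settings.

import torch

def run_epochs(encoder, decoder, loader, loss_fn, opt, epochs):
    for _ in range(epochs):
        for frames, captions in loader:
            features = encoder(frames)            # per-frame visual features
            logits = decoder(features, captions)  # soft-attention LSTM decoding
            loss = loss_fn(logits, captions)
            opt.zero_grad()
            loss.backward()
            opt.step()

def train_two_stage(encoder, decoder, loader, loss_fn):
    # Stage 1: keep the pre-trained encoder frozen, train the decoder only
    for p in encoder.parameters():
        p.requires_grad = False
    opt = torch.optim.Adam(decoder.parameters(), lr=1e-4)
    run_epochs(encoder, decoder, loader, loss_fn, opt, epochs=10)

    # Stage 2: unfreeze the encoder and fine-tune the whole network
    # end-to-end, so the visual features adapt to caption generation
    for p in encoder.parameters():
        p.requires_grad = True
    opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-5)
    run_epochs(encoder, decoder, loader, loss_fn, opt, epochs=10)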
Predicting Action Tubes
We present a method to predict an entire 'action tube' in a trimmed video by observing only a smaller subset of the video. We propose a Tube Prediction network (TPnet) which jointly predicts the past, present and future bounding boxes along with their action classification scores. At test time, TPnet is used in a (temporal) sliding window setting, and its predictions are fed into a tube estimation framework to construct/predict video-long action tubes not only for the observed part of the video but also for the unobserved part. An illustrative sliding-window sketch is given below.
Gurkirt Singh, Suman Saha, Fabio Cuzzolin.
AHB - ECCVW 2018
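An illustrative sliding-window wrapper around such a network; tpnet and build_tubes below are hypothetical stand-ins for the trained model and the tube estimation framework, and the window length is arbitrary.

def predict_video_tubes(frames, tpnet, build_tubes, window=8, stride=1):
    detections = []
    for start in range(0, len(frames) - window + 1, stride):
        clip = frames[start:start + window]
        # TPnet jointly regresses past, present and future boxes plus
        # per-class action scores from the observed clip
        boxes, scores = tpnet(clip)
        detections.append((start, boxes, scores))
    # link the windowed predictions into video-long action tubes; the
    # predicted future boxes cover the as-yet-unobserved frames
    return build_tubes(detections)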
TraMNet - Transition Matrix Network for Efficient Action Tube Proposals
Current state-of-the-art methods solve spatio-temporal action localisation by extending 2D anchors to 3D-cuboid proposals on stacks of frames, to generate sets of temporally connected bounding boxes called action micro-tubes. Such cuboid proposals, however, implicitly assume that an action instance occupies roughly the same spatial location in every frame, which does not hold for dynamic actions. To avoid this problem we introduce a Transition-Matrix-based Network (TraMNet) which relies on computing transition probabilities between anchor proposals while maximising their overlap with ground-truth bounding boxes across frames, and enforcing sparsity via a transition threshold. As the resulting transition matrix is sparse and stochastic, this reduces the proposal hypothesis search space from O(n^f) to the cardinality of the thresholded matrix. A minimal sketch of this estimation step is given below.
Gurkirt Singh, Suman Saha, Fabio Cuzzolin.
ACCV 2018
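A minimal NumPy sketch of estimating the anchor transition matrix from ground-truth tubes, as the abstract describes; the threshold value and the IoU-based matching are illustrative choices.

import numpy as np

def iou(a, b):
    # intersection-over-union of two [x1, y1, x2, y2] boxes
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def transition_matrix(anchors, gt_tubes, threshold=0.05):
    # anchors: n anchor boxes; gt_tubes: per-frame box sequences of GT tubes
    n = len(anchors)
    counts = np.zeros((n, n))
    for tube in gt_tubes:
        for box_t, box_t1 in zip(tube[:-1], tube[1:]):
            # anchor with maximal overlap in each of two consecutive frames
            a = max(range(n), key=lambda i: iou(anchors[i], box_t))
            b = max(range(n), key=lambda i: iou(anchors[i], box_t1))
            counts[a, b] += 1
    # row-normalise into transition probabilities, then enforce sparsity:
    # only transitions above the threshold survive as proposal hypotheses
    probs = counts / np.maximum(counts.sum(axis=1, keepdims=True), 1)
    probs[probs < threshold] = 0.0
    return probs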
Online Real-time Multiple Spatiotemporal Action Localisation and Prediction
We present a method for multiple spatiotemporal action localisation, classification, and early prediction based on a single deep learning framework, which is able to operate online and under real-time constraints.
Gurkirt Singh, Suman Saha, Michael Sapienza, Philip Torr, Fabio Cuzzolin.
ICCV 2017
AMTnet: Action-Micro-Tube regression by end-to-end trainable deep architecture.
Dominant approaches provide sub-optimal solutions to the action detection problem, as they rely on seeking frame-level detections and constructing tubes from them. In this paper we radically depart from current practice, and take a first step towards the design and implementation of a deep network architecture able to classify and regress video-level micro-tubes.
Suman Saha, Gurkirt Singh, Fabio Cuzzolin.
ICCV 2017
Deep Learning for Detecting Multiple Space-Time Action Tubes in Videos
In this work we propose a new approach to the spatio-temporal
localisation (detection) and classification of multiple concurrent actions within
temporally untrimmed videos. We demonstrate the performance of our algorithm on the challenging
UCF101, J-HMDB-21 and LIRIS-HARL datasets, achieving new state-of-the-art results
across the board and significantly lower detection latency at test time.
Suman Saha, Gurkirt Singh, Michael Sapienza, Philip Torr, Fabio Cuzzolin.
BMVC 2016
Untrimmed Video Classification for Activity Detection: submission to ActivityNet Challenge
In this work we propose a simple, yet effective, method for the temporal detection of activities in temporally untrimmed videos with the help of untrimmed video classification. This method secured 2nd place in the activity detection task at the ActivityNet Challenge 2016 [Results].
Gurkirt Singh and Fabio Cuzzolin.
CVPR 2016 ActivityNet workshop, 2nd place in the detection task.
Continuous gesture recognition from articulated poses
This paper addresses the problem of continuous gesture recognition from articulated poses.
Unlike the common isolated recognition scenario, the gesture boundaries are here unknown,
and one has to solve two problems: segmentation and recognition.
This is cast into a labeling framework, namely every site (frame) must be assigned a label (gesture ID).
The inherent constraint for a piece-wise constant labeling is satisfied by solving a global optimization problem with a smoothness term; a sketch of such an energy is given below. This method secured 7th place in the gesture detection task of the ChaLearn LaP Challenge using only skeleton data.
Georgios Evangelidis, Gurkirt Singh, Radu Patrice Horaud.
ECCV 2014 workshop
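For intuition, the labeling objective can be written as a chain energy over frame labels; the notation below (data cost D_t, smoothness weight \lambda) is illustrative rather than the paper's exact formulation:

E(l_{1:T}) \;=\; \sum_{t=1}^{T} D_t(l_t) \;+\; \lambda \sum_{t=1}^{T-1} \mathbf{1}[\, l_t \neq l_{t+1} \,]

Here D_t(l_t) is the cost of assigning gesture label l_t to frame t, and the second term penalises label changes, which encourages a piece-wise constant labeling; such chain energies can be minimised exactly by dynamic programming.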
Skeletal Quads: Human action recognition using joint quadruples
We propose a local skeleton descriptor that encodes the relative position of joint quadruples. Such a coding implies a similarity normalisation transform that leads to a compact (6D or 5D) view-invariant skeletal feature, referred to as the skeletal quad. We use this descriptor in conjunction with a Fisher kernel in order to encode gesture or action (sub)sequences. The short length of the descriptor compensates for the large inherent dimensionality associated with Fisher vectors. A sketch of the quad computation is given below.
Georgios Evangelidis, Gurkirt Singh, Radu Patrice Horaud.
ICPR 2014
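A minimal NumPy sketch of the 3D quad computation, under the assumption that the similarity transform maps the first joint to the origin and the second to [1, 1, 1]; the minimal-rotation convention used to fix the remaining degree of freedom is an illustrative choice, not necessarily the paper's.

import numpy as np

def rotation_aligning(u, v):
    # minimal rotation taking unit vector u onto unit vector v (Rodrigues)
    w, c = np.cross(u, v), np.dot(u, v)
    if np.allclose(w, 0):
        if c > 0:
            return np.eye(3)  # already aligned
        # anti-parallel: rotate 180 degrees about any axis perpendicular to u
        p = np.cross(u, np.eye(3)[np.argmin(np.abs(u))])
        p /= np.linalg.norm(p)
        return 2.0 * np.outer(p, p) - np.eye(3)
    K = np.array([[0, -w[2], w[1]], [w[2], 0, -w[0]], [-w[1], w[0], 0]])
    return np.eye(3) + K + K @ K / (1.0 + c)

def skeletal_quad(x1, x2, x3, x4):
    # similarity-normalise four 3D joints so that x1 -> [0,0,0] and
    # x2 -> [1,1,1]; the mapped x3 and x4 form the compact 6D descriptor
    d = x2 - x1
    scale = np.sqrt(3.0) / np.linalg.norm(d)
    R = rotation_aligning(d / np.linalg.norm(d), np.ones(3) / np.sqrt(3.0))
    f = lambda x: scale * (R @ (x - x1))
    return np.concatenate([f(x3), f(x4)])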
Frame-wise representations of depth videos for action recognition
We present three types of depth data representation computed from depth frames, referred to as the single-reference representation, the multiple-reference representation and the quad representation.
Gurkirt Singh
Master thesis, INRIA and Grenoble Institute of Technology, France, 2013
Supervisors: Dr. Radu Horaud and Dr. Georgios Evangelidis
Categorizing Abnormal Behavior from an Indoor Overhead Camera
We propose an approach that uses an overhead camera to detect abnormal activities with the help of trajectory classification.
Gurkirt Singh
Bachelor thesis, University of Edinburgh and VIT University, 2010
Supervisor: Dr. Bob Fisher