Pre-defence workshop

On 28 August 2019, from 9:20, in room G205, Karlovo náměstí 13, Praha 2
Three talks given by the reviewers of Filip Radenovic's thesis:

9:20 - 10:05
Diane Larlus: Visual, semantic and cross-modal search
NAVER labs, Grenoble, France

10:05 - 10:20 COFFEE BREAK

10:20 - 11:05
John Collomosse: LiveSketch: Query Perturbations for Guided Sketch-based Visual Search
CVSSP, University of Surrey, UK

11:05 - 11:50
Josef Sivic: Estimating 3D Motion and Forces of Person-Object Interactions from
Monocular Video
INRIA Paris / CIIRC CTU in Prague

Abstracts of the talks follow:

Diane Larlus: Visual, semantic and cross-modal search

Abstract: Visual search can be formulated as a ranking problem where the goal is
to order a collection of images by decreasing similarity to the query. Recent
deep models for image retrieval have outperformed traditional methods by
leveraging ranking-tailored loss functions such as the contrastive loss or the
triplet loss. Yet, these losses do not optimize for the global ranking. In the
first part of this presentation, we will see how one can directly optimize the
global mean average precision, by leveraging recent advances in listwise loss
formulations. In a second part, the presentation will move beyond instance-level
search and consider the task of semantic image search in complex scenes, where
the goal is to retrieve images that share the same semantics as the query image.
Despite being more subjective and more complex, one can show that the task of
semantically ranking visual scenes is performed consistently across a pool of
human annotators, and that suitable embedding spaces can be learnt for this task
of semantic retrieval. The last part will focus on cross-modal search. More
specifically, we will consider the problem of cross-modal fine-grained action
retrieval between text and video. Cross-modal retrieval is commonly achieved
by learning a shared embedding space into which both modalities can be
embedded. In this last part, we will show how to enrich the embedding space by
disentangling parts-of-speech (PoS) in the accompanying captions.
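
To make the contrast between triplet-style and listwise objectives concrete, here is a hypothetical sketch (not the speaker's code, plain Python): a margin triplet loss scores one (anchor, positive, negative) triple at a time, while average precision evaluates the whole ranking at once, which is what the listwise formulations mentioned above optimize directly.

```python
def triplet_loss(d_pos, d_neg, margin=0.1):
    """Margin loss on one triplet: pull the positive closer than the negative."""
    return max(0.0, d_pos - d_neg + margin)

def average_precision(ranked_relevance):
    """AP of a ranked list; 1s mark relevant items, 0s irrelevant ones."""
    hits, precision_sum = 0, 0.0
    for rank, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / max(hits, 1)

# AP depends on the entire ordering, not on pairwise comparisons alone:
print(average_precision([1, 0, 1, 0]))  # relevant items ranked early -> higher AP
print(average_precision([0, 1, 0, 1]))  # same items ranked late -> lower AP
```

The difficulty the talk addresses is that AP, unlike the triplet loss, is piecewise constant in the scores and so has no useful gradient; listwise losses replace it with a smoothed, differentiable surrogate.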


John Collomosse: LiveSketch: Query Perturbations for Guided Sketch-based Visual Search

Abstract: Sketch is an intuitive medium for communicating visual concepts, and a
promising direction for re-imagining the search experience on mobile. Yet,
deriving a user’s search intent from a sketched query can be challenging due
to ambiguity – particularly when indexing tens of millions of images e.g.
Adobe Stock. I will present LiveSketch (presented at CVPR 2019), a sketch-based
visual search system that creates real-time visual augmentations to the query
sketch as it is drawn, making query generation an interactive rather than a
one-shot process. The technical contributions of LiveSketch include: extending
our state-of-the-art triplet CNN framework with an RNN-based variational
autoencoder to understand queries input in vector (stroke-based) as well as
raster form; real-time clustering to semantically group search suggestions; and
the use of an FGSM-like technique (commonly used to create ‘adversarial
examples’ that perturb images to change a classifier’s outcome) to generate the
interactive visual augmentations to the sketch.
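
The FGSM idea referenced above can be illustrated with a toy, hypothetical example (not the LiveSketch code): nudge an input by a small step in the direction of the sign of a score gradient. Here the "query" is a 2-D point and the gradient is analytic for a made-up quadratic score; LiveSketch applies the same principle to sketch strokes to steer them toward a search target.

```python
def score(x, target):
    """Toy score: negative squared distance to the target concept."""
    return -sum((xi - ti) ** 2 for xi, ti in zip(x, target))

def grad_score(x, target):
    """Analytic gradient of the toy score with respect to x."""
    return [-2.0 * (xi - ti) for xi, ti in zip(x, target)]

def fgsm_step(x, target, eps=0.1):
    """Move x by eps in the sign of the gradient, increasing the score."""
    g = grad_score(x, target)
    return [xi + eps * (1 if gi > 0 else -1 if gi < 0 else 0)
            for xi, gi in zip(x, g)]

x = [0.0, 0.0]
for _ in range(5):                 # a few guided perturbation steps
    x = fgsm_step(x, target=[1.0, -1.0])
print(x)  # drifts toward the target concept
```

In the adversarial-examples setting the same step is used to *decrease* a classifier's confidence; LiveSketch repurposes it constructively, as a guidance signal.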

Paper: J. Collomosse, T. Bui, and H. Jin. "LiveSketch: Query Perturbations for
Guided Sketch-based Visual Search". In Proc. CVPR, 2019.
PDF at


Josef Sivic: Estimating 3D Motion and Forces of Person-Object Interactions from
Monocular Video

Abstract: Understanding person-object interactions is a stepping stone towards
building autonomous machines that learn how to interact with the physical world
by observing people. In this talk, I will describe a method to automatically
reconstruct the 3D motion of a person interacting with an object from a single
RGB video. Our method estimates the 3D poses of the person and the object,
contact positions and forces, and torques actuated by the human limbs.
The main contributions of this work are three-fold. First, we propose an
approach to jointly estimate the motion and the actuation forces of the person
on the manipulated object by modeling contacts and the dynamics of their
interactions. This is cast as a large-scale trajectory optimization problem.
Second, we develop a method to automatically recognize from the input video the
position and timing of contacts between the person and the object or the ground,
thereby significantly simplifying the complexity of the optimization. Finally,
we validate our approach on a recent MoCap dataset with ground truth contact
forces and demonstrate results on a new dataset of Internet videos showing
people manipulating a variety of tools in unconstrained indoor/outdoor
environments.

Joint work with: Zongmian Li, Jiri Sedlar, Justin Carpentier, Ivan Laptev and
Nicolas Mansard.
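
A minimal 1-D sketch of the core idea, under strong simplifying assumptions (a point mass, a clean trajectory; this is not the talk's trajectory-optimization method): once motion is reconstructed, forces follow from the dynamics, here via finite-difference accelerations and Newton's second law.

```python
def forces_from_trajectory(positions, mass=1.0, dt=0.1):
    """Central-difference acceleration, then Newton's second law f = m * a."""
    forces = []
    for t in range(1, len(positions) - 1):
        accel = (positions[t + 1] - 2 * positions[t] + positions[t - 1]) / dt**2
        forces.append(mass * accel)
    return forces

# Constant-force motion x(t) = 0.5 * a * t^2 with a = 2.0:
xs = [0.5 * 2.0 * (k * 0.1) ** 2 for k in range(6)]
print(forces_from_trajectory(xs, mass=1.5))  # roughly constant, ~3.0 N
```

The talk's setting is much harder: 3-D articulated motion, unknown contacts, and noisy monocular observations, which is why the estimation is cast as a large-scale trajectory optimization rather than a direct inversion like this one.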

Responsible person: Petr Pošík