Saima Nazir presents Dynamic Spatio-Temporal Bag of Expressions (D-STBoE) Model for Human Action Recognition
On 2026-05-19 11:00:00 at G205, Karlovo náměstí 13, Praha 2
Human action recognition (HAR) has emerged as a core research domain for video
understanding and analysis, thus attracting many researchers. Although
significant results have been achieved in simple scenarios, HAR is still a
challenging task due to issues associated with view independence, occlusion and
inter-class variation observed in realistic scenarios. In previous research
efforts, the classical bag of visual words approach, along with its variations,
has been widely used. In this paper, we propose a Dynamic Spatio-Temporal Bag of
Expressions (D-STBoE) model for human action recognition without compromising
the strengths of the classical bag of visual words approach. Expressions are
formed based on the density of a spatio-temporal cube of a visual word. To
handle inter-class variation, we use a class-specific visual word representation
for visual expression generation. In contrast to the Bag of Expressions (BoE)
model, the formation of visual expressions is based on the density of
spatio-temporal cubes built around each visual word, because constructing
neighborhoods with a fixed number of neighbors could include non-relevant
information, making a visual expression less discriminative in scenarios with
occlusion and changing viewpoints. Thus, the proposed approach makes the model
more robust to the occlusion and changing-viewpoint challenges present in
realistic scenarios. Furthermore, we train a multi-class Support Vector Machine
(SVM) for classifying bags of expressions into action classes. Comprehensive
experiments on four publicly available datasets (KTH, UCF Sports, UCF11 and
UCF50) show that the proposed model outperforms existing state-of-the-art human
action recognition methods in terms of accuracy, reaching 99.21%, 98.60%, 96.94%
and 94.10%, respectively.
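
As a rough illustration of the cube-based neighborhood idea in the abstract, the
sketch below collects the neighbors of each interest point inside a fixed-size
spatio-temporal cube, rather than taking a fixed number of nearest neighbors.
It is not the authors' implementation: the function names, the cube half-size,
the toy points, and the choice to count (center word, neighbor word) pairs as
"expressions" are all assumptions made for illustration only.

```python
import numpy as np


def cube_neighbors(points, center_idx, half_size):
    """Return indices of points inside the axis-aligned (x, y, t) cube
    centred on points[center_idx]; the centre itself is excluded."""
    diff = np.abs(points - points[center_idx])
    inside = np.all(diff <= half_size, axis=1)
    inside[center_idx] = False
    return np.flatnonzero(inside)


def expression_histogram(points, words, n_words, half_size):
    """Count (center word, neighbor word) pairs over all cubes.
    Treating word pairs as expressions is a simplifying assumption."""
    hist = np.zeros((n_words, n_words), dtype=np.int64)
    for i in range(len(points)):
        for j in cube_neighbors(points, i, half_size):
            hist[words[i], words[j]] += 1
    return hist.ravel()


# Hypothetical toy data: (x, y, t) interest points and their visual word ids.
points = np.array(
    [[10, 10, 1], [12, 11, 2], [11, 9, 1], [80, 75, 30]], dtype=float
)
words = np.array([0, 1, 1, 2])
half_size = np.array([5.0, 5.0, 2.0])  # assumed cube half-extent in x, y, t

hist = expression_histogram(points, words, n_words=3, half_size=half_size)
```

In this toy example the isolated fourth point has an empty cube and contributes
no pairs, whereas a fixed-size nearest-neighbor rule would force it to pair with
distant, unrelated points. The resulting histogram could then be passed to a
multi-class SVM, as the abstract describes.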
[Saima Nazir has just started as a post-doc in Jan Kybic's group. The work to be
presented is the core of her PhD.]
External www: https://www.mdpi.com/1424-8220/19/12/2790