Multi-Stream Dynamic Video Summarization
Mohamed Elfeki1, Liqiang Wang2, and Ali Borji
1 Microsoft, 2 University of Central Florida
melfeki@microsoft.com, lwang@cs.ucf.edu, aliborji@gmail.com
Abstract
With vast amounts of video content being uploaded to the
Internet every minute, video summarization becomes critical for efficient browsing, searching, and indexing of visual content. Nonetheless, the spread of social and egocentric cameras creates an abundance of sparse scenarios
captured by several devices, and ultimately required to be
jointly summarized. In this paper, we discuss the problem of
summarizing videos recorded independently by several dynamic cameras that intermittently share the field of view. We
present a robust framework that (a) identifies a diverse set of
important events among moving cameras that often are not
capturing the same scene, and (b) selects the most representative view(s) at each event to be included in a universal
summary. Due to the lack of an applicable alternative, we
collected a new multi-view egocentric dataset, Multi-Ego.
Our dataset is recorded simultaneously by three cameras,
covering a wide variety of real-life scenarios. The footage
is annotated by multiple individuals under various summarization configurations, with a consensus analysis ensuring
a reliable ground truth. We conduct extensive experiments
on the compiled dataset in addition to three other standard
benchmarks that show the robustness and the advantage
of our approach in both supervised and unsupervised settings. Additionally, we show that our approach learns collectively from data with varying numbers of views and is orthogonal to other summarization methods, making it scalable
and generic.
1. Introduction
In a world where nearly everyone has several mobile cameras ranging from smart-phones to body-cameras,
brevity becomes no longer an accessory. It is rather essential to efficiently extract relevant contents from this immense array of static and moving cameras. Video summarization aims at selecting a set of frames from a visual sequence that contains the most important and representative
events. Not only is summarization useful for efficiently extracting the data substance, it also serves many other applications such as video indexing [14], video retrieval [50],
and anomaly detection [7].
Figure 1: Several views are recorded independently and intermittently overlap their fields-of-view. Our approach dynamically accounts for inter- and intra-view dependencies, providing a comprehensive summary of all views.
We consider a generic setting where multiple users
record egocentric footage that is both spatially and temporally independent. Users are allowed to move freely in
an uncontrolled environment. As such, cameras’ fields-of-view may or may not overlap throughout the sequence. Unlike fixed-camera videos, egocentric footage often displays
rapid changes in illumination, unpredictable camera motion, unusual composition and viewpoints, and often complex hand-object manipulations. Accordingly, the desired
summary should include a diverse set of events from all
viewpoints and resist the egocentric noise. Specifically,
there are two types of important events to be included in
the universal summary. First, events where multiple views
have a substantial overlap, in which the summary should include
the most representative view. Second, events that are spatially independent, in which each view is processed separately from the rest.
This setting presents itself in several real-life scenarios
where many egocentric videos are required to be summarized collectively. For instance, rising claims of police misconduct led to a proliferation of body-camera recordings
[45, 2]. Typical police patrols contain multiple officers
working 10-12 hour shifts. Although it is crucial to thoroughly inspect key details, manually going through 10-hour
video content is extremely challenging and prone to human
errors. Multiplying shift lengths by the number of officers
on duty, it is obvious that there are copious amounts of data
to analyze with no guiding index. A similar example occurs
at social events such as concerts, music shows, and sports
games. Those events tend to be recorded by several
cameras simultaneously that are dynamically changing their
fields-of-view. Nevertheless, the final highlight summary of
such events is likely to contain frames from all cameras.
Despite considerable progress in single-view video summarization for both egocentric and fixed cameras (e.g.,
[55, 38, 9, 29]), those techniques are not readily applicable
to summarizing multi-view videos. Single-view summarizers ignore the temporal order by processing simultaneously recorded views in a sequential order to fit as a single-view
input. This results in redundant and repetitive summaries
that do not exhibit the multi-stream nature of the footage.
At the other end of the spectrum, the literature on multi-view
video summarization mainly focuses on fixed surveillance
camera summarization (e.g., [35, 34]). This enables some
methods to rely on the geometric alignment of cameras to infer the relationship between their fields-of-view and use it to produce a representative summary (e.g., [1, 8]). Thus,
previous work mostly uses unsupervised methods that are
built on heuristic objective functions, which are not
suited to dynamic changes in the cameras’ geometric positioning. A key motivation for our work is to generalize
the multi-stream summarization to accommodate dynamic
cameras and extend the capacity of existing supervised and
unsupervised summarization techniques.
Contributions. We extend single-view and fixed-camera methods to the generalized multi-stream, dynamic-camera setting. We propose Multi-DPP, a new adaptation of the widely used Determinantal Point Process
(DPP) [55, 29, 9, 42] that generalizes it to accommodate the multi-stream setting while maintaining the temporal
order. Our approach is orthogonal to other summarization
approaches and can be embedded with fixed or moving cameras and operate in a supervised or unsupervised setting. Furthermore, our method is shown to be scalable (it can
be trained on labels of any available number-of-views in the
supervised setting) and generic (it encompasses both the single-view and fixed-camera settings as special cases). Since
no existing dataset is readily applicable to evaluate such a
setting, we collect and annotate a new dataset, Multi-Ego.
With extensive experiments, we show that our method outperforms state-of-the-art supervised and unsupervised baselines
on our generic configuration as well as the special case
of fixed-camera multi-view summarization.
2. Related Work
Single-View Video Summarization. Among the many approaches proposed for summarizing single-view videos, supervised approaches usually achieve the best performance. In such a setting, the purpose is to simulate the patterns that people exhibit when performing the summarization task, by using human-annotated summaries. Two
factors influence the performance of supervised models:
(a) the reliability of annotations, and (b) the framework’s modeling
capability. The reliability of annotations is evaluated with a consensus analysis, as in several benchmark
datasets [27, 43, 22]. As for the modeling capabilities, supervised approaches vary in their modeling complexity and
effectiveness [9, 12, 54, 11, 53, 6].
Recently, [40] proposed to use convolutional sequences
to summarize videos in both supervised and unsupervised
settings. By formulating the problem as a sequence labeling problem, they established a connection between semantic segmentation and video summarization and used networks trained on the former to improve the latter. Others
have formulated the summarization problem within a reinforcement learning paradigm, either with an explicit classification reward as in [57] or a more subtle diversity-representativeness reward [58]. Both approaches provided
relatively competitive results on single-view videos; nonetheless,
they suffer from unstable training in the multi-view setting,
as we detail in the experiments section.
Recurrent Neural Networks in general, and Long Short-Term Memory (LSTM) [13] in particular, have been widely
used in video processing to obtain the temporal features
in videos [47, 33, 59, 26]. In recent years, using
LSTMs has been a common practice to solve the video summarization problem [15, 44, 51, 56, 52, 25, 4]. For example, Zhang et al. [55] use a mixture of Bi-directional
LSTMs (Bi-LSTM) and Multi-Layer Perceptrons to summarize single-view videos in a supervised manner. They maximize the likelihood of the Determinantal Point Process (DPP)
measure [21, 10, 48] to enforce diversity within the selected
summary. Also, Mahasseni et al. [29] present a framework
that adversarially trains LSTMs, where the discriminator is
used to learn a discrete similarity measure for training the
recurrent encoder/decoder and the frame selector LSTMs.
Multi-view Video Summarization. Most multi-view
summarization methods tend to rely on feature selection in
an unsupervised optimization paradigm [32, 34, 35, 39,
30]. Fu et al. [8] introduce the problem of multi-view video
summarization as tailored for fixed surveillance cameras.
They construct a spatiotemporal graph and formulate the
problem as a graph-labeling task. Similarly, in [35, 34, 30], the
authors assume that cameras in a surveillance camera network have a considerable overlap in their fields-of-view.
Therefore, they apply well-crafted objective functions that
learn an embedding space and jointly optimize for a succinct representative summary. Since those approaches target fixed surveillance cameras, they rightfully assume a significant correlation among the frames along the same view
over time. In our generalized setting, cameras move dynamically and contain rapid changes in the field-of-view,
rendering the aforementioned assumption weak and making
the problem harder to solve.
Multi-Video Summarization. Unlike multi-view, multi-video summarization [49, 24] focuses on spatio-temporally independent
videos, which can thus be processed individually. The key
challenge is scaling the framework to a large number of
input videos. [16] formulated the problem as finding the
dominant sets in a hypergraph and then refining these keyframe
candidates using web images of the same query. Recently, [17] proposed a similar method that differs in using
a multi-modal weighted archetypal analysis instead of a hypergraph as the structure over the large number of web videos.
3. Multi-Ego: A new multi-view egocentric
summarization dataset
While a number of multi-view datasets exist (e.g., [8,
32]), none of them is recorded from an egocentric perspective.
Therefore, we collect our own data that aligns with the established problem setting. We asked three users to independently collect a total of 12 hours of egocentric videos while
performing different real-life activities. The data covers various
uncontrolled environments and activities. We also ensured
that the data presents different levels of interaction among the individuals: (a) two views interacting while the third one is independent, (b) all views interacting with each other, and (c)
all views independent of each other. Then, we extracted 41
different sequences that vary in length from three to seven
minutes. Each sequence contains three views covering a variety of indoor and outdoor activities. We made the data
more accessible for training and evaluation by grouping the
sequences into 6 different collections.
To put our dataset size (41 videos of 3-7 minutes) in perspective, we refer to the most commonly used summarization benchmarks: SumMe (25 videos of 2-4 minutes), TVSum (50 videos of 2-4 minutes) [43], Office (4 videos of
11 minutes), Lobby (3 videos of 8 minutes) and Campus (4
videos of 15 minutes) [8, 32]. Although collecting
a larger number of longer videos is desirable, annotating simultaneously collected views by several annotators is a notoriously hard task. In the following section, we
shed some light on the difficulties encountered in that task
and propose an annotating-in-stages approach to reduce the
annotation uncertainty. More details about data collection
and a behavioral analysis on the obtained annotations are
provided in supplementary materials.
3.1. Collecting User Annotations
To annotate and process the data for the summarization
task, we sub-sample the videos uniformly to one fps following [42]. Then, every three consecutive frames are combined to construct a shot for an easier display to annotators.
The number of frames per shot was chosen empirically to
maintain a consistent activity within one shot.
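The pre-processing described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names and the 30 fps native frame rate are assumptions made for the example, and only the one-fps target and the three-frames-per-shot grouping come from the text.

```python
def subsample_indices(num_frames, native_fps, target_fps=1):
    """Indices of the frames kept when sub-sampling uniformly to target_fps."""
    step = round(native_fps / target_fps)
    return list(range(0, num_frames, step))

def group_into_shots(frame_indices, frames_per_shot=3):
    """Group consecutive sub-sampled frames into fixed-size shots."""
    return [frame_indices[i:i + frames_per_shot]
            for i in range(0, len(frame_indices), frames_per_shot)]

# Hypothetical 10-second clip recorded at an assumed 30 fps.
frames = subsample_indices(num_frames=300, native_fps=30)
shots = group_into_shots(frames)
```

The last shot may hold fewer than three frames when the number of sub-sampled frames is not a multiple of three; how the authors handle that case is not stated.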
We asked five human annotators to perform a three-stage
annotation task. In stage one, they were asked to choose the
most interesting and informative shots that represent each
view independently without any consideration towards the
other views. To construct two-view summaries in stage two,
we only displayed the first two views simultaneously, while
asking the users to select the shots from any of the two
views that best represent both cameras. Similar to stage
two, in stage three the users were asked to select shots
from any of the three views that best represent all the cameras. It is worth noting that the annotators were not limited
to choosing only one view of a certain shot, and they could
choose as many as they deemed important.
The annotating-in-stages procedure explained above
was employed because humans have a limited capability to
keep track of unfolding storylines along multiple views
simultaneously. Using this technique resulted
in a significant improvement in the consensus between user
summaries compared to when we initially collected summaries in an unordered annotation task.
3.2. Analyzing User Annotations
To ensure the reliability and consistency of the obtained
annotations, we perform a consensus analysis using two
metrics: average pairwise f1-measure and selection ratio.
Following [43, 42, 38], we compute the average pairwise
f1-measure to estimate the frame-level overlap and agreement. We calculated the f1-measure for all possible pairs
of users’ annotations and averaged the results across all the
pairs, obtaining an average of 0.803, 0.762, and 0.834 for
the first, second, and third stage respectively.
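As an illustration of the consensus metric, the sketch below computes the average pairwise F1-measure over a set of annotations. It assumes each annotation is represented as a set of selected frame indices; the helper names and the toy data are hypothetical.

```python
from itertools import combinations

def f1_score(a, b):
    """F1 overlap between two selections given as sets of frame indices."""
    a, b = set(a), set(b)
    overlap = len(a & b)
    if overlap == 0:
        return 0.0
    precision = overlap / len(a)
    recall = overlap / len(b)
    return 2 * precision * recall / (precision + recall)

def average_pairwise_f1(annotations):
    """Mean F1 over all unordered pairs of annotators."""
    pairs = list(combinations(annotations, 2))
    return sum(f1_score(a, b) for a, b in pairs) / len(pairs)

# Hypothetical toy annotations from three annotators.
toy = [{0, 1, 2, 3}, {1, 2, 3, 4}, {0, 1, 2, 3}]
consensus = average_pairwise_f1(toy)
```

Because F1 is symmetric in its two arguments, averaging over unordered pairs gives the same value as averaging over ordered pairs.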
3.3. Creating Oracle Summaries
Finally, training a supervised method usually requires a
single set of labels. That means in our case, we need to
use only one summary per video, which is often referred
to as Oracle Summary. To create an oracle summary using
multiple human-created summaries, we follow [9, 20] to
greedily choose the shot that results in the largest marginal
gain on the f-score, and iteratively keep repeating the greedy
selection until the length of the summary reaches 15% of the
single-view length.
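The greedy construction can be sketched as below. This is a simplified reading of the procedure, assuming that the f-score of a candidate is averaged over all user summaries and that shots are identified by integer indices; the exact scoring used in [9, 20] may differ, and the function names are hypothetical.

```python
def mean_f1(candidate, user_summaries):
    """Average F1 between a candidate shot set and each user summary."""
    total = 0.0
    for user in user_summaries:
        overlap = len(candidate & user)
        if overlap:
            p, r = overlap / len(candidate), overlap / len(user)
            total += 2 * p * r / (p + r)
    return total / len(user_summaries)

def greedy_oracle(num_shots, user_summaries, budget_ratio=0.15):
    """Add the shot with the largest marginal F1 gain until the budget is met."""
    budget = max(1, round(budget_ratio * num_shots))
    oracle = set()
    while len(oracle) < budget:
        remaining = [s for s in range(num_shots) if s not in oracle]
        best = max(remaining, key=lambda s: mean_f1(oracle | {s}, user_summaries))
        oracle.add(best)
    return oracle

# Hypothetical user summaries over a 20-shot view.
users = [{2, 5, 11}, {2, 5, 17}, {2, 11, 17}]
oracle = greedy_oracle(num_shots=20, user_summaries=users)
```

Maximizing the score of `oracle | {s}` is equivalent to maximizing the marginal gain, since the current score is the same for every candidate shot.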
4. Approach
4.1. Determinantal Point Process (DPP)
DPP is a probabilistic measure that provides a tractable
and efficient means to capture negative correlation with respect to a similarity measure [28, 21]. Formally, a discrete point process $P$ on a ground set $\mathcal{Y}$ is a probability measure on the power set $2^N$, where $N = |\mathcal{Y}|$ is the
ground set size. A point process $P$ is called determinantal
if $P(Y = y) \propto \det(L_y)$; $\forall y \subseteq \mathcal{Y}$. $Y$ is the selection random variable sampled according to $P$ and $L$ is a symmetric
positive semi-definite matrix representing the kernel.
Kulesza et al. [19] proposed modeling the marginal kernel $L$ as a Gram matrix in the following
manner:
$$P(Y = y) \propto \det(\Phi_y^\top \Phi_y) \prod_{i \in y} q_i^2, \qquad (1)$$
When optimizing the DPP kernel, this decomposition learns
a “quality score” $q_i \geq 0$ for each item. It also allows learning a feature representation $\Phi_y$ of the subset $y \subseteq \mathcal{Y}$. In
this case, $\Phi_y = [\phi_i | \ldots | \phi_j]$, where the dot product $\phi_i^\top \phi_j \in
[-1, 1]$; $\forall i, j \in y$ is evaluated as a “pair-wise similarity
measure” between the features of item $i$, $\phi_i$, and the features
of item $j$, $\phi_j$. Thus, the DPP marginal kernel $L_y$ can be used
to quantify the diversity within any subset $y$ selected from
a ground set $\mathcal{Y}$. Choosing a diverse subset is equivalent
to choosing a brief representative subset since the redundancy is being minimized. Hence, it is only natural that a considerable
number of document and video summarization approaches
use this measure to extract representative summaries of documents and videos [20, 29, 9, 48].
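A small numerical sketch may help illustrate the quality-diversity decomposition of Eq. 1. It assumes unit-norm feature columns so that dot products lie in $[-1, 1]$; the function names and the toy features are hypothetical and not taken from the paper.

```python
import numpy as np

def dpp_kernel(features, quality):
    """L_ij = q_i * (phi_i . phi_j) * q_j, with unit-norm feature columns."""
    phi = features / np.linalg.norm(features, axis=0, keepdims=True)
    similarity = phi.T @ phi              # pairwise similarities in [-1, 1]
    return quality[:, None] * similarity * quality[None, :]

def unnormalized_prob(kernel, subset):
    """det(L_y), proportional to the probability of selecting `subset`."""
    idx = np.asarray(subset)
    return np.linalg.det(kernel[np.ix_(idx, idx)])

# Columns are items: items 0 and 1 are near-duplicates, item 2 is different.
features = np.array([[1.0, 0.99, 0.0],
                     [0.0, 0.14, 1.0]])
quality = np.array([0.9, 0.9, 0.9])
L = dpp_kernel(features, quality)
redundant = unnormalized_prob(L, [0, 1])
diverse = unnormalized_prob(L, [0, 2])
```

With equal qualities, the subset of dissimilar items receives the larger determinant, which is the negative-correlation behavior the section describes.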
4.2. Adapting DPP to Multi-stream: Multi-DPP
The standard DPP process described above is suitable
for selecting a diverse subset from a single ground set.
However, when presented with several temporally-aligned
ground sets $\{\mathcal{Y}_1, \mathcal{Y}_2, \ldots, \mathcal{Y}_M\}$, the standard process can only
be applied in one of two settings: either (a) merging all
the ground sets into a single ground set $\mathcal{Y}^{merge} = \mathcal{Y}_1 \cup
\mathcal{Y}_2 \cup \ldots \cup \mathcal{Y}_M$ and selecting a diverse subset out of the
merged ground set, or (b) selecting a diverse subset from
each ground set and then merging all the selected subsets
$y^{merge} = y_1 \cup y_2 \cup \ldots \cup y_M$.
Even though the former setting preserves the information of all elements of the ground sets, it causes the
complexity of the subset selection to grow exponentially. In
practice, this leads to an accumulation of error due to overflow and underflow computations as well as a substantially
slower running time. Additionally, the latter setting assumes
no intersection between the features of the different ground sets. This is essentially inapplicable if the ground sets have
a significant dynamic feature overlap, leading to redundancy and compromising the very purpose of the DPP. To
address these shortcomings, we propose a new adaptation
of Eq. 1, called Multi-DPP.
In Multi-DPP, ground sets are processed in parallel, allowing any potential feature overlap across the ground sets
to be processed in the appropriate temporal order while keeping a linear growth with respect to the number of streams. For every
element in the ground sets, we need to represent two joint
quantities, features and quality, such that they satisfy the
following four characteristics. First, we need a model that
can operate on any number of streams (i.e., generic to any
number of ground sets $M$). Second, we need a joint representation of the features at each index, such that it only
selects the most effective ones (i.e., invariance to noise and
non-important features). Third, we need a joint representation of the qualities at each index, such that it is affected by
the quality of each ground set at a particular index (i.e., variance to the quality of each ground set). Fourth, we need to
ensure that our adaptation follows the DPP decomposition
in Eq. 1, by selecting joint features $\phi_i^\top \phi_j \in [-1, 1]$ and
joint qualities $q_i \geq 0$; $\forall i, j \in y$.
To account for joint features, we apply max-pooling,
choosing the most effective features across all ground sets
at every index, which satisfies the feature decomposition in
Eq. 1. Selecting joint qualities, on the other hand, needs
to account for the quality of each ground set at every index. We use the product of all the qualities at each index.
This makes the joint quality at each index dependent
on all ground sets while also ensuring $q^m \leq 1$. Therefore,
we generalize the Determinantal Point Process based on the
decomposition in Eq. 1 as follows:
$$P(Y = y) \propto \det(\Phi_y^\top \Phi_y) \prod_{m=1}^{M} \prod_{i \in y_m} [q_i^m]^2 \qquad (2)$$
$$\phi_j = \max(\phi_j^1, \ldots, \phi_j^M); \quad \forall j \in y$$
where $M$ is the number of ground sets and $y_m$ is the
subset selected from ground set $m$. This decomposition allows both a scalable multi-stream formulation (by constructing a joint
feature representation with max-pooling) and monitoring of
the egocentric-introduced noise (by learning an independent
quality measure for each view at each time-step).
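The following sketch evaluates the unnormalized score of Eq. 2 for a set of selected time-steps. It makes two simplifying assumptions that are not stated in the text: the same time-steps are selected in every view (so the joint quality at an index is the product of the per-view qualities), and the max-pooled features are L2-normalized to keep dot products in $[-1, 1]$. The function name and the random inputs are hypothetical.

```python
import numpy as np

def multi_dpp_score(view_features, view_quality, subset):
    """Unnormalized Multi-DPP score of the time-steps in `subset`.

    view_features: (M, N, D) per-view features.
    view_quality:  (M, N) per-view qualities in [0, 1].
    """
    joint = view_features.max(axis=0)                              # max-pool over views: (N, D)
    joint = joint / np.linalg.norm(joint, axis=1, keepdims=True)   # assumed L2 normalization
    joint_quality = view_quality.prod(axis=0)                      # product over views: (N,)
    idx = np.asarray(subset)
    phi = joint[idx]                                               # (|y|, D)
    gram = phi @ phi.T                                             # pairwise similarities
    return np.linalg.det(gram) * np.prod(joint_quality[idx] ** 2)

rng = np.random.default_rng(0)
feats = rng.random((3, 6, 4)) + 0.1           # M=3 views, N=6 time-steps, D=4
qual = rng.uniform(0.5, 1.0, size=(3, 6))
score = multi_dpp_score(feats, qual, [0, 2, 5])
single = multi_dpp_score(feats[:1], qual[:1], [0, 2, 5])   # M = 1 case
```

With a single view, the max over views and the product over views are both identities, so the score reduces to the standard decomposition of Eq. 1.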
Summarizing videos using Multi-DPP. Since the Multi-DPP formulation of Eq. 2 does not require any extra supervisory signals, it can be adapted into an optimization objective
for both supervised and unsupervised training. In particular, we follow [21] in defining the similarity measure of supervised summarization approaches based on a Maximum
Likelihood Estimation of the Multi-DPP measure with respect to the ground-truth labels as follows:
$$\theta^* = \arg\max_\theta \sum_i \log P\big(Y^{(i)} = y^{(i)*}; L^{(i)}(\theta)\big) \qquad (3)$$
where θ is the set of supervised parameters, y ∗ is the target
subset (i.e., ground-truth) and i indexes training examples.
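To make the likelihood term concrete, the sketch below evaluates the log-probability of a target subset under a DPP kernel. It relies on the standard L-ensemble normalizer $\det(L + I)$, which is a known property of DPPs rather than something spelled out in the text, and it uses a random positive definite kernel as a stand-in for the learned $L(\theta)$.

```python
import numpy as np

def dpp_log_likelihood(kernel, target):
    """log P(Y = target) = log det(L_target) - log det(L + I)."""
    idx = np.asarray(target)
    _, logdet_subset = np.linalg.slogdet(kernel[np.ix_(idx, idx)])
    _, logdet_norm = np.linalg.slogdet(kernel + np.eye(kernel.shape[0]))
    return logdet_subset - logdet_norm

rng = np.random.default_rng(0)
basis = rng.standard_normal((4, 4))
example_kernel = basis @ basis.T + 0.1 * np.eye(4)   # symmetric positive definite stand-in
loss = -dpp_log_likelihood(example_kernel, [0, 2])   # negative log-likelihood to minimize
```

Working with `slogdet` instead of `det` avoids overflow and underflow when the kernel is large, which is relevant given the numerical issues mentioned for merged ground sets.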
For unsupervised summarization, we define the Multi-DPP loss based on a diversity regularization introduced in
[29] that aims to only increase diversity since no summary
labels are being provided.
$$\theta^* = \arg\max_\theta \log P\big(Y; L^{(i)}(\theta)\big) \qquad (4)$$
where θ is the set of unsupervised parameters.
Figure 2: Multi-DPP is applied to increase diversity within the
selected time-steps. When view labels are available, we also use
cross-entropy to learn representative view(s) at each time-step.
Finally, we note that our supervised and unsupervised
adaptations are orthogonal to other summarization approaches and can be embedded to allow any DPP-based approach (e.g., [55, 29, 3, 41, 5]) to summarize multi-stream
data while preserving the temporal order and monitoring
the quality of a dynamic input. Additionally, Multi-DPP
is equivalent to the standard DPP decomposition in Eq. 1
when M = 1 in Eq. 2. This renders the Multi-DPP summarization approach a generalization of the standard single-view DPP summarization approaches, as well as orthogonal
to other summarization approaches, allowing them to process multi-stream data in a proper temporal order. The discussed theoretical advantage of such generalization is
further analyzed empirically in Section 5.3.
4.3. Summarization Framework
Figure 2 shows the input as M independent views, with
N frames at each view. We follow [55, 29, 3, 41] in constructing features of each frame across the streams. First,
spatial features are extracted from each frame at each view
using a pre-trained CNN. Then, spatial features are temporally processed using a Bidirectional LSTM layer. By
aggregating both spatial and temporal features, we obtain a
comprehensive spatio-temporal feature vector of each frame
at each view. We choose to share the weights of the Bi-LSTM layer across the views for two reasons: (a) it allows
the system to operate on any number of views without increasing the number of trainable parameters which alleviates overfitting, and (b) learning temporal features is independent of the view, thus it utilizes data from all views to
produce better temporal modeling.
We break down our objective into two tasks: selecting
diverse events and identifying the view(s) contributing to
illustrating each selected event in the summary. In the first task,
to select diverse events, we construct a feature set accounting for all the views at each time-step. We do so by max-pooling the spatio-temporal features from all the views,
resulting in the most prominent feature at each index of
the feature vector. We follow max-pooling by a two-layer
Multi-Layer Perceptron (MLP) that applies non-linear activation on joint features that are represented as Φ in Eq. 2.
The second task, however, is used to identify the most
representative view(s) at each event. We use a two-layer
MLP that classifies each view at each time step. Formulating this task as a classification problem serves three purposes. First, it selects the views that are included in the
summary, which is an intrinsic part of the solution. Second,
it regularizes the process of learning the importance of each
event by not selecting any view when the time-step is non-important. Finally, the classification confidence of view m
can be used to represent the quality ($q_n^m$) at time-step n.
This is later used to compute the Multi-DPP measure that
determines which time-steps are selected. In the case of
non-overlapping views, the framework may need to select
multiple views at the same time-step. Therefore, we conduct an independent view classification by applying binary
classification, which allows classifying each view independently from the rest.
Similar to the weights of the Bi-LSTM, the view classifier MLP weights are also shared across the views for two
reasons. First, it uses the same number of trainable parameters for data with any number of views, resulting in fewer trainable parameters, which limits the problem of overfitting to
training data. Second, it establishes a view-dependent classification. That is, at any time-step, choosing a representative view among all the views is affected by the relative
quality of all the views, rather than each one independently.
During training, we start by estimating the quality $q_n^m$ of
each view m at each time-step n, which serves as the view
selection. Then we evaluate the Multi-DPP measure by merging the computed $q_n^m$ with the joint features Φ as in Eq. 2.
In our supervised setting, we optimize the view(s) selection procedure
by using the binary cross-entropy objective: $-\frac{1}{M} \sum_{m=1}^{M} \sum_{n=1}^{N} y_n^m \log(p_n^m)$; where $y_n^m$, $p_n^m$ are the
ground truth and the model’s prediction for time-step n in
view m. We jointly optimize the framework by minimizing
the sum of the cross-entropy loss and the negative log-likelihood of Eq. 3, using
the Oracle summary as the ground truth in the supervised
setting. In the unsupervised setting, view selection weights
are only learned by learning the quality $q_n^m$ from the Multi-DPP measure, and we only optimize the Multi-DPP loss criterion in Eq. 4.
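The view-selection objective can be illustrated with a short sketch. It implements the expression as printed in the text, which contains only the positive-label term, normalized by the number of views; the function name, the small `eps` added for numerical safety, and the toy labels are assumptions of this example.

```python
import numpy as np

def view_selection_loss(labels, probs, eps=1e-12):
    """-(1/M) * sum_m sum_n y[m, n] * log(p[m, n]) for (M, N) arrays."""
    num_views = labels.shape[0]
    return -np.sum(labels * np.log(probs + eps)) / num_views

# Hypothetical labels for M=2 views and N=3 time-steps.
labels = np.array([[1, 0, 1],
                   [0, 0, 1]], dtype=float)
confident = np.array([[0.9, 0.1, 0.8],
                      [0.2, 0.1, 0.9]])
unsure = np.array([[0.2, 0.1, 0.3],
                   [0.2, 0.1, 0.4]])
loss_confident = view_selection_loss(labels, confident)
loss_unsure = view_selection_loss(labels, unsure)
```

Predictions that assign high confidence to the labeled views yield the lower loss. In the supervised setting this term would be added to the negative of the Multi-DPP log-likelihood.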
Lastly, while input views are not required to be temporally aligned, they are assumed to have timestamps. This is
a commonly held assumption in previous multi-view literature (e.g., [8, 18]) due to its default presence in nearly all
modern recording devices. If given non-aligned views, our
framework can process any number of views at each time-step since the weights of the Bi-LSTM and the MLPs are
shared among the views.
4.4. Multi-view supervised scalability
Supervised summarization methods tend to generalize better than unsupervised
ones, e.g., [9, 38, 55, 29]. Relying on human-annotated labels allows learning generic behavioral patterns instead of
customized heuristics as in most unsupervised approaches.
Nonetheless, supervision requires an abundance of labeled
training data. Thus, a crucial concern of a multi-view supervised system is to be scalable in order to utilize all available
forms of labels for an improved performance.
Unsupervised systems do not face this challenge since
they do not utilize labels.
In particular, a scalable multi-view video summarizer is
invariant to view order and number-of-views, and therefore
can learn from any data regardless of those properties. First,
invariance to view order implies producing the same summary for input views $(v_i, v_j, v_k)$ as for $(v_j, v_i, v_k)$; $\forall i, j, k \in
\{1, 2, \ldots, M\}$, for all possible permutations of $(i, j, k)$. Our
approach satisfies this requirement by constructing joint features via max-pooling. Thus, the summary is only shaped
by the most effective features regardless of view order.
The second condition, invariance to number-of-views,
entails the ability to train on data with varying numbers of views and test on data with any number of views. Satisfying this condition requires the number of trainable parameters to be independent of the number of views in the
input. This way, the same set of parameters can be used
to train/test on data with any number of views. We followed two techniques to ensure a fixed number of trainable
parameters: (a) max-pooling view-specific features, and
(b) weight sharing for the Bi-LSTM and view selection layers. First, applying max-pooling on view-specific features produces a fixed-size joint feature vector that is independent of the number of views in the input. Additionally, choosing the prominent features across views entails
learning intra-view dependencies. Second, weight sharing
across Bi-LSTM view-streams and view selection layers ensures our framework has a single set of trainable parameters
for each of those layers regardless of the number of views.
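Both invariance properties of the max-pooling step can be checked numerically. The sketch below uses random arrays in place of the learned spatio-temporal features; the shapes and the function name are illustrative assumptions.

```python
import numpy as np

def joint_features(view_features):
    """Element-wise max over the view axis: (M, N, D) -> (N, D)."""
    return view_features.max(axis=0)

rng = np.random.default_rng(1)
views = rng.standard_normal((3, 5, 8))      # M=3 views, N=5 time-steps, D=8
permuted = views[[1, 0, 2]]                 # swap the first two views

# Invariance to view order: the pooled features are identical.
same_under_permutation = np.array_equal(joint_features(views),
                                        joint_features(permuted))

# Invariance to number-of-views: the output shape does not depend on M.
two_view_shape = joint_features(views[:2]).shape
three_view_shape = joint_features(views).shape
```

Because the pooled output always has shape (N, D), any layer that consumes it keeps the same number of parameters for two, three, or more views.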
5. Experiments and Results
5.1. Baseline Methods
Since our supervised approach is the first supervised
multi-view summarization method, we could not compare
with other supervised multi-view approaches. Nonetheless, we compare our criterion with supervised and unsupervised single-view, and unsupervised multi-view summarization methods.
Additionally, we include Reinforcement Learning
baselines that showed competitive performance on single-view videos.
To apply the single-view configuration on multi-view
videos, we examine two settings:
• Merge-Views: Aggregating views then summarizing
aggregate footage using a single-view summarizer.
Summary is consistent if the views are independent.
• Merge-Summaries: Summarizing each view independently and then aggregating the summaries. Complementary to the former setting, this should result in a
consistent summary if the summaries are independent.
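The two settings above can be sketched with a stand-in single-view summarizer. Everything here is illustrative: `top_k_summarizer` is a hypothetical placeholder that simply keeps the highest-scoring frames, whereas the actual baselines are learned models.

```python
def top_k_summarizer(scores, k=2):
    """Hypothetical single-view summarizer: keep the k highest-scoring frames."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])

def merge_views(view_scores, summarize):
    """Concatenate the views one after another, then summarize once."""
    merged = [s for view in view_scores for s in view]
    length = len(view_scores[0])
    return [(i // length, i % length) for i in summarize(merged)]

def merge_summaries(view_scores, summarize):
    """Summarize each view separately, then pool the selections."""
    return [(m, n) for m, view in enumerate(view_scores) for n in summarize(view)]

# Two views of 10 time-steps; both capture the same event at time-step 3.
view_a = [0.1] * 10
view_b = [0.1] * 10
view_a[3], view_b[3], view_b[7] = 0.9, 0.8, 0.5
mv = merge_views([view_a, view_b], top_k_summarizer)        # (view, time-step) pairs
ms = merge_summaries([view_a, view_b], top_k_summarizer)
```

In this toy case both settings select time-step 3 from both views, which illustrates how overlapping views can lead to redundant selections when views are not modeled jointly.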
In our experiments, we observed that the supervised version of Convolutional Sequences [40] tends to diverge when
using the Merge-Summaries method in training due to the relatively
short videos in that case. Thus, we compare with the more
reliable Merge-Views version. On the contrary, reinforcement learning methods [57, 58] tend to be unstable with
Merge-Views due to the long sequential input, where the reward is usually far away from the start of the sequence and
may thus lead to vanishing gradients. So, we compare
with the Merge-Summaries variant, where the reward
function tends to be more stable. The instability observed
in training the baselines further motivates
developing an objective like ours, which is designed to be
independent of the number of views, making it tractable during
training/testing when the number of views is large, while at
the same time incorporating the information from all views
and preserving temporal ordering.
5.2. Experimental Setup
We use GoogLeNet [46] features for all the methods
as an input. For a fair comparison, we train all supervised
baselines [12, 55] and Ours with the same experimental
setup: iterations number, batch size, and optimization. We
note that all neural-network models have the same architecture (same number of trainable parameters) and only differ
in the objective function and their training strategy to ensure
a fair comparison.
The supervised frameworks are trained for twenty iterations with a batch size of 10 sequences. The Adam optimizer is
used to optimize the losses with a learning rate of 0.001. After each iteration, we calculate the mean validation loss and
only evaluate the model with the best validation loss across
all iterations. We discuss further details of the architecture
and training in the supplementary materials.
                                                     Two-View                     Three-View
Method                                     Precision    Recall  F1-Score Precision    Recall  F1-Score
Random Baseline
  Uniform Sampling                              9.83     10.65      9.85      5.83      5.16      5.77
Unsupervised & Sub-modular, Multi-View
  Feature selection [31]                       17.83     19.15     17.46     12.33     16.28     10.70
  Joint embedding [34]                         18.37     25.20     20.66     13.88     24.85     17.17
  Unpaired Data [39]                           21.26     22.16     21.81     19.62     19.93     19.41
  Sub-modular [12]                             19.91     25.21     22.71     18.49     22.71     20.19
Unsupervised, Single-View
  Adversarial [29]: Merge-Views                21.16     23.42     22.35      20.2     18.94     19.76
  Adversarial [29]: Merge-Summaries            20.61     22.05     21.12     19.32     18.24     18.96
  Convolutional [40]: Merge-Views              21.05     22.92     22.26     19.86     20.68     20.13
  Convolutional [40]: Merge-Summaries          20.64     22.34     21.87     16.52     20.47     18.91
Ours-unsupervised
  Multi-DPP                                    23.91     24.72     24.18     21.96     22.24     22.61
Supervised & RL, Single-View
  LSTM [55]: Merge-Views                       27.87     28.57     27.67     23.25     23.87     22.95
  LSTM [55]: Merge-Summaries                   26.61     27.25     26.43     22.86     23.59     22.76
  Convolutional [40]: Merge-Views              26.84     26.01     26.38     22.28     23.47     22.92
  RL Diversity [57]: Merge-Summaries           25.02     27.00     25.97     23.78     22.14     23.14
  RL Classification [58]: Merge-Summaries      26.01     26.71     26.27     22.74     23.68     23.37
Ours-supervised (Ablation Study)
  Only Cross-Entropy (CE)                      27.33     27.83     27.13     21.33     22.03     21.10
  Full: Multi-DPP + CE                         28.58     29.05     28.30     25.06     25.79     25.03

Table 1: MultiEgo benchmarking for two-view and three-view settings. Ours consistently outperforms the baselines on all the measures.
We also run an ablation study to show the effect of optimizing the supervised Multi-DPP measure as compared to using only Cross-Entropy.

As discussed in section 3.1, we categorize our dataset
sequences into six collections to facilitate the training and
evaluation. In our experiments, we follow a round-robin approach to train-validate-test the supervised/semi-supervised
learning frameworks. We use four collections for training,
one for validation, and one for testing across all the 30 different combinations of collections. Since no training is required for unsupervised approaches, we only test methods
on each collection separately and report their means.
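The count of 30 combinations follows from choosing an ordered (validation, test) pair out of six collections (6 × 5 = 30) and training on the remaining four. A small sketch that enumerates these splits is given below; the collection names are those listed in Section 5.4, while the function name `round_robin_splits` is an assumption made for illustration.

```python
from itertools import permutations
from typing import List, Tuple

# The six Multi-Ego collections named in the paper.
COLLECTIONS = [
    "Car-Ride", "College-Tour", "Library",
    "Indoors-Outdoors", "SeaWorld", "Supermarket",
]


def round_robin_splits(collections: List[str]) -> List[Tuple[List[str], str, str]]:
    """Return every (train, validation, test) split in which one collection
    validates, a different one tests, and the rest are used for training."""
    splits = []
    for validation, test in permutations(collections, 2):
        train = [c for c in collections if c not in (validation, test)]
        splits.append((train, validation, test))
    return splits


SPLITS = round_robin_splits(COLLECTIONS)
```

Averaging a metric over `SPLITS` corresponds to the reported means for the supervised models; for unsupervised methods only the test collection of each split matters.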
To evaluate the summaries produced by all the methods,
we follow the protocols in [29, 55, 15, 43] to compare
the predictions against the oracle summary. We start by
temporally segmenting all views into non-overlapping intervals using the KTS algorithm
[38]. Then, we repeatedly extract key-shot-based summaries using MAP [54], setting the summary-length threshold to 15% of a single
view’s length. For each selected shot, all of its frames
are included in the summary.
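To make the length budget concrete, the sketch below selects shots under a 15% frame budget and expands them to a frame-level mask. Several parts are assumptions rather than the paper's procedure: the shot boundaries are taken as given (standing in for the KTS output), each shot is scored by its mean frame score, and a 0/1 knapsack is used as a stand-in for the MAP-based extraction of [54], which is not reproduced here.

```python
from typing import List, Sequence, Tuple


def select_key_shots(
    shots: Sequence[Tuple[int, int]],
    frame_scores: Sequence[float],
    budget_percent: int = 15,
) -> List[int]:
    """Pick non-empty, non-overlapping shots (half-open [start, end) frame
    ranges) that maximize total shot score within the frame budget, and
    return a binary mask with every frame of each chosen shot set to 1."""
    n_frames = len(frame_scores)
    capacity = (budget_percent * n_frames) // 100
    lengths = [end - start for start, end in shots]
    # Assumed shot score: mean of the frame scores inside the shot.
    values = [sum(frame_scores[s:e]) / (e - s) for s, e in shots]

    # Standard 0/1 knapsack table: best[i][c] is the best value using
    # the first i shots with at most c frames.
    best = [[0.0] * (capacity + 1) for _ in range(len(shots) + 1)]
    for i in range(1, len(shots) + 1):
        w, v = lengths[i - 1], values[i - 1]
        for c in range(capacity + 1):
            best[i][c] = best[i - 1][c]
            if w <= c and best[i - 1][c - w] + v > best[i][c]:
                best[i][c] = best[i - 1][c - w] + v

    # Backtrack to recover which shots were taken.
    mask = [0] * n_frames
    c = capacity
    for i in range(len(shots), 0, -1):
        if best[i][c] != best[i - 1][c]:
            start, end = shots[i - 1]
            for t in range(start, end):
                mask[t] = 1
            c -= lengths[i - 1]
    return mask
```

For a 40-frame view the budget is 6 frames, so two high-scoring 3-frame shots would be kept in full while longer shots are dropped.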
5.3. Performance Evaluation
We follow [36, 34, 55, 29, 8] in using F1-score, precision, and recall to evaluate the quality of the produced summaries by comparing frame-level correspondences between
the predicted summary and the ground-truth summary. Table 1 shows the mean precision, recall, and F1-score across
all the training-validation-testing combinations for both
the two-view and three-view settings.
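For reference, frame-level precision, recall, and F1-score can be computed from two binary frame masks as in the sketch below. The function name and the convention of returning 0 when a denominator is empty are assumptions; the paper does not specify how degenerate cases are handled.

```python
from typing import Sequence, Tuple


def frame_level_prf(
    predicted: Sequence[int], ground_truth: Sequence[int]
) -> Tuple[float, float, float]:
    """Compare two binary frame masks of equal length and return
    (precision, recall, F1). Empty denominators yield 0.0 (assumed)."""
    if len(predicted) != len(ground_truth):
        raise ValueError("masks must cover the same frames")
    overlap = sum(1 for p, g in zip(predicted, ground_truth) if p and g)
    n_predicted = sum(1 for p in predicted if p)
    n_truth = sum(1 for g in ground_truth if g)
    precision = overlap / n_predicted if n_predicted else 0.0
    recall = overlap / n_truth if n_truth else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1
```

With a predicted mask `[1, 1, 0, 0, 1, 0]` and a ground-truth mask `[1, 0, 0, 1, 1, 0]`, two of the three predicted frames are correct and two of the three ground-truth frames are recovered, so all three measures equal 2/3.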
In general, supervised frameworks perform better than
unsupervised ones because they learn from human annotations.
Among the unsupervised methods, [34, 31, 12, 39] obtain the lowest performance, indicating their inability to adapt to the visual
changes caused by egocentric motion in the absence of
summary labels. However, adversarial training [29]
seems to improve the results even in a single-view setting,
since the learned distribution converges to the true data distribution
and the model better learns to isolate egocentric noise. Similarly, the supervised single-view BiLSTM [55] and Convolutional Sequences [40] adapt reasonably well to egocentric
visual noise by utilizing the summary labels. Only our model
monitors the egocentric-introduced noise and processes data
in a proper temporal order, achieving the best performance
in both unsupervised and supervised comparisons.
To study the impact of enforcing diversity, we run an
ablation study by evaluating our supervised approach while
optimizing only the cross-entropy loss (Only Cross-Entropy
(CE) in Table 1). This corresponds to training our model
to select only representative views, without explicitly
enforcing diversity. Evidently, adding the Multi-DPP measure
to the CE loss improves the results, especially in the three-view setting, due to the larger amount of input footage to
diversify. It is worth noting that using only Multi-DPP is
equivalent to our unsupervised version.
Generally, performance in the two-view setting is higher than in the three-view setting,
although the ranking of the methods remains the same. This is because the problem becomes more complex when more views have to be summarized, which causes performance
to drop. Additionally, the performance gap between our approach and the baselines widens as we
move from the two-view to the three-view setting. In theory,
we expect approaches such as [40, 55, 29, 57] to lose performance as the number of views grows, and this is backed
up empirically. Moreover, whether we concatenate views or
concatenate summaries in order to adapt [55, 57, 40, 29],
the complexity of the adaptation is unnecessarily high (either a larger DPP kernel in the case of view concatenation, or
processing each view separately in the summary-concatenation
scenario). Our proposed approach uses a max-pool operation as well as view-quality multiplication to represent all
views while preserving computational and memory efficiency.

Method                      Office  Campus   Lobby
Graph [37]                    41.3    49.1    73.4
RandomWalk [8]                75.8    61.6    86.8
RoughSets [23]                75.8    62.1    84.2
BipartiteOPF [18]             81.8    71.8    88.2
Unpaired Data [39]            91.0    80.5    89.3
Joint embedding [34]          89.4    77.8    92.5
Convolutional [40]-Unsup      90.2    78.6    92.5
Convolutional [40]-Sup        94.0    81.9    93.0
RL Diversity [57]             92.9    80.6    91.4
RL Classification [58]        92.1    82.5    92.2
Ours-unsupervised             90.7    81.2    92.7
Ours-supervised               94.2    86.1    93.4

Table 2: Fixed-cameras multi-view F1-scores. We train our supervised model on Multi-Ego and test it on three datasets.

Finally, we investigate the performance of our approach
in the fixed-camera multi-view setting, which is a special case
of our generic configuration. We evaluate our model on
three standard fixed-camera multi-view benchmarks: the Office, Campus, and Lobby datasets [8, 32]. We train our
supervised model on our Multi-Ego dataset and evaluate
it on each of these datasets. Table 2 shows a substantial success in transferring the learning from one domain (egocentric multi-view) to another (static multi-view) without the need for specifically tailored training data. Thus, we
provide the first supervised multi-view summarization that
significantly outperforms state-of-the-art unsupervised approaches while only being trained on our data. Additionally, our unsupervised model outperforms them due to explicitly enforcing diversity and quality constraints. The consistent advantage in the three experimental environments
for both our supervised and unsupervised models demonstrates the versatility of the proposed approach in handling
static/egocentric videos in a generic summarization setting.
5.4. Supervised Scalability Analysis
In this section, we study our supervised framework’s capability to learn from sequences with a varying number of views by verifying whether the training process can exploit any
increase in data regardless of its number of views. We
start by splitting our data into two categories with nearly the
same number of sequences: (a) three-view (Collections:
Indoors-Outdoors, SeaWorld, Supermarket), and (b) two-view (Collections: Car-Ride, College-Tour, Library). We
investigate the performance of three train/test configurations where the testing data is limited to a single category:
1. Same category training (2×two-view & 2×three-view):
Train on 2 collections from the same category as testing.
2. Different category training (3×two-view & 3×three-view): Train on 3 collections from one category, and then
test on a collection belonging to a different category.
3. Training using data from the two categories (2×two-view + 3×three-view & 3×two-view + 2×three-view): Train
on data from both categories, and test on a collection
from one of the categories in the training data.

Test         Train                          Precision    Recall  F1-Score
two-view     2×two-view                         29.83     29.77     29.67
             3×three-view                       29.77     30.30      30.2
             2×two-view + 3×three-view          34.37     35.03     34.33
three-view   2×three-view                       18.53     18.80     18.33
             3×two-view                         18.23     18.27     17.67
             3×two-view + 2×three-view          21.53     21.87     21.33

Table 3: Scalability Analysis: Our framework can be trained and
tested on data of different number-of-views.

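The three configurations listed above can be written down explicitly for any held-out test collection. The sketch below is illustrative only: the category membership is taken from the text, the function name `scalability_configs` is hypothetical, and validation handling is omitted because the text does not specify it for this experiment.

```python
from typing import Dict, List

# Category membership as listed in the text.
CATEGORIES: Dict[str, List[str]] = {
    "two-view": ["Car-Ride", "College-Tour", "Library"],
    "three-view": ["Indoors-Outdoors", "SeaWorld", "Supermarket"],
}


def scalability_configs(test_collection: str) -> Dict[str, List[str]]:
    """Return the training collections of the three configurations
    for one held-out test collection."""
    same_category = None
    for category, collections in CATEGORIES.items():
        if test_collection in collections:
            same_category = category
    if same_category is None:
        raise ValueError(f"unknown collection: {test_collection}")
    other_category = next(c for c in CATEGORIES if c != same_category)

    # 1. Same category: the two remaining collections of the test category.
    same = [c for c in CATEGORIES[same_category] if c != test_collection]
    # 2. Different category: all three collections of the other category.
    different = list(CATEGORIES[other_category])
    # 3. Combined: both training pools together (five collections).
    return {
        "same-category": same,
        "different-category": different,
        "combined": same + different,
    }
```

For example, testing on Library gives 2×two-view (Car-Ride, College-Tour), 3×three-view, and the combined 2×two-view + 3×three-view pool, which correspond to the three two-view rows of Table 3.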
As shown in Table 3, training our framework on the same
category or on a different category obtains comparable results
when testing on both the two-view and three-view settings.
However, increasing the training data size by combining both
categories significantly improves the results. This shows
that our model can be trained and tested on data with various
numbers of views and is also able to take advantage of any increase
in data regardless of its number-of-views setting.
6. Conclusion
In this work, we proposed the problem of multi-view
video summarization for dynamically moving cameras that
often do not share the same field-of-view. Our formulation
provides the first supervised solution to multi-stream summarization in addition to an unsupervised adaptation. Unlike previous work in multi-view video summarization, we
presented a generic approach that can be trained in a supervised or unsupervised setting to generate a comprehensive summary for all views with no prior assumptions on
camera placement or labels. It identifies important events
across all views and selects the view(s) best illustrating each
event. We also introduced a new dataset, recorded in uncontrolled environments and including a variety of real-life activities. When evaluated on the collected
benchmark and three additional standard multi-view benchmark datasets, our framework outperformed all baselines, including
state-of-the-art supervised, reinforcement-learning, and unsupervised
single- and multi-view summarization methods.
References
[1] Ido Arev, Hyun Soo Park, Yaser Sheikh, Jessica Hodgins,
and Ariel Shamir. Automatic editing of footage from multiple social cameras. ACM Transactions on Graphics (TOG),
33(4):81, 2014.
[2] Barak Ariel, William A Farrar, and Alex Sutherland. The
effect of police body-worn cameras on use of force and citizens’ complaints against the police: A randomized controlled trial. Journal of quantitative criminology, 31(3):509–
535, 2015.
[3] Chen. Video to text summary: Joint video summarization
and captioning with recurrent neural networks. In BMVC,
pages 1–10, 2017.
[4] Mohamed Elfeki and Ali Borji. Video summarization via
actionness ranking. Winter Applications in Computer Vision
(WACV), 2019.
[5] Mohamed Elfeki, Camille Couprie, Morgane Riviere, and
Mohamed Elhoseiny. Gdpp: Learning diverse generations using determinantal point process. arXiv preprint
arXiv:1812.00068, 2018.
[6] Chenyou Fan, Jangwon Lee, Mingze Xu, Krishna Kumar
Singh, Yong Jae Lee, David J Crandall, and Michael S
Ryoo. Identifying first-person camera wearers in third-person videos. arXiv preprint arXiv:1704.06340, 2017.
[7] Yachuang Feng, Yuan Yuan, and Xiaoqiang Lu. Learning
deep event models for crowd anomaly detection. Neurocomputing, 219:548–556, 2017.
[8] Yanwei Fu, Yanwen Guo, Yanshu Zhu, Feng Liu, Chuanming Song, and Zhi-Hua Zhou. Multi-view video summarization. IEEE Transactions on Multimedia, 12(7):717–729,
2010.
[9] Boqing Gong, Wei-Lun Chao, Kristen Grauman, and Fei
Sha. Diverse sequential subset selection for supervised video
summarization. In Z. Ghahramani, M. Welling, C. Cortes,
N. D. Lawrence, and K. Q. Weinberger, editors, Advances
in Neural Information Processing Systems 27, pages 2069–
2077. Curran Associates, Inc., 2014.
[10] Swati Gupta. 1 determinantal point processes.
[11] Michael Gygli, Helmut Grabner, Hayko Riemenschneider,
and Luc Van Gool. Creating summaries from user videos.
In European conference on computer vision, pages 505–520.
Springer, 2014.
[12] Michael Gygli, Helmut Grabner, and Luc Van Gool. Video
summarization by learning submodular mixtures of objectives. In Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, pages 3090–3098, 2015.
[13] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term
memory. Neural computation, 9(8):1735–1780, 1997.
[14] Richang Hong, Lei Li, Junjie Cai, Dapeng Tao, Meng Wang,
and Qi Tian. Coherent semantic-visual indexing for large-scale image retrieval in the cloud. IEEE Transactions on
Image Processing, 2017.
[15] Zhong Ji, Kailin Xiong, Yanwei Pang, and Xuelong Li.
Video summarization with attention-based encoder-decoder
networks. arXiv preprint arXiv:1708.09545, 2017.
[16] Zhong Ji, Yuanyuan Zhang, Yanwei Pang, and Xuelong Li.
Hypergraph dominant set based multi-video summarization.
Signal Processing, 148:114–123, 2018.
[17] Zhong Ji, Yuanyuan Zhang, Yanwei Pang, Xuelong Li, and
Jing Pan. Multi-video summarization with query-dependent
weighted archetypal analysis. Neurocomputing, 332:406–
416, 2019.
[18] Sanjay K Kuanar, Kunal B Ranga, and Ananda S Chowdhury. Multi-view video summarization using bipartite matching constrained optimum-path forest clustering. IEEE Transactions on Multimedia, 17(8):1166–1173, 2015.
[19] Alex Kulesza and Ben Taskar. Structured determinantal point
processes. In NIPS, 2010.
[20] Alex Kulesza and Ben Taskar. Learning determinantal point
processes. 2011.
[21] Alex Kulesza, Ben Taskar, et al. Determinantal point processes for machine learning. Foundations and Trends® in
Machine Learning, 5(2–3):123–286, 2012.
[22] Yong Jae Lee and Kristen Grauman. Predicting important
objects for egocentric video summarization. International
Journal of Computer Vision, 114(1):38–55, 2015.
[23] Ping Li, Yanwen Guo, and Hanqiu Sun. Multi-keyframe abstraction from videos. In 2011 18th IEEE International Conference on Image Processing, pages 2473–2476. IEEE, 2011.
[24] Yingbo Li and Bernard Merialdo. Multi-video summarization based on video-mmr. In 11th International Workshop on
Image Analysis for Multimedia Interactive Services WIAMIS
10, pages 1–4. IEEE, 2010.
[25] Yandong Li, Liqiang Wang, Tianbao Yang, and Boqing
Gong. How local is the local diversity? reinforcing sequential determinantal point processes with dynamic ground sets
for supervised video summarization. In Proceedings of the
European Conference on Computer Vision (ECCV), pages
151–167, 2018.
[26] Jun Liu, Amir Shahroudy, Dong Xu, and Gang Wang.
Spatio-temporal lstm with trust gates for 3d human action
recognition. In European Conference on Computer Vision,
pages 816–833. Springer, 2016.
[27] Yu-Fei Ma, Lie Lu, Hong-Jiang Zhang, and Mingjing Li. A
user attention model for video summarization. In Proceedings of the tenth ACM international conference on Multimedia, pages 533–542. ACM, 2002.
[28] Odile Macchi. The coincidence approach to stochastic point
processes. Advances in Applied Probability, 7(1):83–122,
1975.
[29] Behrooz Mahasseni, Michael Lam, and Sinisa Todorovic.
Unsupervised video summarization with adversarial lstm
networks. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit, pages 1–10, 2017.
[30] Jingjing Meng, Suchen Wang, Hongxing Wang, Junsong
Yuan, and Yap-Peng Tan. Video summarization via multi-view representative selection. In Proceedings of the IEEE
International Conference on Computer Vision Workshops,
pages 1189–1198, 2017.
[31] Feiping Nie, Heng Huang, Xiao Cai, and Chris H Ding. Efficient and robust feature selection via joint l2, 1-norms minimization. In Advances in neural information processing systems, pages 1813–1821, 2010.
[32] Shun-Hsing Ou, Chia-Han Lee, V Srinivasa Somayazulu,
Yen-Kuang Chen, and Shao-Yi Chien. On-line multi-view video summarization for wireless video sensor network. IEEE Journal of Selected Topics in Signal Processing,
9(1):165–179, 2015.
[33] Pingbo Pan, Zhongwen Xu, Yi Yang, Fei Wu, and Yueting Zhuang. Hierarchical recurrent neural encoder for video
representation with application to captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, pages 1029–1038, 2016.
[34] Rameswar Panda and Amit Roy Chowdhury. Multi-view
surveillance video summarization via joint embedding and
sparse optimization. IEEE Transactions on Multimedia,
2017.
[35] Rameswar Panda, Abir Dasy, and Amit K Roy-Chowdhury.
Video summarization in a multi-view camera network. In
Pattern Recognition (ICPR), 2016 23rd International Conference on, pages 2971–2976. IEEE, 2016.
[36] Rameswar Panda, Niluthpol Chowdhury Mithun, and Amit
Roy-Chowdhury. Diversity-aware multi-video summarization. IEEE Transactions on Image Processing, 2017.
[37] Yuxin Peng and Chong-Wah Ngo. Clip-based similarity
measure for query-dependent clip retrieval and video summarization. IEEE Transactions on Circuits and Systems for
Video Technology, 16(5):612–627, 2006.
[38] Danila Potapov, Matthijs Douze, Zaid Harchaoui, and
Cordelia Schmid. Category-specific video summarization.
In European conference on computer vision, pages 540–555.
Springer, 2014.
[39] Mrigank Rochan and Yang Wang. Video summarization by
learning from unpaired data. In Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition,
pages 7902–7911, 2019.
[40] Mrigank Rochan, Linwei Ye, and Yang Wang. Video summarization using fully convolutional sequence networks. In
Proceedings of the European Conference on Computer Vision (ECCV), pages 347–363, 2018.
[41] Aidean Sharghi, Ali Borji, Chengtao Li, Tianbao Yang, and
Boqing Gong. Improving sequential determinantal point processes for supervised video summarization. In Proceedings
of the European Conference on Computer Vision (ECCV),
pages 517–533, 2018.
[42] Aidean Sharghi, Jacob S Laurel, and Boqing Gong. Query-focused video summarization: Dataset, evaluation, and a
memory network based approach. In Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition,
2017.
[43] Yale Song, Jordi Vallmitjana, Amanda Stent, and Alejandro Jaimes. Tvsum: Summarizing web videos using titles.
In Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition, pages 5179–5187, 2015.
[44] Nitish Srivastava, Elman Mansimov, and Ruslan Salakhutdinov. Unsupervised learning of video representations using
lstms. In Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, pages 843–852, 2015.
[45] Jay Stanley. Police body-mounted cameras: With right policies in place, a win for all. New York: ACLU, 2013.
[46] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet,
Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent
Vanhoucke, Andrew Rabinovich, et al. Going deeper with
convolutions. Cvpr, 2015.
[47] Subhashini Venugopalan, Marcus Rohrbach, Jeffrey Donahue, Raymond Mooney, Trevor Darrell, and Kate Saenko.
Sequence to sequence-video to text. In Proceedings of the
IEEE international conference on computer vision, pages
4534–4542, 2015.
[48] Anatoly Vershik. Asymptotic Combinatorics with Applications to Mathematical Physics: A European Mathematical
Summer School held at the Euler Institute, St. Petersburg,
Russia, July 9-20, 2001. Springer, 2003.
[49] Feng Wang and Bernard Merialdo. Multi-document video
summarization. In 2009 IEEE International Conference on
Multimedia and Expo, pages 1326–1329. IEEE, 2009.
[50] Erkun Yang, Cheng Deng, Wei Liu, Xianglong Liu, Dacheng
Tao, and Xinbo Gao. Pairwise relationship guided deep hashing for cross-modal retrieval. In AAAI, pages 1618–1625,
2017.
[51] Huan Yang, Baoyuan Wang, Stephen Lin, David Wipf,
Minyi Guo, and Baining Guo. Unsupervised extraction of
video highlights via robust recurrent auto-encoders. In Proceedings of the IEEE International Conference on Computer
Vision, pages 4633–4641, 2015.
[52] Huan Yang, Baoyuan Wang, Stephen Lin, David Wipf,
Minyi Guo, and Baining Guo. Unsupervised extraction of
video highlights via robust recurrent auto-encoders. In Proceedings of the IEEE International Conference on Computer
Vision, pages 4633–4641, 2015.
[53] Ryo Yonetani, Kris M Kitani, and Yoichi Sato. Ego-surfing
first-person videos. In Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, pages 5445–
5454, 2015.
[54] Ke Zhang, Wei-Lun Chao, Fei Sha, and Kristen Grauman. Summary transfer: Exemplar-based subset selection
for video summarization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages
1059–1067, 2016.
[55] Ke Zhang, Wei-Lun Chao, Fei Sha, and Kristen Grauman.
Video summarization with long short-term memory. In European Conference on Computer Vision, pages 766–782.
Springer, 2016.
[56] Bin Zhao, Xuelong Li, and Xiaoqiang Lu. Hierarchical recurrent neural network for video summarization. In Proceedings of the 2017 ACM on Multimedia Conference, pages
863–871. ACM, 2017.
[57] Kaiyang Zhou, Yu Qiao, and Tao Xiang. Deep reinforcement learning for unsupervised video summarization with
diversity-representativeness reward. In Thirty-Second AAAI
Conference on Artificial Intelligence, 2018.
[58] Kaiyang Zhou, Tao Xiang, and Andrea Cavallaro. Video
summarisation by classification with deep reinforcement
learning. The Thirty-Second AAAI Conference on Artificial
Intelligence (AAAI-18), 2018.
[59] Wentao Zhu, Cuiling Lan, Junliang Xing, Wenjun Zeng,
Yanghao Li, Li Shen, Xiaohui Xie, et al. Co-occurrence
feature learning for skeleton based action recognition using
regularized deep lstm networks. In AAAI, volume 2, page 8,
2016.