Dynamic textures (DT) are videos of non-rigid dynamical objects, such as fire and waves, which constantly change their shape and appearance over time. Most of the prior work on DT analysis dealt with the classification of videos of a single DT or the segmentation of videos containing multiple DTs. We consider the problem of joint segmentation and categorization of videos of multiple DTs under varying viewpoint, scale, and illumination conditions. We formulate this problem of assigning a class label to each pixel in the video as the minimization of an energy functional composed of two terms. The first term measures the cost of assigning a DT category to each pixel. For this purpose, we introduce a bag of dynamic appearance features (BoDAF) approach, in which we fit each video with a linear dynamical system (LDS) and use features extracted from the parameters of the LDS for classification. This BoDAF approach can be applied to the whole video, thus providing a framework for classifying videos of a single DT, or to image patches (superpixels), thus providing the cost of assigning a DT category to each pixel. The second term is a spatial regularization cost that encourages nearby pixels to have the same label. The minimization of this energy functional is carried out using the random walker algorithm. Experiments on existing databases of a single DT demonstrate the superiority of our BoDAF approach with respect to state-of-the-art methods. To the best of our knowledge, the problem of joint segmentation and categorization of videos of multiple DTs has not been addressed before, hence there is no standard database to test our method. We therefore introduce a new database of videos annotated at the pixel level and evaluate our approach on this database with promising results.
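To make the LDS-fitting step concrete, here is a minimal sketch of identifying the parameters of a linear dynamical system from a video, following the standard SVD-based dynamic-texture identification procedure; the function name, state dimension, and details are illustrative assumptions and may differ from the exact procedure used in the paper.

```python
# Sketch of SVD-based LDS identification for a video (an assumption:
# this follows the standard dynamic-texture fitting, not necessarily
# the paper's exact implementation).
import numpy as np

def fit_lds(frames, n_states=5):
    """frames: (num_frames, height, width) array. Returns (A, C, X)."""
    F = frames.reshape(frames.shape[0], -1).T        # pixels x frames
    U, s, Vt = np.linalg.svd(F, full_matrices=False)
    C = U[:, :n_states]                              # observation matrix
    X = np.diag(s[:n_states]) @ Vt[:n_states]        # hidden states over time
    # State-transition matrix via least squares: X[:, 1:] ~= A @ X[:, :-1]
    A = X[:, 1:] @ np.linalg.pinv(X[:, :-1])
    return A, C, X

rng = np.random.default_rng(0)
video = rng.random((10, 8, 8))                       # toy 10-frame "video"
A, C, X = fit_lds(video, n_states=3)
print(A.shape, C.shape, X.shape)                     # (3, 3) (64, 3) (3, 10)
```

Features for the BoDAF descriptor would then be extracted from the estimated parameters such as A and C.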
Existing dynamic texture databases are not well-suited for testing our joint segmentation and categorization algorithm, because most of the video sequences in these databases contain only a single texture and seldom any background. However, the DynTex dataset is well-suited for our purposes. This dataset consists of 656 video sequences, without annotation at the pixel level. We used the 3 largest classes we could obtain from this dataset: waves, flags, and fountains. This gave us 117 video sequences, which we manually annotated at the pixel level. Sample annotations from the dataset are shown below.
The annotation can be downloaded from the Vision Lab @ JHU data repository (here), which requires user registration. If you use the annotation in your publication, we would be grateful if you could cite the paper mentioned below. The .mat file contains a structure that has 4 fields: the filename field gives the original sequence from the DynTex database, the thumbnail field contains a single frame of the video sequence, and the annotation field contains the pixel-level annotation for the 4 classes. The classes are 1-Background, 2-Waves, 3-Flags, and 4-Fountain. A big thanks goes out to Daniele Perrone of the Vision Lab at HWU for all his help with the annotations.
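Once an annotation label map has been loaded (e.g. with `scipy.io.loadmat`), the class encoding above can be used to inspect it. The sketch below uses a toy label map in place of a real annotation frame; only the 1-4 class encoding comes from the dataset description, and the helper function is a hypothetical name.

```python
# Sketch of working with an annotation label map; the toy array below
# stands in for a real annotation frame from the .mat file.
import numpy as np

# Class encoding from the dataset description.
CLASS_NAMES = {1: "Background", 2: "Waves", 3: "Flags", 4: "Fountain"}

def label_histogram(annotation):
    """Count how many pixels belong to each class in a label map."""
    labels, counts = np.unique(annotation, return_counts=True)
    return {CLASS_NAMES[int(l)]: int(c) for l, c in zip(labels, counts)}

# Toy 4x4 label map standing in for a real annotation.
toy = np.array([[1, 1, 2, 2],
                [1, 2, 2, 2],
                [3, 3, 4, 4],
                [3, 3, 4, 1]])
print(label_histogram(toy))
# -> {'Background': 4, 'Waves': 5, 'Flags': 4, 'Fountain': 3}
```

The same histogram computed over a real annotation frame gives a quick sanity check that the labels fall in the expected 1-4 range.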