Int J Multimed Info Retr (2017) 6:233–249
DOI 10.1007/s13735-017-0126-y

REGULAR PAPER

Unsupervised group feature selection for media classification

Maia Zaharieva1 · Christian Breiteneder1 · Marcus Hudec2

Received: 16 March 2017 / Revised: 2 May 2017 / Accepted: 8 May 2017 / Published online: 25 May 2017
© The Author(s) 2017. This article is an open access publication

Abstract  The selection of an appropriate feature set is crucial for the efficient analysis of any media collection. In general, feature selection strongly depends on the data and commonly requires expert knowledge and previous experiments in related application scenarios. Current unsupervised feature selection methods usually ignore existing relationships among the components of multi-dimensional features (group features) and operate on single feature components. In most applications, features carry little semantics. Thus, it is of minor relevance whether a feature set consists of complete features or of a selection of single feature components. However, in some domains, such as content-based audio retrieval, features are designed in a way that they, as a whole, have considerable semantic meaning. The disruption of a group feature in such application scenarios impedes the interpretability of the results. In this paper, we propose an unsupervised group feature selection algorithm based on canonical correlation analysis (CCA). Experiments with different audio and video classification scenarios demonstrate the outstanding performance of the proposed approach and its robustness across different datasets.

Keywords  Feature selection · CCA · Audio classification · Video classification

Corresponding author: Maia Zaharieva, maia.zaharieva@tuwien.ac.at
Christian Breiteneder, christian.breiteneder@tuwien.ac.at
Marcus Hudec, marcus.hudec@univie.ac.at

1 TU Wien (TUW), Interactive Media Systems, Vienna, Austria
2 University of Vienna, Data Analytics and Computing, Vienna, Austria

1 Introduction

The selection of an appropriate feature set is a crucial step in the development of any approach for content-based media classification. The decision commonly depends on the specific characteristics of the underlying data, the target application scenario, and the application domain. In the context of media classification, a large number of features are usually employed to cope with the simple and rather syntactic nature of individual features. This practice is based on the assumption that different features describe different and complementary (orthogonal) qualities of the underlying data. Such feature sets commonly include features that are dependent on each other as well as features that are not relevant at all, which induces unnecessary computational costs and may lead to overfitting. Additionally, the high dimensionality may cause further problems, usually referred to as the curse of dimensionality.

Numerous methods have been developed to reduce the dimensionality while preserving the same degree of class discrimination. Such methods can be grouped into dimension reduction and feature subset selection approaches. The most widely used linear dimension reduction method is principal component analysis (PCA), which transforms a set of highly correlated features into a smaller set of linearly uncorrelated variables called principal components. Usually, these components cannot be directly or easily interpreted. Feature selection, in contrast, tries to identify the most important and least redundant features from a set of potential features.
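The interpretability argument can be made concrete with a small sketch on synthetic data (an illustration only, not from the paper): every principal component produced by PCA is a dense linear combination of all input components, so no component maps back to a single original feature, whereas feature selection returns a subset of the original, nameable features.

```python
# Minimal illustration (synthetic data): PCA yields mixtures of all input
# components, which is why principal components resist direct interpretation.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))            # 500 samples, 10 feature components
X[:, 1] = 0.9 * X[:, 0] + 0.1 * X[:, 1]   # two strongly correlated components

pca = PCA(n_components=3).fit(X)
# Every row of components_ has non-zero weights on (almost) all 10 inputs,
# so a principal component cannot be traced back to one original feature.
print(pca.components_.round(2))
```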
Existing approaches fall into four main categories based on the evaluation criterion applied: embedded, wrapper, filter, and hybrid methods [28,54]. In our approach, we employ a filter-based method, which means that the properties of the data are used as a criterion for the feature selection and the selected features can be passed on to any classifier.

In this paper, we introduce a group feature selection approach based on canonical correlation analysis (CCA). In general, CCA coefficients indicate the correlation between two features of varying dimensionality [20]. Large coefficients signify high correlations and can, therefore, serve as indicators for redundant features. At the opposite end of the scale, low-correlated features provide additional information, which can improve the descriptive power of the feature set. In previous work, we explored the correlation-based ranking of feature pairs in order to perform feature selection [45,46]. In contrast to our previous approach, in this work we investigate each feature separately with respect to the currently selected feature set, improving both the quality and the classification performance of the selected features.

Related work differs in the objectives pursued by the feature selection process [7]. Improving the performance (effectiveness) and providing faster and more cost-effective (efficiency) predictors are commonly addressed goals. Similar to some other authors [15], we add a third goal that we consider central to our approach: a better understanding of the underlying data. Most approaches for feature selection and dimensionality reduction do not differentiate whether or not a feature consists of several components. These methods usually do not aim at selecting entire features but at selecting components only. The resulting features are then (linear) combinations of input features, which makes the interpretation of the data more complex. In the presence of multi-dimensional features that carry significant semantics, we therefore require input feature components to be kept together. We use the term group feature to emphasize this fact (see the sketch below). The consideration of group features offers various advantages: First and most important, features consisting of several components, especially when carrying significant meaning, should not be torn apart. Second, group features are less at risk of overfitting and show less sensitivity to different datasets and application scenarios. Third, the approach is more cost-effective, since the components of a group feature are usually computed at once.

Most existing feature selection approaches aim at optimization for a specific application scenario. In such a context, feature selection is usually based on previous experiments in related application scenarios. The major drawback of such a strategy is the use of prior information about the content, such as the existence of different content elements for audio-based classification (e.g., speech, music), which may lead to a bias in the selection process. It is widely acknowledged that even a slight alteration of the application scenario, such as the consideration of an additional genre in a genre recognition approach, may call the appropriateness of the previously employed features into question.
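Before continuing, the group feature notion introduced above can be made concrete with a minimal sketch (feature names and dimensionalities are illustrative assumptions, not taken from the paper): each group feature owns a named block of columns in the data matrix, and selection operates on whole blocks rather than on single columns.

```python
# Hypothetical group-feature layout: each named feature owns a block of
# columns in the data matrix; selection keeps whole blocks together.
import numpy as np

FEATURE_GROUPS = {
    "mfcc":          slice(0, 13),   # 13-dimensional group feature
    "chroma":        slice(13, 25),  # 12-dimensional group feature
    "zero_crossing": slice(25, 26),  # scalar feature with one component
}

def group(X: np.ndarray, name: str) -> np.ndarray:
    """Return all components of one group feature as a single block."""
    return X[:, FEATURE_GROUPS[name]]

X = np.random.rand(100, 26)          # 100 samples, 26 raw components
mfcc = group(X, "mfcc")              # shape (100, 13) -- never torn apart
```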
Such alterations potentially result in the (partly resource-demanding) calculation of features that may not even contribute to the classification performance of the underlying approach. Therefore, we add robustness (stability) as a fourth and final objective for our feature selection approach. Robustness measures the number of commonly selected features in different runs or experiments (one plausible instantiation is sketched at the end of this section). For environments with slightly changing scenarios or frequently changing data, it is of eminent importance that features are robust enough to cope with these changes while still achieving good retrieval performance.

This paper includes the following contributions:

1. We propose an improved version of the CCA-based feature selection method presented in [45,46] that essentially compares CCA values between a single feature candidate and the set of already selected features. The approach is unsupervised and maintains the interpretability of entire features, as the underlying feature components are kept together.

2. We introduce robustness as a goal for feature selection in the case of changing media repositories and define a measure for robustness. We argue and demonstrate that many of the excellent results in media feature selection are clearly tailored to specific datasets and that results drop immediately when the methods are confronted with larger variations in the data. The measure for robustness allows for comparing different approaches in this respect.

3. We perform a thorough evaluation of the proposed approach on datasets from the audio and video domains with different application scenarios. Evaluation results demonstrate the outstanding performance of the approach and its robustness across different datasets. Our unsupervised approach is competitive with state-of-the-art approaches that stress the importance of the "right" feature set and that, therefore, require careful manual selection of their features.

4. We question the appropriateness of application-oriented approaches, at least for many tasks in the audio domain, specifically the ones investigated here, and favor a robust, data-oriented approach for feature selection. We understand that the quality and robustness of the features selected by our approach are primarily due to the powerful features developed in the audio domain so far. We are aware that more evaluations on different datasets and application scenarios will be necessary to investigate this issue further, but we believe that the suggested method will have an impact on how features will be selected in the audio domain in the future.

This paper is organized as follows. Section 2 presents an overview of related work. Section 3 introduces the proposed CCA-based feature selection approach. Section 4 outlines the evaluation setup for the performed experiments. Section 5 presents the results of an experimental study of the proposed approach and a comparison with representative methods. Section 6 concludes this work.
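To close this section, here is one plausible instantiation of the robustness measure referenced above (an assumption for illustration; the paper's own measure is defined later): the average pairwise overlap of the feature sets selected in different runs or on different datasets.

```python
# One plausible robustness measure (illustrative assumption, not the
# paper's definition): average pairwise Jaccard overlap of the feature
# sets selected in different runs or datasets.
from itertools import combinations

def robustness(selections):
    """Average pairwise overlap of selected feature sets (1.0 = identical)."""
    pairs = list(combinations(selections, 2))
    if not pairs:
        return 1.0
    return sum(len(a & b) / len(a | b) for a, b in pairs) / len(pairs)

runs = [{"mfcc", "chroma", "rms"}, {"mfcc", "chroma"}, {"mfcc", "rms"}]
print(round(robustness(runs), 3))  # 0.556 -- moderately stable selections
```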
2 Related work

Feature selection is commonly applied to reduce the dimensionality of the exploited data, to remove irrelevant items, to increase accuracy, and to improve result comprehensibility [15]. Feature selection approaches can be broadly divided into embedded, wrapper, filter, and hybrid methods [28,54]. Out of these four categories, filter approaches are widely employed for their efficiency and ability to generalize, since they are not bound to the bias of any learning algorithm.

Filter methods mainly follow two approaches for feature selection: individual feature evaluation and feature subset evaluation. Individual feature evaluation methods assess each feature component individually and assign weights according to their relevance, resulting in a ranked list of features [17,29]. Subset evaluation approaches assess features in the context of generated feature subsets [34,61]. Such methods iteratively generate feature subsets using different search strategies until a predefined stopping criterion is met. The evaluation of the generated feature subsets usually employs statistical measures to assess their relevance and redundancy. In this context, correlation analysis is often applied to measure feature redundancy. In our previous work, we proposed a CCA-based approach that involves information gain as a feature selection rule [46]. We performed a thorough comparison with well-established filter-based selection methods: Chi-square [33], information gain [17], information gain ratio [42], ReliefF [29], consistency-based [34], and correlation-based [16] feature selection. Our approach outperformed the investigated algorithms in terms of accuracy and runtime and achieved a high robustness of the feature selections across different datasets in the same audio retrieval task.

Both subset and individual feature evaluation approaches ignore any existing relationships among the components of multi-dimensional features. Recently, Xu et al. [57] proposed a gradient boosted feature selection approach that is able to consider predefined group feature structures. However, the proposed approach employs the group structure information for controlled boosting only and, thus, components from the same multi-dimensional feature are, in general, merely preferred for selection. As a result, the final feature set still consists of single feature components (from potentially few multi-dimensional features). In contrast, the group lasso algorithm [62] is an extension of the popular lasso approach [55] for the selection of predefined groups of variables in regression models [35]. Although the estimates are invariant under (groupwise) orthogonal transformations [62], the fitted models may not be sparse. Simon et al. propose the sparse group lasso, which yields solutions that are sparse at both the group and the individual feature level [13,52]. However, Hall and Miller show that such model-based approaches may still be inadequate for detecting all influential features [18].

Current research on audio and video classification focuses mostly on the development of new features and classification methods [11,26,27,30,38]. Few works exploit feature selection in the context of genre classification [9,12,22,50,60], musical instrument classification [8,51], and emotion/mood classification [43,59], and they commonly use a single dataset only. Most approaches consider wrapper-based feature selection, i.e., the underlying feature selection process is supervised, maximizing the classification accuracy of a learning algorithm. Common approaches for the identification of potential feature sets include genetic algorithms [50] and greedy selection heuristics [12,43,59,60]. Although wrapper-based feature selection methods usually outperform filter-based approaches, they commonly suffer from overfitting to the training data and, hence, decreased generalization ability.
Existing filter-based feature selection approaches in the context of media classification primarily employ individual feature evaluation to assess the quality of each feature component. For example, Simmermacher et al. [51] and Deng et al. [8] explore the performance of information gain, information gain ratio, and symmetrical uncertainty for musical instrument classification. The authors report that information gain achieves performance comparable to the other two filter-based methods. Similarly, Doraisamy et al. [9] evaluate information gain ratio, Chi-square, correlation-based, and SVM-based feature selection in the context of genre classification of traditional Malay music. The achieved results indicate similar performance of the considered feature selection approaches on the employed dataset. Recently, Huang et al. [22] proposed a self-adaptive harmony search algorithm for the classification of music genres. The authors employ a local feature selection strategy on pairs of music genres, which limits the applicability of the approach to scenarios with available prior knowledge about the investigated dataset.

Currently, existing approaches and reported evaluations on audio and video classification are commonly limited to a specific task and a single dataset. In contrast, we explore various application scenarios and provide insights into the selected features using different audio and video datasets. Additionally, we investigate the influence of the data and the application scenario on the selected features.

3 Proposed approach

In this paper, we propose an unsupervised group feature selection approach, which is an extended version of [45,46]. The approach exploits the canonical correlations between features (of potentially strongly varying dimensionality) to estimate the relevance of every single feature. The assumption is that low-correlated features provide additional or complementary information, while highly correlated features indicate redundant information. Therefore, the inclusion of the latter would only increase the target dimensionality without improving the descriptive power of the resulting feature set.

CCA is a multivariate multiple regression method that measures correlations between multi-dimensional vectors (features) with potentially different numbers of variables (feature components) [20]. For each of the input vectors, base vectors are generated in a way that optimizes their correlations in terms of mutual information, which serves as an indicator for statistical coherence. The base vectors have at most the dimensionality of the lower-dimensional input vector and are invariant to affine transformations and to the underlying coordinate system. Therefore, CCA allows for the analysis of relationships between multiple dependent or independent variables. For the special case of low-dimensional features, i.e., features that comprise only one or two components, we employ the linear and multiple correlation approaches by Hotelling [20].

Let $X$ and $Y$ be two features of (potentially) different sizes: $X = [X_1, \ldots, X_p]$ and $Y = [Y_1, \ldots, Y_q]$. For each feature, there is a solution for the corresponding linear combinations $U = X\alpha$ and $V = Y\beta$.
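As an illustration of these quantities, the first canonical correlation between two such features can be computed numerically. The following is a minimal sketch on synthetic data; the use of scikit-learn is an assumption of the sketch, as the paper does not prescribe an implementation.

```python
# Minimal numerical sketch: first canonical correlation between a 13-dim
# feature X and a partially related 4-dim feature Y (synthetic data).
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 13))                         # feature with p = 13
Y = 0.5 * X[:, :4] + 0.5 * rng.normal(size=(200, 4))   # feature with q = 4

cca = CCA(n_components=1).fit(X, Y)
U, V = cca.transform(X, Y)        # canonical variates U and V
rho = np.corrcoef(U[:, 0], V[:, 0])[0, 1]
print(f"first canonical correlation: {rho:.3f}")
```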
To optimize the correlation coefficient $\rho$ between $U$ and $V$, the vectors $\alpha$ and $\beta$ have to be found such that $\rho$ is maximized, i.e., $\max_{\alpha,\beta} \rho(U, V)$, where $\rho(U, V)$ is given by:

$$
\rho(U, V) = \frac{\operatorname{Cov}(U, V)}{\sqrt{\operatorname{Var}(U)}\,\sqrt{\operatorname{Var}(V)}}
           = \frac{\operatorname{Cov}(X\alpha, Y\beta)}{\sqrt{\operatorname{Var}(X\alpha)}\,\sqrt{\operatorname{Var}(Y\beta)}}
           = \frac{\alpha'\operatorname{Cov}(X, Y)\beta}{\sqrt{\alpha'\operatorname{Var}(X)\alpha}\,\sqrt{\beta'\operatorname{Var}(Y)\beta}}. \quad (1)
$$

We consider the correlation coefficient $\rho$ between any two features $X$ and $Y$ as an estimator for their complementarity and, hence, their relevance. Initially, we rank all features according to their pairwise correlation coefficients. A high rank is assigned to weakly correlated and thus complementary feature pairs; strongly correlated and thus redundant feature pairs receive low ranks. The feature pair with the lowest correlation builds the initial target feature set. All other features are processed in a loop. In each iteration, an internal evaluation of the correlations of the remaining features to the current target set estimates whether or not a feature contributes additional information to the target set. For this purpose, we consider the current target set as a single multi-dimensional feature and compute its correlation to every feature in the remaining group feature set. The feature with the lowest correlation is added to the target set if this correlation does not exceed a predefined threshold, $th_c$. The purpose of the threshold is to avoid the inclusion of features that have a high correlation to the already selected set.

Each addition to the target set influences the correlations to the remaining features: the more features are included in the target set, the higher the correlations of the target set to the remaining features become. Therefore, the underlying CCA is re-initialized after each feature addition. In the original approach, we computed the CCA values only once and processed (added) the corresponding features based on this static list. The re-initialization of the CCA has considerable consequences for the selected features. We will show in the evaluation that the selected features demonstrate extraordinary quality: they are used throughout the paper for different datasets as well as different tasks and, in many cases, compete with feature sets that were specifically and carefully selected for these datasets and/or tasks. The increase in the correlations additionally allows for the autonomous termination of the feature selection process: if the lowest correlation between the target set and the remaining features exceeds the predefined threshold $th_c$, the selection process terminates. Additionally, an optional stopping criterion can be employed to terminate the feature selection process if, for example, a feature set of a certain size is desired. In this work, we do not employ such a stopping criterion but investigate all feature pairs to autonomously identify the optimal set of features for the given data. Algorithm 1 illustrates a simplified scheme for the proposed group feature selection approach.

Algorithm 1: Group Feature Selection Scheme

    Input : fS   ... input feature set
            thc  ... correlation threshold
            [maxF ... max. number of features]   /* optional */
    Output: sF   ... selected feature set

    begin
        tS ← {}                                  /* target feature set */
        rF ← fS                                  /* remaining feature set */
        [cP, cC] ← sort(cca(fS))
        tS ← cP1              /* the feature pair cP1 has the lowest
                                 canonical correlation coefficient cC1 */
        rF ← rF − tS
        while rF ≠ ∅ do
            [cP, cC] ← sort(cca(tS ∪ rF))        /* tS is treated as a
                                                    single group feature */
            if cC1 < thc then
                tS ← tS ∪ f1  /* f1 ... the remaining feature with the
                                 lowest correlation cC1 to tS */
                rF ← rF − f1
            else
                break         /* no complementary feature remains */
        sF ← tS
    end
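The scheme translates directly into code. Below is a hedged Python rendering (illustrative, not the authors' implementation): `cca_corr` is an assumed helper returning the first canonical correlation between two column blocks (e.g., the sklearn-based sketch above), `groups` follows the illustrative column-block dictionary from the introduction, and the default threshold value is arbitrary.

```python
# Hedged Python rendering of Algorithm 1 (illustrative, not the authors'
# code). `groups` maps feature names to column slices; `cca_corr(A, B)` is
# assumed to return the first canonical correlation of two column blocks.
from itertools import combinations
import numpy as np

def select_group_features(X, groups, cca_corr, thc=0.9, max_features=None):
    """Greedy unsupervised group feature selection via canonical correlations."""
    # Seed the target set with the least-correlated feature pair.
    a, b = min(combinations(groups, 2),
               key=lambda p: cca_corr(X[:, groups[p[0]]], X[:, groups[p[1]]]))
    selected, remaining = [a, b], set(groups) - {a, b}

    while remaining:
        # Re-initialize the CCA: treat the current target set as one
        # multi-dimensional feature and correlate every candidate with it.
        target = np.hstack([X[:, groups[f]] for f in selected])
        best = min(remaining, key=lambda f: cca_corr(target, X[:, groups[f]]))
        if cca_corr(target, X[:, groups[best]]) >= thc:
            break                             # nothing complementary remains
        selected.append(best)
        remaining.remove(best)
        if max_features and len(selected) >= max_features:
            break                             # optional stopping criterion
    return selected
```

The duplicate `cca_corr` call for the best candidate is kept for readability; in practice its value would be cached from the `min` step.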