Elsevier

Information Sciences

Volume 494, August 2019, Pages 278-293
Multi-view cluster analysis with incomplete data to understand treatment effects

https://doi.org/10.1016/j.ins.2019.04.039

Abstract

Multi-view cluster analysis, as a popular granular computing method, aims to partition sample subjects into consistent clusters across different views in which the subjects are characterized. Frequently, data entries can be missing from some of the views. The latest multi-view co-clustering methods cannot effectively deal with incomplete data, especially when there are mixed patterns of missing values. We propose an enhanced formulation for a family of multi-view co-clustering methods to cope with the missing data problem by introducing an indicator matrix whose elements indicate which data entries are observed and assessing cluster validity only on observed entries. In comparison with common methods that impute missing data in order to use regular multi-view analytics, our approach is less sensitive to imputation uncertainty. In comparison with other state-of-the-art multi-view incomplete clustering methods, our approach remains applicable whether individual entries in a view or an entire view are missing. We first validated the proposed strategy in simulations, and then applied it to a treatment study of opioid dependence, an analysis that would have been impossible with previous methods due to a number of missing-data patterns. Patients in the treatment study were naturally assessed in different feature spaces such as in the pre-, during- and post-treatment time windows. Our algorithm was able to identify subgroups where patients in each group showed similarities in all of the three time windows, thus leading to the identification of pre-treatment (baseline) features predictive of post-treatment outcomes. We found that cue-induced heroin craving predicts adherence to XR-NTX therapy. This finding is consistent with the clinical literature, serving to validate our approach.

Introduction

Granular computing, as defined by Bargiela and Pedrycz in [3], is a computational principle for effectively using granules in data, such as subsets or groups of samples or intervals of parameters, to build an efficient computational model for complex systems with massive quantities of data, information and knowledge. It provides an umbrella to cover any theories, methodologies, techniques, and tools that make use of granules - components or subspaces of a space - in problem solving [42]. It can consist of a structured combination of algorithmic abstraction of data and non-algorithmic, empirical verification of the semantics of these abstractions [3], [43]. Cluster analysis is one such technique: it aims to identify subgroups in a population so that subjects in the same group are more similar to each other than to those in other groups. It has been extensively used in computer vision [29], [45], natural language processing [6], [7], [24] and bioinformatics [21], [26]. In this paper, we propose a method to identify the cluster granules in a patient population in order to analyze treatment study data in which missing values occur. In particular, we take into account the nature of treatment studies, i.e., multiple views of input variables with incomplete data, to model treatment effects.

Multi-view data exist in many real-world applications. For instance, a web page can be represented by the words on the page or by all the hyperlinks pointing to it from other pages. Similarly, an image can be represented by the visual features extracted from it or by the text describing it. Multi-view data analytics aims to make full use of the multiple views of data, and has attracted wide interest in recent years, for example in work on semi-supervised learning with unlabeled data [2], [4], [11] and on unsupervised multi-view data analytics [8], [9], [10], [20], [35]. In this paper, we focus on unsupervised multi-view clustering methods [14], [15], [18], [28], [41], [44], specifically multi-view co-clustering [33], [34], [35]. Consider a dataset in which data matrices have rows representing subjects and columns representing features. The matrices share the same set of subjects, but each matrix has a different set of features. Multi-view co-clustering is a technique to cluster the rows (subjects) consistently across multiple data matrices (sets of features). A family of such methods [33], [34], [35] can find subspaces in each view (rather than using all features in each view) to group subjects consistently across the views. However, existing multi-view co-clustering methods cannot deal with incomplete datasets. Subjects with missing values often need to be removed, or imputation has to be done before clustering. Eliminating data weakens the results by reducing the sample size. On the other hand, imputation may introduce a separate layer of uncertainty, especially when some data are missing at random but others are not.

The issue of missing values is common in real-world applications. Data may be missing at random or due to selection bias. For example, in the study of an asthma education intervention [27], some missing values were caused by participants who forgot to visit the school clinic to fill out the form; others were caused by students whose asthma was too serious for them to visit the school clinic to report. The former values are missing at random and the latter are not. The appropriate strategy for handling missing values depends on why they are missing. If the data are missing at random, researchers either use only the samples with complete variables [40] or impute the missing values [12] from the available data; if data are missing systematically, researchers may have considerable difficulty recognizing and capturing the missing patterns.
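The two conventional strategies mentioned above, complete-case analysis and imputation, can be illustrated with a minimal sketch. The function names `complete_cases` and `mean_impute` and the toy data are hypothetical, and column-mean imputation is used only as the simplest stand-in; it is not the imputation method of [12] or any method evaluated in this paper.

```python
import math

nan = float("nan")  # missing entries are encoded as NaN in this sketch


def complete_cases(X):
    """Keep only the subjects (rows) that have no missing entries."""
    return [row for row in X if not any(math.isnan(x) for x in row)]


def mean_impute(X):
    """Replace each missing entry with the mean of the observed values
    in its column. Assumes every column has at least one observed value."""
    n_cols = len(X[0])
    means = []
    for j in range(n_cols):
        observed = [row[j] for row in X if not math.isnan(row[j])]
        means.append(sum(observed) / len(observed))
    return [[means[j] if math.isnan(x) else x for j, x in enumerate(row)]
            for row in X]


# Toy data: four subjects, two features, two missing entries.
X = [[1.0, 10.0],
     [nan, 20.0],
     [3.0, nan],
     [5.0, 30.0]]

kept = complete_cases(X)   # two of the four subjects are discarded
filled = mean_impute(X)    # no subject is discarded, but values are guessed
```

Complete-case analysis halves the sample in this toy example, while mean imputation keeps every subject at the cost of inserting values that may be wrong, which is especially problematic when the data are not missing at random.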

In longitudinal studies [16], the missing patterns can be very complicated and difficult to handle. A prospective treatment study usually begins with a baseline assessment and follows up through time, and missing values are commonly encountered because study subjects may not be available at all time points. In our heroin dependence treatment study, both random and non-random missing values exist. Because of the mixed missing value patterns, we choose a simple yet effective strategy to handle this problem: introducing an indicator matrix that records which features are observed for which subjects, and then omitting the loss occurring at missing locations while clustering. Since the missing values are unknown, imputation cannot guarantee that the filled-in values are correct, so ignoring the loss at the missing locations should be a better choice.
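The idea of an indicator matrix that switches off the loss at missing locations can be sketched as follows. This is only an illustration under stated assumptions: a plain squared reconstruction error stands in for the actual co-clustering objective (which is not given in this section), missing entries are assumed to be encoded as NaN, and the names `indicator_matrix`, `masked_squared_loss`, `X_hat` and `M` are hypothetical.

```python
import math

nan = float("nan")  # missing entries are encoded as NaN in this sketch


def indicator_matrix(X):
    """1 where an entry is observed, 0 where it is missing."""
    return [[0 if math.isnan(x) else 1 for x in row] for row in X]


def masked_squared_loss(X, X_hat, M):
    """Sum of squared residuals over observed entries only."""
    total = 0.0
    for x_row, h_row, m_row in zip(X, X_hat, M):
        for x, h, m in zip(x_row, h_row, m_row):
            if m:  # entries with m == 0 contribute nothing to the loss
                total += (x - h) ** 2
    return total


# Toy view: two subjects, three features, two missing entries.
X = [[1.0, nan, 3.0],
     [nan, 2.0, 1.0]]
# A hypothetical model approximation of X; its values at the missing
# locations (9.0 and 7.0) are arbitrary and must not affect the loss.
X_hat = [[0.5, 9.0, 3.0],
         [7.0, 2.0, 2.0]]

M = indicator_matrix(X)
loss = masked_squared_loss(X, X_hat, M)  # 0.25 + 0.0 + 0.0 + 1.0
```

Because the loss never reads the missing locations, no value needs to be guessed for them, and whatever the model places there has no influence on the fit.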

In multi-view data, if there are many missing values in different views, then it is useful but challenging to make the different views compensate for each other's missing information to obtain a consistent subject grouping. The most recent multi-view co-clustering methods cannot handle incomplete data that potentially occur in all of the views. Moreover, although imputation methods have been studied for decades, our simulation studies show that even the latest imputation method might not effectively handle mixed missing patterns, and can create another layer of uncertainty in the imputed data. A few recent methods handle incomplete data [22], [30], [31], [37], but they commonly assume either that there is at least one complete view for all the sample subjects or that each subject has one or more complete views, which is not the case in treatment studies (we can have incomplete features in every view).

For each view of the data, all the methods mentioned so far require either having the complete features in a view or having no features in the view. Two kernel-based methods [31], [37] borrowed the idea of the graph Laplacian to complete the incomplete kernel matrix. The partial multi-view clustering (PVC) method [22] reorganized the data into three parts (in the case of two views): subjects with both views complete, subjects with only view 1 complete, and subjects with only view 2 complete. It then projected them into a latent space and finally ran a standard clustering algorithm in the latent space. When multiple incomplete views are present, clustering via weighted nonnegative matrix factorization with L2,1 regularization (the so-called WNMF21), which also introduces an indicator matrix, is the method most similar to ours. That method used only one weight matrix to indicate which instance misses which view, while we introduce an indicator matrix for each view to indicate the observed entries in the corresponding view. Among all the multi-view clustering methods with incomplete data, only ours is not restricted to any specific missing data pattern. In comparison with the common strategy of removing subjects with missing values, our approach can use all observed data in a cluster analysis. In comparison with common methods that impute missing values and then use regular multi-view analytics, our approach is less sensitive to the imputation uncertainty. In comparison with other state-of-the-art multi-view incomplete clustering methods, our approach is applicable to any pattern of missing data. We first validate the proposed algorithm in a simulation study, and then use it in a longitudinal treatment study to better understand the differential responses of heroin users to the medication naltrexone.
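The difference in granularity between one subject-by-view weight matrix and one entry-level indicator matrix per view can be made concrete with a small sketch. Everything here is an illustrative assumption: the toy views, the NaN encoding of missing entries, the helper names, and in particular the rule that a view counts as available for a subject only when all of its entries are observed. The exact weighting used by WNMF21 is not described in this section.

```python
import math

nan = float("nan")  # missing entries are encoded as NaN in this sketch

# Two hypothetical views over the same three subjects (rows).
view1 = [[0.2, nan],
         [1.1, 0.7],
         [nan, nan]]
view2 = [[5.0, 4.0, 3.0],
         [nan, nan, nan],
         [2.0, nan, 1.0]]


def entry_indicator(X):
    """Entry-level indicator for one view: 1 = observed, 0 = missing."""
    return [[0 if math.isnan(x) else 1 for x in row] for row in X]


def view_level_weights(indicators):
    """Subject-by-view matrix: 1 only if the subject's view is complete."""
    n_subjects = len(indicators[0])
    return [[1 if all(M[i]) else 0 for M in indicators]
            for i in range(n_subjects)]


indicators = [entry_indicator(view1), entry_indicator(view2)]
W = view_level_weights(indicators)

# Observed entries usable with entry-level indicators.
entry_total = sum(sum(row) for M in indicators for row in M)
# Observed entries usable when only complete views are kept.
view_total = sum(len(M[i]) * W[i][v]
                 for v, M in enumerate(indicators)
                 for i in range(len(W)))
```

In this toy case the entry-level indicators retain all 8 observed entries, whereas the view-level matrix retains only 5, because partially observed views (subject 1 in view 1, subject 3 in view 2) are treated as missing altogether.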

The main contributions of our work include the following two aspects:

  1. In terms of methodology, we propose an enhanced multi-view co-clustering algorithm that is capable of dealing with complex patterns of incomplete data, and validate its performance by comparing it against other state-of-the-art methods.

  2. In terms of application, we have successfully applied the proposed method to an opioid dependence treatment study and identified meaningful patient subgroups, which would otherwise have been infeasible. By analyzing the study data, we find that features such as changes in craving for heroin in response to cues at baseline could be useful predictors of patient adherence to naltrexone.

The rest of this paper is organized as follows: we describe the longitudinal multi-view data collected in our treatment study in Section 2; an enhanced multi-view co-clustering method is introduced in Section 3 to deal with missing values; Section 4 presents the performance comparison on the synthetic datasets and the statistical analysis results in the case study; we then conclude and discuss in Section 5.

Section snippets

Incomplete data in treatment study

Opioid addiction is a resurgent public health problem in the United States [36]. There exist three Food and Drug Administration (FDA) approved medications for the treatment of opioid use disorder in general and heroin addiction in particular. Two of these options are opioid agonists, which act on the principle of opioid substitution, and one, naltrexone, is an opioid antagonist. Naltrexone is an important treatment option because it is pharmacologically analogous to abstinence. However, the

Multi-view co-clustering with incomplete data

Multi-view co-clustering aims to group subjects in the same way across multiple views and identify the important variables from each view. In other words, multi-view co-clustering groups the subjects into subgroups while selecting, from each view, the variables that drive the grouping. Since the selected variables from different views identify the same subject groups, the characteristics of each group help show the correlation of the variables

Experiments

We validated the proposed approach in both simulation studies and the analysis of the clinical data collected in our heroin treatment study.

Discussion and conclusion

As data acquisition technologies advance, more and more data collected in real-world applications are from heterogeneous sources, resulting in multi-view datasets. Different views may provide complementary information. Cluster analysis in any single view may miss important cluster characteristics from other views. Simply concatenating all views together cannot guarantee finding clusters recognizable in individual views. To exploit such multiple view information, we have adopted the much-needed

Conflict of interest

None.

Acknowledgment

This work was supported by National Institutes of Health (NIH) grants R01DA037349 and K02DA043063, and National Science Foundation (NSF) grants DBI-1356655, CCF-1514357, and IIS-1718738. Jinbo Bi was also supported by NSF grants IIS-1320586, IIS-1407205, and IIS-1447711. An-Li Wang was also supported by NIH grant R00HD84746.

References (45)

  • Z. Xue et al.

    Deep low-rank subspace ensemble for multi-view clustering

    Inf. Sci.

    (2019)
  • X. Zhang et al.

    Robust low-rank kernel multi-view subspace clustering based on the Schatten p-norm and correntropy

    Inf. Sci.

    (2019)
  • H. Abdi et al.

    Multiple correspondence analysis

    Encycl. Meas. Stat.

    (2007)
  • M.F. Balcan et al.

    Co-training and expansion: towards bridging theory and practice

  • A. Bargiela et al.

    The roots of granular computing

    2006 IEEE International Conference on Granular Computing

    (2006)
  • A. Blum et al.

    Combining labeled and unlabeled data with co-training

    Proceedings of the 11th Annual Conference on Computational Learning Theory

    (1998)
  • J. Bolte et al.

    Proximal alternating linearized minimization for nonconvex and nonsmooth problems

    Math. Program.

    (2014)
  • G. Chao

    Discriminative k-means Laplacian clustering

    Neural Process. Lett.

    (2018)
  • G. Chao et al.

    Alternative multi-view maximum entropy discrimination

    IEEE Trans. Neural Netw. Learn. Syst.

    (2016)
  • G. Chao et al.

    Multi-kernel maximum entropy discrimination for multi-view learning

    Intell. Data Anal.

    (2016)
  • N. Eisemann et al.

    Imputation of missing values of tumour stage in population-based cancer registration

    BMC Med. Res. Method.

    (2011)
  • H. Hoffmann

    Unsupervised Learning of Visuomotor Associations

    (2005)