{"title": "Computing Gaussian Mixture Models with EM Using Equivalence Constraints", "book": "Advances in Neural Information Processing Systems", "page_first": 465, "page_last": 472, "abstract": "", "full_text": "Computing Gaussian Mixture Models with EM using Equivalence Constraints

Noam Shental (fenoam@cs.huji.ac.il), Aharon Bar-Hillel (aharonbh@cs.huji.ac.il), Tomer Hertz (tomboy@cs.huji.ac.il), Daphna Weinshall (daphna@cs.huji.ac.il)
Computer Science & Eng., Center for Neural Computation, Hebrew University of Jerusalem, Jerusalem, Israel 91904

Abstract

Density estimation with Gaussian Mixture Models is a popular generative technique used also for clustering. We develop a framework to incorporate side information in the form of equivalence constraints into the model estimation procedure. Equivalence constraints are defined on pairs of data points, indicating whether the points arise from the same source (positive constraints) or from different sources (negative constraints). Such constraints can be gathered automatically in some learning problems, and are a natural form of supervision in others. For the estimation of model parameters we present a closed-form EM procedure which handles positive constraints, and a Generalized EM procedure using a Markov net which handles negative constraints. 
Using publicly available data sets we demonstrate that such side information can lead to considerable improvement in clustering tasks, and that our algorithm is preferable to two other suggested methods using the same type of side information.

1 Introduction

We are used to thinking about learning from labels as supervised learning, and learning without labels as unsupervised learning, where “supervised” implies the need for human intervention. However, in unsupervised learning we are not limited to using data statistics only. Similarly, supervised learning is not limited to using labels. In this work we focus on semi-supervised learning using side information which is not given as labels. More specifically, we use unlabeled data augmented by equivalence constraints between pairs of data points, where the constraints determine whether each pair was generated by the same source or by different sources. Such constraints may be acquired without human intervention in a broad class of problems, and are a natural form of supervision in other scenarios. We show how to incorporate equivalence constraints into the EM algorithm [1], in order to fit a generative Gaussian mixture model to the data.

Density estimation with Gaussian mixture models is a popular generative technique, mostly because it is computationally tractable and often produces good results. However, even when the approach is successful, the underlying assumptions (i.e., that the data is generated by a mixture of Gaussian sources) may not be easily justified. It is therefore important to have additional information which can steer the GMM estimation in the “right” direction. In this paper we propose to incorporate equivalence constraints into an EM parameter estimation algorithm. One added value may be a faster convergence to a high-likelihood solution. 
Much more importantly, the constraints change the GMM likelihood function and may therefore lead the estimation procedure to choose a better solution which would otherwise have been rejected (due to low relative likelihood in the unconstrained GMM density model). Ideally the solution obtained with side information will be more faithful to the desired results. A simple example demonstrating this point is shown in Fig. 1.

Figure 1: Illustrative examples of the importance of equivalence constraints. Left: the data set consists of 2 vertically aligned classes; (a) given no additional information, the EM algorithm identifies two horizontal classes, and this can be shown to be the maximum likelihood solution (with log likelihood of -3500 vs. log likelihood of -2800 for the solution in (b)); (b) additional side information in the form of equivalence constraints changes the probability function, and we get a vertical partition as the most likely solution. Right: the data set consists of two classes with partial overlap; (c) without constraints the most likely solution includes two non-overlapping sources; (d) with constraints the correct model with overlapping classes is retrieved as the most likely solution. In all plots only the class assignment of novel unconstrained points is shown.

Equivalence constraints are binary functions of pairs of points, indicating whether the two points come from the same source or from two different sources. We denote the first case as “is-equivalent” constraints, and the second as “not-equivalent” constraints. As it turns out, “is-equivalent” constraints can be easily incorporated into EM, while “not-equivalent” constraints require heavy-duty inference machinery such as Markov networks. 
We describe the derivations in Section 2.

Our choice to use equivalence constraints is motivated by their relative abundance in real-life applications. In a broad family of applications, equivalence constraints can be obtained without supervision. Perhaps the most important source of unsupervised equivalence constraints is temporal continuity in data; for example, in video indexing, a sequence of faces obtained from successive frames in roughly the same location is likely to contain the same unknown individual. Furthermore, there are several learning applications in which equivalence constraints are the natural form of supervision.

One such scenario occurs when we wish to enhance a retrieval engine using supervision provided by its users. The users may be asked to help annotate the retrieved set of data points, in what may be viewed as “generalized relevance feedback”. The categories given by the users have subjective names that may be inconsistent. Therefore, we can only extract equivalence constraints from the feedback provided by the users. Similar things happen in a “distributed learning” scenario, where supervision is provided by several uncoordinated teachers. In such scenarios, when equivalence constraints are obtained in a supervised manner, our method can be viewed as a semi-supervised learning technique. Most of the work in the field of semi-supervised learning has focused on the case of partial labels augmenting a large unlabeled data set [4, 8, 5].

A few recent papers use side information in the form of equivalence constraints [6, 7, 10]. In [9] equivalence constraints were introduced into the K-means clustering algorithm. 
The algorithm is closely related to our work since it allows the incorporation of both “is-equivalent” and “not-equivalent” constraints. In [3] equivalence constraints were introduced into the complete-linkage clustering algorithm. In comparison with both approaches, we obtain significantly better clustering results by introducing the constraints into the EM algorithm. One reason may be that EM over a Gaussian mixture model is preferable as a clustering algorithm. More importantly, the probabilistic semantics of the EM procedure allows the constraints to be introduced in a principled way, thus overcoming many drawbacks of the heuristic approaches. Comparative results are given in Section 3, demonstrating the very significant advantage of our method over the two alternative constrained clustering algorithms, using a number of data sets from the UCI repository and a large database of facial images [2].

2 Constrained EM: the update rules

A Gaussian mixture model (GMM) is a parametric statistical model which assumes that the data originates from a weighted sum of several Gaussian sources. More formally, a GMM is given by p(x|Θ) = Σ_{l=1}^M α_l p(x|θ_l), where α_l denotes the weight of each Gaussian, θ_l its respective parameters, and M denotes the number of Gaussian sources in the GMM. EM is a widely used method for estimating the parameter set of the model (Θ) using unlabeled data [1]. 
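For reference, the standard (unconstrained) E-step computes per-point responsibilities p(y_i = l | x_i, Θ). A minimal sketch in Python follows; the function and variable names are ours, not from the paper, and scipy is assumed available:

```python
import numpy as np
from scipy.stats import multivariate_normal

def e_step(X, alphas, mus, sigmas):
    """Unconstrained E-step: responsibilities p(y_i = l | x_i, Theta).

    X: (N, d) data matrix; alphas, mus, sigmas: per-source mixture parameters.
    """
    N, M = X.shape[0], len(alphas)
    resp = np.zeros((N, M))
    for l in range(M):
        # weighted Gaussian density of every point under source l
        resp[:, l] = alphas[l] * multivariate_normal.pdf(X, mus[l], sigmas[l])
    resp /= resp.sum(axis=1, keepdims=True)  # normalize over sources
    return resp
```

Each row of the returned matrix sums to one; the constrained variants below change only how these posteriors are computed, not how they are used in the M-step.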
Equivalence constraints modify the “E” (expectation computation) step, such that the sum is taken only over assignments which comply with the given constraints (instead of summing over all possible assignments of data points to sources).

It is important to note that there is a basic difference between “is-equivalent” (positive) and “not-equivalent” (negative) constraints: while positive constraints are transitive (i.e., a group of pairwise “is-equivalent” constraints can be merged using a transitive closure), negative constraints are not. This difference is expressed in the complexity of incorporating each type of constraint into the EM formulation. Therefore, we begin by presenting a formulation for positive constraints (Section 2.1), and then present a different formulation for negative constraints (Section 2.2). A unified formulation for both types of constraints immediately follows, and the details are therefore omitted.

2.1 Incorporating positive constraints

Let a chunklet denote a small subset of data points that are known to belong to a single unknown class. Chunklets may be obtained by applying the transitive closure to the set of “is-equivalent” constraints.

Assume as given a set of unlabeled data points and a set of chunklets. In order to write down the likelihood of a given assignment of points to sources, a probabilistic model of how chunklets are obtained must be specified. We consider two such models:

1. Chunklets are sampled i.i.d., with respect to the weight of their corresponding source (points within each chunklet are also sampled i.i.d.).

2. Data points are sampled i.i.d., without any knowledge about their class membership, and only afterwards chunklets are selected from these points.

The first assumption may be appropriate when chunklets are automatically obtained using temporal continuity. 
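Under either sampling model, the chunklets themselves are formed by the transitive closure mentioned above. A minimal union-find sketch of that closure (all names are our own; unconstrained points come out as chunklets of size one):

```python
def chunklets_from_pairs(n_points, positive_pairs):
    """Merge positively constrained index pairs into chunklets (transitive closure)."""
    parent = list(range(n_points))

    def find(i):
        # follow parent links to the set representative, with path halving
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for a, b in positive_pairs:
        parent[find(a)] = find(b)  # union the two sets

    groups = {}
    for i in range(n_points):
        groups.setdefault(find(i), []).append(i)
    # points touched by no constraint become singleton chunklets
    return list(groups.values())
```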
The second sampling assumption is appropriate when equivalence constraints are obtained using distributed learning. When these sampling assumptions are incorporated into the EM formulation for GMM fitting, different algorithms are obtained: using the first assumption we obtain closed-form update rules for all of the GMM parameters, whereas under the second sampling assumption there is no closed-form solution for the sources' weights. In this section we therefore restrict the discussion to the first sampling assumption only; the discussion of the second sampling assumption, where generalized EM must be used, is omitted.

More specifically, let p(x) = Σ_{l=1}^M α_l p_l(x|θ_l) denote our GMM. Each p_l(x|θ_l) term is a Gaussian parameterized by θ_l = (μ_l, Σ_l) with a mixing coefficient α_l. Let X = {x_i}_{i=1}^N denote the set of all data points. Let {X_j}_{j=1}^L, L ≤ N, denote the distinct chunklets, where each X_j is a set of points x_i such that ∪_{j=1}^L X_j = {x_i}_{i=1}^N (unconstrained data points appear as chunklets of size one). Let Y = {y_i}_{i=1}^N denote the source assignments of the respective data points, and Y_j = {y_1^j, ..., y_{|X_j|}^j} denote the source assignments of the chunklet X_j. Finally, let E_Ω denote the event {Y complies with the constraints}.

The expectation of the log likelihood is:

E[log p(X,Y|Θ_new,E_Ω) | X, Θ_old, E_Ω] = Σ_Y log p(X,Y|Θ_new,E_Ω) · p(Y|X,Θ_old,E_Ω)    (1)

where Σ_Y stands for a summation over all assignments of points to sources: Σ_Y ≡ Σ_{y_1=1}^M ... Σ_{y_N=1}^M. In the following discussion we shall also reorder the sum according to chunklets: Σ_Y ≡ Σ_{Y_1} ... Σ_{Y_L}, where Σ_{Y_j} stands for Σ_{y_1^j} ... Σ_{y_{|X_j|}^j}.

First, using Bayes' rule and the independence of chunklets, we can write

p(Y|X,Θ_old,E_Ω) = p(E_Ω|Y,X,Θ_old) p(Y|X,Θ_old) / Σ_Y p(E_Ω|Y,X,Θ_old) p(Y|X,Θ_old)
                 = Π_{j=1}^L δ_{Y_j} p(Y_j|X_j,Θ_old) / Σ_{Y_1} ... Σ_{Y_L} Π_{j=1}^L δ_{Y_j} p(Y_j|X_j,Θ_old)    (2)

where δ_{Y_j} ≡ δ_{y_1^j,...,y_{|X_j|}^j} equals 1 if all the points in chunklet j have the same label, and 0 otherwise.

Next, using chunklet independence and the independence of points within a chunklet, we see that

p(X,Y|Θ_new,E_Ω) = p(Y|Θ_new,E_Ω) p(X|Y,Θ_new,E_Ω) = Π_{j=1}^L α_{y_j} Π_{i=1}^N p(x_i|y_i,Θ_new)

Hence the log likelihood is:

log p(X,Y|Θ_new,E_Ω) = Σ_{i=1}^N log p(x_i|y_i,Θ_new) + Σ_{j=1}^L log α_{y_j}    (3)

Finally, we substitute (3) and (2) into (1); after some manipulations, we obtain the following expression:

E(LogLikelihood) = Σ_{j=1}^L Σ_{l=1}^M Σ_{x_i∈X_j} log p(x_i|l,Θ_new) · p(Y_j=l|X_j,Θ_old)
                 + Σ_{j=1}^L Σ_{l=1}^M log α_l · p(Y_j=l|X_j,Θ_old)

where the chunklet posterior probability is:

p(Y_j=l|X_j,Θ_old) = α_l^old Π_{x_i∈X_j} p(x_i|y_i^j=l,Θ_old) / Σ_{m=1}^M α_m^old Π_{x_i∈X_j} p(x_i|y_i^j=m,Θ_old)

To find the update rule for each parameter, we differentiate the expected log likelihood above with respect to μ_l, Σ_l and α_l. 
We get the following update rules:

α_l^new = (1/L) Σ_{j=1}^L p(Y_j=l|X_j,Θ_old)

μ_l^new = Σ_{j=1}^L X̄_j p(Y_j=l|X_j,Θ_old) |X_j| / Σ_{j=1}^L p(Y_j=l|X_j,Θ_old) |X_j|

Σ_l^new = Σ_{j=1}^L Σ_{jl}^new p(Y_j=l|X_j,Θ_old) |X_j| / Σ_{j=1}^L p(Y_j=l|X_j,Θ_old) |X_j|

where X̄_j denotes the sample mean of the points in chunklet j, |X_j| denotes the number of points in chunklet j, and Σ_{jl}^new denotes the sample covariance matrix of the jth chunklet around μ_l^new.

As can readily be seen, the update rules above effectively treat each chunklet as a single data point, weighted according to the number of elements in it.

2.2 Incorporating negative constraints

The probabilistic description of a data set using a GMM attaches to each data point two random variables: an observable one and a hidden one. The hidden variable of a point describes its source label, while the data point itself is an observed example from the source. Each pair of observable and hidden variables is assumed to be independent of the other pairs. However, negative equivalence constraints violate this assumption, as dependencies between the hidden variables are introduced.

Specifically, assume we have a group Ω = {(a_i^1, a_i^2)}_{i=1}^P of index pairs corresponding to P pairs of points that are negatively constrained, and define the event E_Ω = {Y complies with the constraints}. Now

p(X,Y|Θ,E_Ω) = p(X|Y,Θ,E_Ω) p(Y|Θ,E_Ω) = p(X|Y,Θ) p(E_Ω|Y) p(Y|Θ) / p(E_Ω|Θ)

Let Z denote the constant p(E_Ω|Θ). Assuming sample independence, it follows that p(X|Y,Θ) · p(Y|Θ) = Π_{i=1}^N p(y_i|Θ) p(x_i|y_i,Θ). By definition p(E_Ω|Y) = 1_{Y∈E_Ω}, hence

p(X,Y|Θ,E_Ω) = (1/Z) 1_{Y∈E_Ω} Π_{i=1}^N p(y_i|Θ) p(x_i|y_i,Θ)    (4)

Expanding 1_{Y∈E_Ω} gives the following expression:

p(X,Y|Θ,E_Ω) = (1/Z) Π_{(a_i^1,a_i^2)} (1 − δ_{y_{a_i^1}, y_{a_i^2}}) Π_{i=1}^N p(y_i|Θ) p(x_i|y_i,Θ)    (5)

As a product of local components, the distribution in (5) can be readily described using a Markov network. The network nodes are the hidden source variables and the observable data point variables. The potential p(x_i|y_i,Θ) connects each observable data point, in a Gaussian manner, to the hidden variable corresponding to the label of its source. Each hidden source node holds an initial potential of p(y_i|Θ), reflecting the prior of the cluster weights. Negative constraints are expressed by edges between hidden variables which prevent them from having the same value: a potential over an edge (a_i^1, a_i^2) is expressed by 1 − δ_{y_{a_i^1}, y_{a_i^2}} (see Fig. 2).

Figure 2: An illustration of the Markov network required for incorporating “not-equivalent” constraints. Data points 1 and 2 have a negative constraint, and so do points 2 and 3.

We derived an EM procedure which maximizes log p(X|Θ,E_Ω) entailed by this distribution. The update rules for μ_l and Σ_l are still

μ_l^new = Σ_{i=1}^N x_i p(y_i=l|X,Θ_old,E_Ω) / Σ_{i=1}^N p(y_i=l|X,Θ_old,E_Ω)

Σ_l^new = Σ_{i=1}^N Σ̂_il p(y_i=l|X,Θ_old,E_Ω) / Σ_{i=1}^N p(y_i=l|X,Θ_old,E_Ω)

where Σ̂_il = (x_i − μ_l^new)(x_i − μ_l^new)^T denotes the sample covariance matrix. 
Note, however, that now the vector of probabilities p(y_i=l|X,Θ_old,E_Ω) is inferred using the net.

The update rule of α_l = p(y_i=l|Θ_new,E_Ω) is more intricate, since this parameter appears in the normalization factor Z in the likelihood expression (4):

Z = p(E_Ω|Θ) = Σ_Y p(Y|Θ) p(E_Ω|Y) = Σ_{y_1} ... Σ_{y_N} Π_{i=1}^N α_{y_i} Π_{(a_i^1,a_i^2)} (1 − δ_{y_{a_i^1}, y_{a_i^2}})    (6)

This factor can be calculated using a net which is similar to the one discussed above but lacks the observable nodes. We use such a net to calculate Z and differentiate it w.r.t. α_l, after which we perform gradient ascent. Alternatively, we can approximate Z by assuming that the pairs of negatively constrained points are disjoint. Under this assumption, Z is reduced to the relatively simple expression Z = (1 − Σ_{i=1}^M α_i^2)^P. This expression for Z can be easily differentiated, and can be used in the Generalized EM scheme. Although the assumption is not valid in most cases, it is a reasonable approximation in sparse networks, and our empirical tests show that it gives good results.

3 Experimental results

In order to evaluate the performance of our EM derivations and compare it to the constrained K-means [9] and constrained complete-linkage [3] algorithms, we tested all 3 algorithms using several data sets from the UCI repository and a real multi-class facial image database [2]. We simulated a “distributed learning” scenario in order to obtain side information. In this scenario equivalence constraints are obtained by employing N uncoordinated teachers. Each teacher is given a random selection of K data points from the data set, and is then asked to partition this set of points into equivalence classes. 
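The teacher procedure just described can be sketched as a small simulation; all names are ours, and we make the simplifying assumption that each teacher partitions its points according to the (hidden) ground-truth labels:

```python
import random

def simulate_teachers(labels, n_teachers, k, seed=0):
    """Simulate N uncoordinated teachers, each partitioning K random points.

    Returns positive and negative constraints as index pairs; each teacher
    is assumed to answer consistently with the true labels.
    """
    rng = random.Random(seed)
    positive, negative = [], []
    for _ in range(n_teachers):
        subset = rng.sample(range(len(labels)), k)
        for idx, a in enumerate(subset):
            for b in subset[idx + 1:]:
                if labels[a] == labels[b]:
                    positive.append((a, b))
                else:
                    negative.append((a, b))
    return positive, negative
```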
The constraints provided by the teachers are gathered and used as equivalence constraints. Each of the 3 algorithms (constrained EM, constrained K-means, and constrained complete linkage) was tested in three modes: (i) the basic algorithm without using any side information, (ii) a constrained version using only positive equivalence constraints, and (iii) a constrained version using both positive and negative equivalence constraints. The results of the 9 algorithmic variants are compared in Fig. 3.

Figure 3: Combined precision and recall scores (f_{1/2}) of several clustering algorithms over 5 data sets from the UCI repository and 1 facial image database (YaleB). The YaleB data set contained a total of 640 images, including 64 frontal-pose images of each of 10 different subjects. In this data set the variability between images of the same person was due mainly to different lighting conditions. Results are presented for the following algorithms: (a) K-means, (b) constrained K-means using only positive constraints, (c) constrained K-means using both positive and negative constraints, (d) complete linkage, (e) complete linkage using positive constraints, (f) complete linkage using both positive and negative constraints, (g) regular EM, (h) EM using positive constraints, and (i) EM using both positive and negative constraints. In each panel results are shown for two cases: using 15% of the data points in constraints (left bars) and 30% of the points constrained (right bars). The results were averaged over 100 realizations of constraints for the UCI data sets, and 1000 realizations for the YaleB data set. Also shown are the names of the data sets used and some of their parameters: N - the size of the data set; C - the number of classes; d - the dimensionality of the data.

In the simulations, the number of constrained points was determined by the number of teachers N and the size of the subset K given to each. 
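The combined score reported in Fig. 3 is the harmonic mean of precision and recall, f_{1/2} = 2PR/(P + R); a one-line helper (naming is ours):

```python
def f_half(precision, recall):
    """Combined score f_{1/2} = 2PR / (P + R), the harmonic mean of P and R."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```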
By controlling the product NK we controlled the amount of side information provided to the learning algorithms. We experimented with two conditions: using “little” side information (approximately 15% of the data points are constrained) and using “much” side information (approximately 30% of the points are constrained). All algorithms were given initial conditions that did not take into account the available equivalence constraints. The results were evaluated using a combined measure of the precision P and recall R scores: f_{1/2} = 2PR / (R + P).

[Figure 3 panels: BALANCE (N=625, d=4, C=3); BOSTON (N=506, d=13, C=3); IONOSPHERE (N=351, d=34, C=2); PROTEIN (N=116, d=20, C=6); WINE (N=168, d=12, C=3); YaleB (N=640, d=60, C=10).]

Several effects can clearly be seen in the results reported in Fig. 3:

• The constrained EM outperformed the two alternative algorithms in almost all cases, while showing substantial improvement over the baseline EM. The one case where constrained complete linkage outperformed all other algorithms involved the “wine” data set. In this data set the data lies in a high-dimensional space (R^12) and therefore the number of model parameters to be estimated by the EM algorithm is relatively large. The EM procedure was not able to fit the data well even with constraints, probably because only 168 data points were available for training.

• Introducing side information in the form of equivalence constraints clearly improves the results of both the K-means and EM algorithms. 
This is not always true for the constrained complete-linkage algorithm. As the amount of side information increases, performance typically improves.

• Most of the improvement can be attributed to the positive constraints, and can be achieved using our closed-form EM version. In most cases adding the negative constraints contributes a small but significant improvement over the results obtained when using only positive constraints.

References

[1] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. JRSSB, 39:1-38, 1977.

[2] A. Georghiades, P. N. Belhumeur, and D. J. Kriegman. From few to many: Generative models for recognition under variable pose and illumination. In IEEE International Conference on Automatic Face and Gesture Recognition, pages 277-284, 2000.

[3] D. Klein, S. D. Kamvar, and C. D. Manning. From instance-level constraints to space-level constraints: Making the most of prior knowledge in data clustering. In ICML, 2002.

[4] D. Miller and S. Uyar. A mixture of experts classifier with learning based on both labelled and unlabelled data. In M. C. Mozer, M. I. Jordan, and T. Petsche, editors, NIPS 9, pages 571-578. MIT Press, 1997.

[5] K. Nigam, A. K. McCallum, S. Thrun, and T. M. Mitchell. Learning to classify text from labeled and unlabeled documents. In Proceedings of AAAI-98, pages 792-799, Madison, US, 1998. AAAI Press, Menlo Park, US.

[6] P. J. Phillips. Support vector machines applied to face recognition. In M. C. Mozer, M. I. Jordan, and T. Petsche, editors, NIPS 11, page 803ff. MIT Press, 1998.

[7] N. Shental, T. Hertz, D. Weinshall, and M. Pavel. Adjustment learning and relevant component analysis. In A. Heyden, G. Sparr, M. Nielsen, and P. Johansen, editors, Computer Vision - ECCV 2002, volume 4, page 776ff, 2002.

[8] M. Szummer and T. Jaakkola. 
Partially labeled classification with Markov random walks. In NIPS, volume 14. MIT Press, 2001.

[9] K. Wagstaff, C. Cardie, S. Rogers, and S. Schroedl. Constrained K-means clustering with background knowledge. In Proc. 18th International Conf. on Machine Learning, pages 577-584. Morgan Kaufmann, San Francisco, CA, 2001.

[10] E. P. Xing, A. Y. Ng, M. I. Jordan, and S. Russell. Distance metric learning with application to clustering with side-information. In Advances in Neural Information Processing Systems, volume 15. MIT Press, 2002.
", "award": [], "sourceid": 2512, "authors": [{"given_name": "Noam", "family_name": "Shental", "institution": null}, {"given_name": "Aharon", "family_name": "Bar-hillel", "institution": null}, {"given_name": "Tomer", "family_name": "Hertz", "institution": null}, {"given_name": "Daphna", "family_name": "Weinshall", "institution": null}]}