The Wikipedia Matching Example
Load the English/French text and graph features of the Wikipedia documents, perform manifold matching, plot the matched embedding, and compute the distance correlation and testing power under various nonlinear embedding algorithms.
Contents
- Original Dissimilarities
- Manifold Matching for (TE, TF) without Nonlinear Algorithm
- Manifold Matching for (TE, TF) using Joint Isomap
- Manifold Matching for (TE, TF) using Separate LLE
- Manifold Matching for (TE, TF, GE) without Nonlinear Algorithm
- Manifold Matching for (TE, TF, GE) using Joint Isomap
Original Dissimilarities
To start, load the dissimilarity matrices computed from the English/French text and graph features for later matching. There are four data sources in total, named TE, TF, GE, and GF.
clear; load ('Wiki_Data.mat','TE','TF','GE','GF');
Manifold Matching for (TE, TF) without Nonlinear Algorithm
Set up the parameters: tran=500 is the number of training pairs, numData is the number of datasets to match, dim=10 is the matching dimension, 2*tesn is the number of testing (out-of-sample) points, K is the neighborhood size, and iter=-1 uses classical MDS whenever MDS is involved.
tran=500;numData=2;dim=10;tesn=100;K=20;iter=-1;
options = struct('numData',numData,'permutation',-1,'scaling',1);
The first tran=500 pairs are matched training pairs, the next tesn=100 pairs are matched testing pairs, and the last tesn=100 pairs are unmatched testing pairs.
[dis,~,~]=GetRealData([TE TF],0,tran,tesn,options); %This function re-organizes data for training and testing purpose
First, we do joint MDS matching directly, without nonlinear embedding. Note that the 2*tesn testing points are embedded by an out-of-sample technique.
options = struct('nonlinear',0,'match',0,'neighborSize',K,'jointSelection',0,'numData',numData,'oos',2*tesn,'maxIter',iter);
[sol, dCorr]=ManifoldMatching(dis,dim,options);
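Since iter=-1 selects classical MDS, it may help to see that step in isolation. The following is a minimal sketch of textbook classical MDS on a toy distance matrix, not the code inside ManifoldMatching; the toy points and all variable names are illustrative assumptions. Note that this sketch stores points as rows, whereas sol in this example stores points as columns.

```matlab
% Classical MDS sketch on a toy example (illustrative, not the toolbox code).
X = [0 0; 3 0; 3 4; 0 4];                    % four toy points, one per row
n = size(X, 1);
G0 = X * X';
sq = diag(G0);
D2 = max(bsxfun(@plus, sq, sq') - 2*G0, 0);  % squared pairwise distances

J = eye(n) - ones(n)/n;                      % centering matrix
B = -0.5 * J * D2 * J;                       % double-centered Gram matrix
B = (B + B') / 2;                            % symmetrize against round-off
[V, L] = eig(B);
[lam, order] = sort(diag(L), 'descend');     % largest eigenvalues first

d = 2;                                       % embedding dimension
Y = V(:, order(1:d)) * diag(sqrt(max(lam(1:d), 0)));   % n-by-d coordinates

% Squared distances of the embedding, to compare with the input D2
GY = Y * Y';
sqY = diag(GY);
DY2 = max(bsxfun(@plus, sqY, sqY') - 2*GY, 0);
fprintf('max squared-distance error: %g\n', max(abs(DY2(:) - D2(:))));
```

Because the toy points are planar, a two-dimensional embedding reproduces the input distances up to round-off; with real dissimilarities the leading dim eigenvectors only approximate them.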
After matching, we check the matchedness by connecting each pair with a black line. Some matched patterns are visible in the embedding, but the unmatched pairs also appear well matched, which drags down the testing power.
plotVelocity([sol(:,1:tran) sol(:,tran+2*tesn+1:2*tran+2*tesn)],options.numData);
title('Training Matched Data');
xlim([-0.5 0.5])
ylim([-0.5 0.5])
zlim([-0.5 0.5])
plotVelocity([sol(:,tran+1:tran+tesn) sol(:,2*tran+2*tesn+1:2*tran+3*tesn)],options.numData);
title('Testing Matched Data');
xlim([-0.5 0.5])
ylim([-0.5 0.5])
zlim([-0.5 0.5])
plotVelocity([sol(:,tran+tesn+1:tran+2*tesn) sol(:,2*tran+3*tesn+1:2*tran+4*tesn)],options.numData);
title('Testing Unmatched Data');
xlim([-0.5 0.5])
ylim([-0.5 0.5])
zlim([-0.5 0.5])
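The column ranges in the plotting calls above follow from the data layout: each dataset appears to occupy tran+2*tesn consecutive columns of sol, ordered as training, matched testing, unmatched testing. The sketch below makes that indexing explicit. The layout is inferred from the plotting code rather than from the toolbox documentation, and the helper blockCols and the index variable names are hypothetical.

```matlab
% Column layout of sol, inferred from the plotting calls (an assumption).
tran = 500; tesn = 100; numData = 2;
n = tran + 2*tesn;                          % columns per dataset

% Offsets within one dataset block
trainIdx     = 1:tran;                      % matched training pairs
testMatchIdx = tran+1 : tran+tesn;          % matched testing pairs
testUnmIdx   = tran+tesn+1 : tran+2*tesn;   % unmatched testing pairs

% Hypothetical helper: columns of dataset j (j = 1..numData) for given offsets
blockCols = @(j, idx) (j-1)*n + idx;

cols = blockCols(2, testMatchIdx);          % dataset 2, matched testing pairs
fprintf('dataset 2 matched testing columns: %d to %d\n', cols(1), cols(end));
```

With these definitions, blockCols(2, testMatchIdx) covers the same columns as 2*tran+2*tesn+1:2*tran+3*tesn in the second plotting call.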



We can check the distance correlation of the training data, as well as the matching test power of the testing data at significance level 0.05. Direct matching yields a high correlation, but the testing power is not good enough.
dCorr
p=plotPower(sol,numData,tesn,20);
p(2)
dCorr = 0.9226
ans = 0.4700
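To make "testing power at level 0.05" concrete, here is a minimal sketch of one common way to estimate it. It assumes the null hypothesis is that a pair is matched, the test statistic is the distance between the two embedded points of a pair, and the critical value is an empirical quantile of the matched distances. The internals of plotPower are not shown in this example, so this is not necessarily what it computes, and the distances below are made-up numbers rather than Wikipedia results.

```matlab
% Hypothetical embedded pair distances (NOT the Wikipedia results).
dMatched   = (1:20) / 20;      % distances of 20 matched testing pairs
dUnmatched = (11:30) / 20;     % distances of 20 unmatched testing pairs
alpha = 0.05;                  % significance level

% Critical value: allow round(alpha*n) matched pairs to exceed it
s = sort(dMatched);
nReject = round(alpha * numel(s));
c = s(numel(s) - nReject);

typeI = mean(dMatched > c);    % empirical type-I error on the matched pairs
power = mean(dUnmatched > c);  % fraction of unmatched pairs rejected
fprintf('critical value %.2f, type-I error %.2f, power %.2f\n', c, typeI, power);
```

Under this reading, power is low whenever unmatched pairs land about as close together as matched pairs do, which is the situation described for direct matching above.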
Manifold Matching for (TE, TF) using Joint Isomap
Then we repeat the same procedure using joint Isomap with joint MDS matching.
options = struct('nonlinear',1,'match',0,'neighborSize',K,'jointSelection',1,'numData',numData,'oos',2*tesn,'maxIter',iter);
[sol, dCorr]=ManifoldMatching(dis,dim,options);
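For readers unfamiliar with Isomap, its core step replaces the input dissimilarities with shortest-path (geodesic) distances on a K-nearest-neighbor graph before running MDS. The sketch below shows that step for a single toy dataset. How the joint variant selected by jointSelection=1 combines neighborhoods across datasets is not shown in this example, so the sketch does not attempt it; the toy points and variable names are assumptions.

```matlab
% Isomap geodesic-distance sketch on toy points along a quarter circle.
t = linspace(0, pi/2, 6)';
X = [cos(t) sin(t)];
n = size(X, 1);
G = X * X';
sq = diag(G);
D = sqrt(max(bsxfun(@plus, sq, sq') - 2*G, 0));   % Euclidean distances

K = 2;                                  % toy neighborhood size (the example uses K=20)
W = inf(n);                             % graph weights, Inf means no edge
for i = 1:n
    [~, nbr] = sort(D(i, :));
    nbr = nbr(1:K+1);                   % the point itself plus its K nearest neighbors
    W(i, nbr) = D(i, nbr);
end
W = min(W, W');                         % make the graph undirected

% Floyd-Warshall all-pairs shortest paths
Dg = W;
for k = 1:n
    Dg = min(Dg, bsxfun(@plus, Dg(:, k), Dg(k, :)));
end
fprintf('chord %.4f, graph geodesic %.4f, arc length %.4f\n', D(1, n), Dg(1, n), pi/2);
```

The geodesic distance between the two end points follows the curve, so it exceeds the straight-line chord while staying below the true arc length.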
After matching, we check the matchedness by connecting each pair with a black line. The training data are very well matched. The matched testing pairs are reasonably well matched, while the unmatched testing pairs are clearly unmatched, which improves the testing power.
plotVelocity([sol(:,1:tran) sol(:,tran+2*tesn+1:2*tran+2*tesn)],options.numData);
title('Training Matched Data');
xlim([-2 2])
ylim([-2 4])
zlim([-1.5 1.5])
plotVelocity([sol(:,tran+1:tran+tesn) sol(:,2*tran+2*tesn+1:2*tran+3*tesn)],options.numData);
title('Testing Matched Data');
xlim([-2 2])
ylim([-2 4])
zlim([-1.5 1.5])
plotVelocity([sol(:,tran+tesn+1:tran+2*tesn) sol(:,2*tran+3*tesn+1:2*tran+4*tesn)],options.numData);
title('Testing Unmatched Data');
xlim([-2 2])
ylim([-2 4])
zlim([-1.5 1.5])



Both the distance correlation and the testing power are better than without a nonlinear algorithm.
dCorr
p=plotPower(sol,numData,tesn,20);
p(2)
dCorr = 0.9843
ans = 0.7800
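As background for the dCorr values reported here, the following is a minimal sketch of the standard sample distance correlation between two paired samples, computed from double-centered distance matrices. It is a textbook formula on toy data, with illustrative variable names. The quantity returned by ManifoldMatching may be defined differently; in particular, the three-dataset values later in this example exceed 1, which this normalized pairwise version cannot do.

```matlab
% Sample distance correlation sketch on toy paired data (illustrative only).
x = (1:8)';
y = 2*x + 1;                            % exact linear relation
n = numel(x);

Dx = abs(bsxfun(@minus, x, x'));        % pairwise distances within x
Dy = abs(bsxfun(@minus, y, y'));        % pairwise distances within y

J = eye(n) - ones(n)/n;                 % centering matrix
A = J * Dx * J;                         % double-centered distances
B = J * Dy * J;

dCov2 = mean(A(:) .* B(:));             % squared sample distance covariance
dVarX = mean(A(:) .^ 2);                % squared sample distance variances
dVarY = mean(B(:) .^ 2);
dCor  = sqrt(dCov2 / sqrt(dVarX * dVarY));
fprintf('distance correlation: %.4f\n', dCor);
```

For an exact linear relation the value is 1; values near 1 for the embedded training pairs indicate that the two embedded datasets have nearly proportional distance structure.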
Manifold Matching for (TE, TF) using Separate LLE
Next, we repeat the same procedure using separate LLE with joint MDS matching.
options = struct('nonlinear',2,'match',0,'neighborSize',K,'jointSelection',0,'numData',numData,'oos',2*tesn,'maxIter',iter);
[sol, dCorr]=ManifoldMatching(dis,dim,options);
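For context, LLE first expresses each point as a weighted combination of its nearest neighbors and then seeks an embedding that preserves those weights. The sketch below shows only the standard reconstruction-weight step for one toy point in coordinate form. How the toolbox applies LLE to dissimilarity matrices, and separately per dataset, is not shown in this example; the points, the regularization constant, and the variable names are assumptions.

```matlab
% LLE reconstruction weights for one point (toy sketch, coordinate form).
xi   = [0 0];                           % the point to reconstruct
Nbrs = [-1 0; 1 0; 0 2];                % its k neighbors, one per row
k = size(Nbrs, 1);

Z = bsxfun(@minus, Nbrs, xi);           % neighbors shifted so xi is the origin
C = Z * Z';                             % local Gram (covariance) matrix
C = C + 1e-3 * trace(C) * eye(k);       % regularize, since C can be singular

w = C \ ones(k, 1);                     % solve C*w = 1
w = w / sum(w);                         % enforce that the weights sum to one

recon = w' * Nbrs;                      % reconstruction of xi from its neighbors
fprintf('weights: %s, reconstruction error: %.4f\n', mat2str(w', 3), norm(recon - xi));
```

Here xi lies midway between the first two neighbors, so nearly all of the weight goes to them and the third neighbor receives a weight close to zero.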
After matching, we again check the matchedness by connecting each pair with a black line. The matched and unmatched testing pairs are quite difficult to distinguish.
plotVelocity([sol(:,1:tran) sol(:,tran+2*tesn+1:2*tran+2*tesn)],options.numData);
title('Training Matched Data');
plotVelocity([sol(:,tran+1:tran+tesn) sol(:,2*tran+2*tesn+1:2*tran+3*tesn)],options.numData);
title('Testing Matched Data');
xlim([-2 1])
ylim([-2 1])
zlim([-2 3])
plotVelocity([sol(:,tran+tesn+1:tran+2*tesn) sol(:,2*tran+3*tesn+1:2*tran+4*tesn)],options.numData);
title('Testing Unmatched Data');
xlim([-2 1])
ylim([-2 1])
zlim([-2 3])



Both the distance correlation on the training data and the matching test power are considerably worse than with joint Isomap.
dCorr
p=plotPower(sol,numData,tesn,20);
p(2)
dCorr = 0.8584
ans = 0.5000
Manifold Matching for (TE, TF, GE) without Nonlinear Algorithm
Finally, we show a three-dataset matching example, using the same parameters except that numData is changed to 3.
tran=500;numData=3;dim=10;tesn=100;K=20;iter=-1;
options = struct('numData',numData,'permutation',-1,'scaling',1);
[dis,~,~]=GetRealData([TE TF GE],0,tran,tesn,options); %This function re-organizes data for training and testing purpose
We do joint MDS matching directly without nonlinear embedding, and just check the distance correlation and testing power.
options = struct('nonlinear',0,'match',0,'neighborSize',K,'jointSelection',0,'numData',numData,'oos',2*tesn,'maxIter',iter);
[sol, dCorr]=ManifoldMatching(dis,dim,options);
dCorr
p=plotPower(sol,numData,tesn,20);
p(2)
dCorr = 2.1105
ans = 0.4400
Manifold Matching for (TE, TF, GE) using Joint Isomap
Then we repeat the same procedure using joint Isomap with joint MDS matching, and check the distance correlation and testing power. Both are much better with joint Isomap than without a nonlinear algorithm.
options = struct('nonlinear',1,'match',0,'neighborSize',K,'jointSelection',1,'numData',numData,'oos',2*tesn,'maxIter',iter);
[sol, dCorr]=ManifoldMatching(dis,dim,options);
dCorr
p=plotPower(sol,numData,tesn,20);
p(2)
dCorr = 2.7116
ans = 0.9200
All of the above experiments can be repeated; in our paper, we repeat them 100 times with randomly selected testing data.