Validation of nonlinear PCA

Matthias Scholz, Neural Processing Letters, 2012

Since nonlinear PCA is an unsupervised model, standard techniques for model selection, including cross-validation or, more generally, the use of an independent test set, fail when applied to it.
Instead, the complexity of a nonlinear PCA model can be validated by using the error in missing data estimation as the criterion for model selection. The motivation is that only a model of optimal complexity is able to predict missing values with the highest accuracy.

Keywords: model selection, model complexity, validation



Matlab script for validating a nonlinear PCA model

1) split your data into "traindata" and "testdata"

  % example: Gaussian data (linear data)
  % rows are variables, columns are samples
      traindata = randn(2,1000);
      testdata  = randn(2,1000);
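
For real data, the split can be made by randomly assigning the samples (columns) to the two sets. The lines below are only a minimal sketch: the variable name "data", its placeholder content, and the 50/50 split ratio are assumptions for illustration and are not part of the original script.

  % assumed: "data" holds all samples, one sample per column
      data      = randn(2,2000);            % placeholder data for illustration
      n         = size(data,2);             % number of samples
      perm      = randperm(n);              % random order of the sample indices
      ntrain    = round(0.5*n);             % assumed 50/50 split
      traindata = data(:,perm(1:ntrain));
      testdata  = data(:,perm(ntrain+1:end));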

2) choose a specific model complexity and train the nonlinear PCA model (requires the nonlinear PCA Matlab toolbox)

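   % model complexity is controlled here by the weight-decay coefficient
   weightdecay = 0.001;

   % one component; layers: 1 component unit, 6 hidden units,
   % and one output unit per variable
   [c,net,network]=nlpca(traindata,1,...
     'mode'                    ,'symmetric'   ,...
     'type'                    ,'inverse'     ,...
     'units_per_layer'         ,[ 1 , 6 , size(traindata,1) ],...
     'weight_decay'            ,'yes'         ,...
     'weight_decay_coefficient', weightdecay  ,...
     'max_iteration'           ,5000);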

3) get the validation error based on missing data estimation

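     % randomly set one value per sample (column) to missing:
     % sort works column-wise, so each column of idx is a random
     % permutation of the row indices and contains the value 1 exactly once
        [s,idx]=sort(rand(size(testdata)));
        testdataNaN=testdata;
        testdataNaN(idx==1)=NaN;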
     
     % reconstructing test data including missing values       
        pc_test = nlpca_get_components(net,testdataNaN);
        data_recon=nlpca_get_data(net,pc_test);
        e = (data_recon-testdata).^2;

     % as validation error, we only use the missing data reconstruction error
        testerrorNaN = mean(e(isnan(testdataNaN))); 
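
The masking step can be checked without the toolbox. The following is a small sketch on a hypothetical 3-by-5 matrix X (a name introduced here only for illustration); it confirms that every sample ends up with exactly one missing value.

        X = reshape(1:15,3,5);               % hypothetical data: 3 variables, 5 samples
        [s,idx] = sort(rand(size(X)));       % each column of idx is a random permutation of 1:3
        XNaN = X;
        XNaN(idx==1) = NaN;                  % one randomly placed NaN per column
        assert(all(sum(isnan(XNaN),1)==1));  % every column has exactly one NaN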

Classical train and test error

The classical train and test errors cannot be used for validation, as shown in Scholz (2012).
Use the following lines only for comparison, not for validating the nonlinear PCA model.

    % get classical train error

       data_recon = nlpca_get_data(net);
       e = (data_recon-traindata).^2;
       trainerror = mean(mean(e));

    % get classical test error

       pc_test = nlpca_get_components(net,testdata);
       data_recon = nlpca_get_data(net,pc_test);
       e = (data_recon-testdata).^2;
       testerror = mean(mean(e));
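
To compare how the three errors behave across model complexities, steps 2 and 3 can be repeated for several weight-decay coefficients. The following is a sketch under stated assumptions: the candidate values in "wd_list" are arbitrary, the nlpca calls simply reuse the signatures shown in the script above, and the same masked test set testdataNaN from step 3 is used for every model. Only testerrorNaN is used to pick the model; the classical errors are stored for comparison.

       wd_list = [0.1 0.01 0.001 0.0001];     % assumed candidate coefficients
       trainerror   = zeros(size(wd_list));
       testerror    = zeros(size(wd_list));
       testerrorNaN = zeros(size(wd_list));

       for i = 1:numel(wd_list)
           [c,net,network]=nlpca(traindata,1,...
             'mode'                    ,'symmetric'   ,...
             'type'                    ,'inverse'     ,...
             'units_per_layer'         ,[ 1 , 6 , size(traindata,1) ],...
             'weight_decay'            ,'yes'         ,...
             'weight_decay_coefficient', wd_list(i)   ,...
             'max_iteration'           ,5000);

           % classical train error (comparison only)
           e = (nlpca_get_data(net)-traindata).^2;
           trainerror(i) = mean(e(:));

           % classical test error (comparison only)
           pc_test = nlpca_get_components(net,testdata);
           e = (nlpca_get_data(net,pc_test)-testdata).^2;
           testerror(i) = mean(e(:));

           % validation error: reconstruction error of the missing values only
           pc_test = nlpca_get_components(net,testdataNaN);
           e = (nlpca_get_data(net,pc_test)-testdata).^2;
           testerrorNaN(i) = mean(e(isnan(testdataNaN)));
       end

       % select the coefficient with the smallest missing-data error
       [minerror,best] = min(testerrorNaN);
       best_weightdecay = wd_list(best)

Because each model is trained from scratch, the loop takes roughly as long as training all candidate models one after another.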

Reference

Validation of nonlinear PCA.
Matthias Scholz
Neural Processing Letters, Volume 36, Number 1, Pages 21-30, 2012.

See also:

Matthias Scholz
Nonlinear PCA