Biclustering algorithm on category-sensitive descriptors

main directories:

  • The project's main directory, containing the main files for running the category-sensitive descriptors and any other measures:
      • /mnt/home/kittipat/Dropbox/random_MATLAB_codes/fMRI/mvpa_Haxby_experiment/biclustering_haxby
  • category-sensitive descriptor's main dir:
      • /mnt/home/kittipat/Dropbox/random_MATLAB_codes/toolbox_category-sensitive_measure
    • This dir contains all the base measures used in our paper, for instance Hypo_OAO/OAA, MI_OAO/OAA, IGR_OAO/OAA, etc. This toolbox requires the information-theoretic toolbox.
  • We have 6 subjects, but a T1 image is available for only 5 of them, so we might use only those 5. All the biclustering results for subject #X will be stored in
      • /NAS_II/Projects/MVPA_Language/haxby_data/category_sensitive_biclustering/subjectX
      • or
      • /share/Bot/Research/category_sensitive_biclustering/subjectX

In each subject directory, the files are named in the following format:

subj<subjectID>_<category-sensitive_measure>_<type>_C<desired_#_of_cluster>_<biclustering_algo_name>_<brain_region_name>_<optional>

For example,

subj1_DJS_OAA_C6_CoClust_vtc.mat

subj1_DJS_OAA_C6_CoClust_vtc.txt

subj1_DJS_OAA_C6_CoClust_vtc_category_winner.nii.gz

subj1_DJS_OAA_C6_CoClust_vtc_robust_category_winner.nii.gz
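
The naming format above can be split back into its fields mechanically. The following is a minimal Python sketch, not part of the existing toolbox: the function name `parse_result_filename` and the returned keys are hypothetical, and it assumes that the subject ID is numeric and that the measure, type, algorithm name, and brain region contain no underscores (true for the examples above).

```python
import re

# Assumed pattern for:
# subj<ID>_<measure>_<type>_C<#clusters>_<algo>_<region>[_<optional>].<ext>
RESULT_NAME = re.compile(
    r"^subj(?P<subject>\d+)"
    r"_(?P<measure>[^_]+)"
    r"_(?P<type>[^_]+)"
    r"_C(?P<n_clusters>\d+)"
    r"_(?P<algo>[^_]+)"
    r"_(?P<region>[^_.]+)"
    r"(?:_(?P<optional>[^.]+))?"   # optional suffix may itself contain underscores
    r"\.(?P<ext>.+)$"              # everything after the first dot, e.g. "nii.gz"
)


def parse_result_filename(name):
    """Split a result filename into its fields; raise ValueError if it does not fit."""
    match = RESULT_NAME.match(name)
    if match is None:
        raise ValueError("not a recognized result filename: %r" % name)
    fields = match.groupdict()
    fields["subject"] = int(fields["subject"])
    fields["n_clusters"] = int(fields["n_clusters"])
    return fields


if __name__ == "__main__":
    print(parse_result_filename("subj1_DJS_OAA_C6_CoClust_vtc_robust_category_winner.nii.gz"))
```

The `optional` field is `None` when the filename has no suffix after the brain region.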

Filename.mat contains all the matrices and variables produced by the biclustering algorithm and the post-processing.

Filename.txt contains each label number and its corresponding category names. For instance,

label#1 :Shoes
label#2 :Cat Scrambled
label#3 :House
label#4 :Bottle Chair
label#5 :Scissors
label#6 :Face

indicates that "Shoes" is labeled 1, that "Cat" and "Scrambled" are grouped together under label 2, and so on.
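
A label file in this layout can be read into a label-to-categories mapping with a few lines of Python. This is only a sketch under assumptions: the function names are hypothetical, each category name is assumed to be a single word, and categories sharing a label are assumed to be separated by whitespace, as in the example.

```python
import re

# Matches lines such as "label#2 :Cat Scrambled"
LABEL_LINE = re.compile(r"^label#(\d+)\s*:\s*(.*)$")


def parse_label_lines(lines):
    """Return {label number: [category names]} from the lines of a label file."""
    labels = {}
    for line in lines:
        match = LABEL_LINE.match(line.strip())
        if match is None:
            continue  # skip blank or unrelated lines
        labels[int(match.group(1))] = match.group(2).split()
    return labels


def category_to_label(labels):
    """Invert the mapping: {category name: label number}."""
    return {cat: num for num, cats in labels.items() for cat in cats}


if __name__ == "__main__":
    example = [
        "label#1 :Shoes",
        "label#2 :Cat Scrambled",
        "label#3 :House",
        "label#4 :Bottle Chair",
        "label#5 :Scissors",
        "label#6 :Face",
    ]
    labels = parse_label_lines(example)
    print(labels)
    print(category_to_label(labels))
```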

Filename.nii/nii.gz is in NIfTI format, which can be visualized using FSLView.

How to select subspaces to analyze?

One way is to select row/column/robust winner biclusters.

Need to read more gene expression papers to get some ideas.

Nov 22, 2012: I made the hierarchical biclustering algorithm toolbox available and applied it to subject #1. However, the robust winner is usually sporadic! This made me realize what a "real" biclustering algorithm means: subspace clustering. We are looking for "interesting" subspaces of the input data matrix, which is the direct objective of a subspace clustering algorithm. Such subspaces CANNOT be obtained simply by hierarchical biclustering, because that algorithm takes all the dimensions into account rather than just some subspace.

Nov 23, 2012: I tested non-negative matrix factorization (NMF) for biclustering, from Li and Ngom. Although the clustering topology does not look stable across multiple runs, the results look OK. This method has pros and cons:

  • Pros:
    • We don't need to care about row/column winners, because all we need is the diagonal sub-matrix!!! --> so I will need to change the post-processing code a little bit.
    • About 99% of the time, the number of output clusters matches the desired number.
  • Cons: The clustering topology is still not completely stable, though it is consistent most of the time.
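
To make the "diagonal sub-matrix" idea concrete, here is a small pure-Python sketch of NMF-based biclustering. It is not the Li and Ngom toolbox (which is MATLAB); it swaps in the standard Lee-Seung multiplicative updates for the Frobenius objective, and the function names are hypothetical. The assumption is that row i goes to the factor with the largest entry in row i of W, column j goes to the factor with the largest entry in column j of H, and bicluster k is the block formed by row cluster k and column cluster k. The random initialization is also one reason the topology can change between runs.

```python
import random


def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def frobenius_error(x, w, h):
    """Squared Frobenius norm of X - W H."""
    wh = matmul(w, h)
    return sum((xv - av) ** 2 for xr, ar in zip(x, wh) for xv, av in zip(xr, ar))


def nmf(x, k, n_iter=200, seed=0, eps=1e-9):
    """Factor a non-negative matrix X (n x m) into W (n x k) and H (k x m)."""
    rng = random.Random(seed)
    n, m = len(x), len(x[0])
    w = [[rng.random() + eps for _ in range(k)] for _ in range(n)]
    h = [[rng.random() + eps for _ in range(m)] for _ in range(k)]
    for _ in range(n_iter):
        # H <- H * (W^T X) / (W^T W H), element-wise
        wt = transpose(w)
        num = matmul(wt, x)
        den = matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)] for i in range(k)]
        # W <- W * (X H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num = matmul(x, ht)
        den = matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)] for i in range(n)]
    return w, h


def biclusters_from_factors(w, h):
    """Assign each row and each column to its dominant factor."""
    k = len(h)
    row_labels = [max(range(k), key=lambda c: row[c]) for row in w]
    col_labels = [max(range(k), key=lambda c: h[c][j]) for j in range(len(h[0]))]
    return row_labels, col_labels


if __name__ == "__main__":
    x = [[5, 5, 0, 0], [5, 5, 0, 0], [0, 0, 3, 3], [0, 0, 3, 3]]
    w, h = nmf(x, k=2)
    print(biclusters_from_factors(w, h))
    print(frobenius_error(x, w, h))
```

Because row and column clusters share the same factor index, no separate row/column winner step is needed in this scheme.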

Comparison of biclustering algorithms

This table might be easier to read:

* A clustering method is stable if it gives the same clustering topology every time we run it.
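
Stability in this sense can be checked by comparing the label vectors from repeated runs. The sketch below assumes that "same clustering topology" means the same grouping of items regardless of which label number each group receives; that reading, and the function names, are assumptions rather than something defined in these notes.

```python
def same_partition(labels_a, labels_b):
    """True if two label vectors group the items identically, up to relabeling."""
    if len(labels_a) != len(labels_b):
        return False
    a_to_b = {}
    b_to_a = {}
    for a, b in zip(labels_a, labels_b):
        # Each label in one run must map to exactly one label in the other run.
        if a_to_b.setdefault(a, b) != b or b_to_a.setdefault(b, a) != a:
            return False
    return True


def is_stable(runs):
    """True if every run in the list yields the same partition as the first run."""
    return all(same_partition(runs[0], run) for run in runs[1:])


if __name__ == "__main__":
    run1 = [1, 1, 2, 3]
    run2 = [2, 2, 3, 1]   # same groups, different label numbers
    run3 = [1, 2, 2, 3]   # different groups
    print(is_stable([run1, run2]))
    print(is_stable([run1, run2, run3]))
```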

There are some good resources and classic variants of subspace clustering and biclustering:

  1. "Clustering objects on subsets of attributes" by Jerome Friedman
  2. "Clustering High-Dimensional Data: A Survey on Subspace Clustering, Pattern-Based Clustering, and Correlation Clustering" by Hans-Peter Kriegel, Peer Kröger, and Arthur Zimek
  3. "Model-based subspace clustering" by Peter D. Hoff -- a non-parametric Bayesian approach to selecting the number of clusters
  4. "Comparing Subspace Clusterings" by Anne Patrikainen and Marina Meila
  5. "The Non-Negative Matrix Factorization Toolbox for Biological Data Mining" by Yifeng Li and Alioune Ngom

All the papers are already downloaded in "Download/biclustering_papers".

Preliminary results

Preliminary results part1 -- Visualization results from Sonya.