Donald Geman, Professor of Mathematics
  Johns Hopkins University
 

CURRENT PROJECTS WITH RECENT PUBLICATIONS

IMAGE INTERPRETATION

Image interpretation, which is effortless and instantaneous for human beings, is the grand challenge of computer vision. The dream is to build a "description machine" which produces a rich semantic description of the underlying scene, including the names and poses of the objects that are present, even "recognizing" other things, such as actions and context. Mathematical frameworks are advanced from time to time, but none is yet widely accepted, and none clearly points the way to closing the gap with natural vision.

Twenty Thousand Questions

We believe that efficient search and evidence integration are indispensable for annotating cluttered scenes with instances of highly deformable objects or from many object categories. Our approach is inspired by two facets of human search: divide-and-conquer querying in games such as "twenty questions" and selective attention in natural vision. As in "twenty questions", the enabling assumption is that interpretations may be grouped into natural subsets with shared features. As with selective attention, we want to shift focus from one location to another with a fixed spatial scope and allow for rapid and adaptive zooming. We then design algorithms which naturally switch from monitoring the scene as a whole to local scrutiny for fine discrimination, and back again, depending on current input and changes in target probabilities as events unfold.

Entropy Pursuit

More specifically, we are investigating a model-based framework for determining what evidence to acquire from multiple scales and locations, and for coherently integrating the evidence by updating likelihoods. The model is Bayesian and is designed for efficient search and scene processing in an information-theoretic sense. One component is a prior distribution on a huge interpretation vector; each bit represents a high-level scene attribute with widely varying degrees of specificity and resolution. The other component is a simple conditional data model for a corresponding family of learned binary classifiers. The classifiers are implemented sequentially and adaptively; the order of execution is determined online, during scene parsing, and is driven by removing as much uncertainty as possible about the overall scene interpretation given the evidence to date.
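
To make the selection principle concrete, here is a minimal sketch of a greedy entropy-pursuit loop over a small finite set of candidate interpretations. It is an illustrative toy rather than the model above: the uniform error rate eps, the finite interpretation set, and all function names are simplifying assumptions introduced here.

    import numpy as np

    def entropy(p):
        p = p[p > 0]
        return -np.sum(p * np.log2(p))

    def entropy_pursuit(prior, tests, run_test, eps=0.1, n_steps=5):
        # prior    : (K,) prior over K candidate interpretations
        # tests    : (T, K) 0/1 matrix; tests[t, k] is the correct answer of
        #            binary test t under interpretation k
        # run_test : callable t -> observed 0/1 answer of (noisy) classifier t
        # eps      : assumed error rate of every classifier
        post = prior.copy()
        asked = set()
        for _ in range(n_steps):
            best_t, best_gain = None, -1.0
            H = entropy(post)
            for t in range(tests.shape[0]):
                if t in asked:
                    continue
                # expected posterior entropy after observing test t
                exp_H = 0.0
                for ans in (0, 1):
                    lik = np.where(tests[t] == ans, 1 - eps, eps)
                    p_ans = np.sum(post * lik)   # predictive prob. of this answer
                    q = post * lik
                    exp_H += p_ans * entropy(q / q.sum())
                if H - exp_H > best_gain:
                    best_t, best_gain = t, H - exp_H
            if best_t is None:
                break                            # all tests have been executed
            lik = np.where(tests[best_t] == run_test(best_t), 1 - eps, eps)
            post = post * lik
            post /= post.sum()
            asked.add(best_t)
        return post

At each step the classifier with the largest expected reduction in posterior entropy is executed, mirroring the uncertainty-driven ordering described above.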

  • F. Fleuret and D. Geman, "Stationary features and cat detection," Journal of Machine Learning Research, 9, 2549-2578, 2008. (pdf)
  • S. Gangaputra and D. Geman, "A design principle for coarse-to-fine classification," Proceedings CVPR, 2, 1877-1884, 2006. (pdf)
  • S. Gangaputra and D. Geman, "The trace model for object detection and tracking," Towards Category-Level Object Recognition, Lecture Notes in Computer Science, 4170, 401-420, 2006. (pdf)
  • H. Sahbi and D. Geman, "A hierarchy of support vector machines for pattern detection," Journal of Machine Learning Research, 7, 2087-2123, 2006. (pdf)
  • Y. Amit, D. Geman and X. Fan, "A coarse-to-fine strategy for multi-class shape detection," IEEE Trans. PAMI, 26, 1606-1621, 2004. (pdf)

COMPUTATIONAL BIOLOGY

Research Agenda

The overall goal of our research program is to develop mathematical and algorithmic foundations for computational systems medicine. We are especially interested in providing diagnoses and treatments tailored to the molecular profile of an individual's disease. The fundamental assumption underlying the research is that diseased cells arise from perturbations in biological networks due to the net effect of interactions among multiple molecular agents: inherited and somatic DNA variants, changes in mRNA, miRNA and protein expression, and epigenetic factors such as DNA methylation. Gigantic amounts of data about these perturbations are being collected by next-generation sequencing and microarray experiments on large patient cohorts, making it possible to discover the driving differences in the abundance and activity of key biomolecules (e.g., mRNA, proteins and metabolites). Analysis of these data will enable the identification of key reporters and biomarkers of network states and uncover molecular signatures of disease.

Molecular Prediction of Disease Phenotypes

Background: A major challenge in computational biology is to extract knowledge about the genetic nature of disease from high-throughput data. A prominent example is genotype-to-phenotype prediction from gene microarray data. The corresponding mRNA counts provide a global view of cellular activity by simultaneously recording the expression levels of thousands of genes. In principle, statistical methods can enhance our understanding of human health and genetic diseases, such as cancer, by detecting the presence of disease, discriminating among cancer sub-types, predicting clinical outcomes, and characterizing disease progression.

Obstacles: Applications to biomedicine, specifically the implications for clinical practice, are widely acknowledged to remain limited. An important obstacle to both biological understanding and clinical applications is the "black box" nature of the decision rules provided by most machine learning approaches. The rules they generate lack the convenience and simplicity desired for extracting underlying biological meaning or transitioning into the clinic. Traditionally, study validation and therapeutic development are based on a small number of biomarkers (whose concentrations can then be assayed with high-resolution methods such as RT-PCR) and require understanding of the role of the genes and gene products in the context of molecular pathways. Achieving biologically relevant results argues for a different strategy.

Relative Expression Analysis: Motivated by these barriers to translational research, we are investigating decision rules based on the ordering among the expression values, searching for characteristic perturbations in this ordering from one phenotype to another. "Relative expression analysis" (RXA) is then based entirely upon the expression ordering of a small number of genes. This provides simple classification rules which involve very few genes and generate specific hypotheses for follow-up studies. Moreover, RXA has the potential to identify genomic "marker interactions" with plausible biological interpretation and possible clinical utility.
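
The simplest instance of RXA is the top-scoring-pair rule studied in the publications below: find the pair of genes whose relative ordering best separates the two phenotypes, then classify a new profile from that single comparison. The following sketch (an exhaustive search over gene pairs; the variable names are ours) illustrates the idea.

    import numpy as np

    def top_scoring_pair(X, y):
        # X : (n_samples, n_genes) expression matrix; y : labels in {0, 1}
        # Returns (i, j, p0, p1), where p_c = P(X_i < X_j | class c) on the data.
        X0, X1 = X[y == 0], X[y == 1]
        best, best_score = None, -1.0
        n_genes = X.shape[1]
        for i in range(n_genes):
            for j in range(i + 1, n_genes):
                p0 = np.mean(X0[:, i] < X0[:, j])
                p1 = np.mean(X1[:, i] < X1[:, j])
                if abs(p0 - p1) > best_score:
                    best_score, best = abs(p0 - p1), (i, j, p0, p1)
        return best

    def classify(x, i, j, p0, p1):
        # Classify one expression profile x from the single comparison x[i] < x[j],
        # voting for the class under which the observed ordering was more frequent.
        obs = x[i] < x[j]
        return int((p1 if obs else 1 - p1) >= (p0 if obs else 1 - p0))

The k-TSP rule of the Tan et al. paper cited below extends this by a majority vote over several disjoint top-scoring pairs.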

  • J.A. Eddy, J. Sung, D. Geman and N.D. Price, "Relative expression analysis for molecular diagnosis and prognosis," Technology in Cancer Research and Treatment, 9, 149-159, 2010. (pdf)
  • L.B. Edelman, G. Goia, D. Geman, W. Zhang and N.D. Price, "Two-transcript gene expression classifiers in the diagnosis and prognosis of human diseases," BMC Genomics, 10:583, 2009. (pdf)
  • X. Lin, B. Afsari, L. Marchionni, L. Cope, G. Parmigiani, D. Naiman and D. Geman, "The ordering of expression among a few genes can provide simple cancer biomarkers and signal BRCA1 mutations," BMC Bioinformatics, 10:256, 2009. (pdf)
  • L. Xu, A.C. Tan, R.L. Winslow and D. Geman, "Merging microarray data from separate breast cancer studies provides a robust prognostic signature," BMC Bioinformatics, 9:125, 2008. (pdf)
  • L. Xu, A.C. Tan, D. Naiman, D. Geman and R. Winslow, "Robust prostate cancer marker genes emerge from direct integration of inter-study microarray data," Bioinformatics, 21, 3905-3911, 2005. (pdf)
  • A.C. Tan, D. Naiman, L. Xu, R. Winslow and D. Geman, "Simple decision rules for classifying human cancers from gene expression profiles," Bioinformatics, 21, 3896-3904, 2005. (pdf)
  • D. Geman, C. d'Avignon, D. Naiman and R. Winslow, "Classifying gene expression profiles from pairwise mRNA comparisons," Statist. Appl. in Genetics and Molecular Biology, 3, 2004. (pdf)

Protein Classification

Control of the packing of chromatin is essential to the regulation of gene expression. Recent studies have demonstrated the importance of covalent modifications of histone N-terminal tails in this control. Histones undergo methylation or acetylation reactions, which are performed by multi-protein complexes. Some of these complexes have been shown to play important roles in various cancers: in leukemia for methylation reactions and in prostate, gastric and colorectal cancers for acetylation reactions. Among the protein domains found in most histone-recognition complexes are the PHD fingers, which serve as readers of histone lysine modifications. With hundreds of PHD finger sequences available but only twenty-odd known substrates and structures, we performed a sequence-based analysis of PHD fingers in order to classify them into families. The analysis relies on the correlated evolution of residues within PHD fingers, obtained by analyzing a multiple sequence alignment of over 900 PHD finger sequences. Twenty-seven "hot spots" were detected in the alignment, as position:residue-type pairs, and were used to compare and classify the sequences, yielding four families. Comparison of the families with existing functional and structural data confirmed the grouping and suggested that the sequence clustering provides information about the preferred substrate of each PHD finger. The classification could thus prove useful for estimating the substrate preferences of newly identified PHD fingers from their amino-acid sequences alone.
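
As a rough illustration of such a pipeline (a heuristic stand-in, not the correlated-evolution analysis used in the paper), the sketch below encodes each aligned sequence by the presence or absence of candidate position:residue pairs and then clusters the resulting binary profiles into families; the frequency thresholds and function names are illustrative assumptions.

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    def hotspot_features(msa, min_freq=0.2, max_freq=0.8):
        # msa : list of equal-length aligned sequences ('-' = gap).
        # Pairs with intermediate frequency are kept as candidate "hot spots",
        # since residues present in nearly all sequences (or almost none)
        # cannot separate families.
        L = len(msa[0])
        hotspots = []
        for p in range(L):
            col = np.array([s[p] for s in msa])
            for aa in sorted(set(col) - {'-'}):
                if min_freq <= np.mean(col == aa) <= max_freq:
                    hotspots.append((p, aa))
        F = np.array([[1.0 if s[p] == aa else 0.0 for (p, aa) in hotspots]
                      for s in msa])
        return hotspots, F

    # Usage, given aligned PHD finger sequences in `msa`:
    # hotspots, F = hotspot_features(msa)
    # families = fcluster(linkage(F, method='average'), t=4, criterion='maxclust')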

  • P. Slama and D. Geman, "Identification of family-determining residues in PHD fingers," Nucleic Acids Research, 1-14, 2010. (pdf)

Modeling Cell Signaling Networks

Very high-dimensional data sets are ubiquitous in computational biology and raise serious challenges in statistical learning. The difficulties are especially pronounced when the objective is to uncover a complex statistical dependency structure within a large number of variables and yet the number of samples available for learning is relatively limited. This situation reaches extremes in attempting to reverse engineer transcriptional and signaling networks. When measured against the complexity of the systems being studied, the amount of data available for modeling is minuscule. Simply penalizing model complexity often biases estimation towards the absence of interactions. Consequently, discovering rich structure from small samples is nearly impossible unless the search is severely restricted, in which case statistical learning may become feasible due to variance reduction.

Protein signaling networks play a central role in transcriptional regulation and the etiology of many diseases. Statistical methods, particularly Bayesian networks, have been widely used to model cell signaling, mostly for model organisms and with a focus on uncovering connectivity rather than inferring aberrations. Extensions to mammalian systems have not yielded compelling results, likely due to greatly increased complexity and limited proteomic measurements in vivo. We have proposed a comprehensive statistical model that is anchored to a predefined core topology, has limited complexity due to parameter sharing and uses microarray data of mRNA transcripts as the only observable components of signaling. Specifically, we accounted for cell heterogeneity with a multi-level process, representing signaling as a Bayesian network at the cell level, modeling measurements as ensemble averages at the tissue level and incorporating patient-to-patient differences at the population level. Motivated by the goal of identifying individual protein abnormalities as potential therapeutic targets, we applied our method to the RAS-RAF network using a breast cancer study with 118 patients. We demonstrated rigorous statistical inference, established reproducibility through simulations, and showed that receptor status can be recovered from available microarray data. Current work focuses on specific applications to cell signaling in breast cancer.
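
A toy forward simulation of the three levels (cell, tissue, population) is sketched below. The linear-chain topology, the shared conditional probabilities and all numerical values are illustrative stand-ins for the RAS-RAF model described above, not its actual parameters.

    import numpy as np

    # Toy core topology: a linear chain in which each protein's activity
    # depends only on its single upstream parent.
    PARENTS = {'RAS': None, 'RAF': 'RAS', 'MEK': 'RAF', 'ERK': 'MEK'}
    ORDER = ['RAS', 'RAF', 'MEK', 'ERK']

    def sample_tissue(n_cells, p_root, p_on, p_off, rng):
        # One tissue sample: binary signaling states are drawn per cell from
        # the network (cell level) but reported only as ensemble averages,
        # the microarray-like observable layer (tissue level).
        counts = dict.fromkeys(ORDER, 0)
        for _ in range(n_cells):
            state = {}
            for g in ORDER:
                parent = PARENTS[g]
                p = p_root if parent is None else (p_on if state[parent] else p_off)
                state[g] = rng.random() < p
                counts[g] += state[g]
        return {g: counts[g] / n_cells for g in ORDER}

    rng = np.random.default_rng(0)
    for patient in range(3):
        # population level: each patient draws a different root activity
        root_activity = rng.beta(2, 2)
        print(patient, sample_tissue(5000, root_activity, p_on=0.9, p_off=0.1, rng=rng))

Inference in the actual model runs in the opposite direction, recovering cell-level network parameters from the tissue-level averages.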

  • E. Yoruk, M. Ochs, D. Geman and L. Younes, "A comprehensive statistical model for cell signaling," IEEE/ACM Trans. Computational Biology and Bioinformatics, 592-606, 2011. (pdf)

Molecular Network Regulation

A powerful way to separate signal from noise in biology is to convert molecular data from individual genes or proteins into an analysis of comparative biological network behaviors. However, most existing analyses do not take into account the combinatorial nature of gene interactions within the network. Together with Nathan Price, James Eddy and Leroy Hood, I am exploring a new method for analyzing gene interactions based entirely on the relative expression values of participating genes, that is, on the ordering of expression within pathway profiles. Our approach provides quantitative measures of how network rankings differ either among networks for a selected phenotype or among phenotypes for a selected network. In examining cancer subtypes and neurological disorders, we have identified networks that are tightly and loosely regulated, as defined by the level of conservation of transcript ordering, and observed a strong trend toward looser network regulation in more malignant phenotypes and later stages of disease. We have also demonstrated that variably expressed networks represent robust differences between disease states.
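
In outline, the computation behind these comparisons can be sketched as follows: a phenotype's "rank template" records the majority ordering of each gene pair in the network, and each sample is scored by the fraction of its pairwise orderings that match the template. This is a minimal sketch; the function names and the simple majority rule are our simplifications.

    import numpy as np
    from itertools import combinations

    def rank_template(X):
        # Majority ordering of each gene pair within one phenotype.
        # X : (n_samples, n_genes) expression restricted to the network's genes.
        pairs = list(combinations(range(X.shape[1]), 2))
        template = np.array([np.mean(X[:, i] < X[:, j]) > 0.5 for i, j in pairs])
        return pairs, template

    def rank_matching_scores(X, pairs, template):
        # Fraction of pairwise orderings in each sample agreeing with the
        # template; averaging over a phenotype gives a rank-conservation
        # index (near 1 = tightly regulated, lower = variably expressed).
        observed = np.stack([X[:, i] < X[:, j] for i, j in pairs], axis=1)
        return np.mean(observed == template, axis=1)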

  • J.A. Eddy, L. Hood, N.D. Price and D. Geman, "Identifying tightly regulated and variably expressed networks by differential rank conservation," PLoS Computational Biology, 2010. (pdf)

OTHER RECENT PROJECTS

TWENTY QUESTIONS THEORY

Together with Gilles Blanchard and students at the Center for Imaging Science, I have explored the theoretical foundations of a "twenty questions" approach to pattern recognition. The object of analysis is the computational process itself rather than probability distributions (generative modeling) or decision boundaries (predictive learning). Our formulation is motivated by applications to scene interpretation in which there are a great many possible explanations for the data, one ("background") is statistically dominant, and it is imperative to restrict intensive computation to genuinely ambiguous regions. We consider sequential testing strategies in which decisions are made iteratively, based on past outcomes, about which test to perform next and when to stop testing. The key structure is a hierarchy of tests, one binary test ("question") for each cell in a tree-structured, recursive partitioning of the space of hypotheses. A central role in the mathematical analysis is played by the ratio of the "cost" of a test, meaning the amount of computation necessary for execution, to the "selectivity" of the test, meaning the probability of correctly declaring background (i.e., not hallucinating). Our main result is that, under a natural model for total computation and certain statistical assumptions on the joint distribution of the tests, coarse-to-fine strategies are optimal whenever the cost-to-selectivity ratio of the test at every cell is less than the sum of the cost-to-selectivity ratios for the children of the cell. As might be expected, under mild assumptions, good designs exhibit a steady progression from broad scope coupled with low power to high power coupled with dedication to specific explanations.
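
In symbols, writing c(A) for the cost of the test attached to cell A, s(A) for its selectivity, and B_1, ..., B_m for the children of A in the hierarchy, the optimality condition stated above reads

    c(A)/s(A) < c(B_1)/s(B_1) + ... + c(B_m)/s(B_m)   for every cell A,

i.e., the coarse question at A is worth asking first whenever it is cheap relative to how reliably it dismisses background, compared with its refinements combined.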

  • G. Blanchard and D. Geman, "Sequential testing designs for pattern recognition," Annals of Statistics, 33, 1155-1202, 2005. (pdf)

MENTAL IMAGE MATCHING

The standard problem in image retrieval is "query-by-visual-example": the "query image" resides in an image database and is matched by the system with other images. Together with Marin Ferecatu and researchers in the IMEDIA project at INRIA, I am considering a different scenario in which the query image or, more generally, the target category, resides only in the mind of the user as a set of subjective visual patterns, psychological impressions, or "mental pictures." Since image databases available today are often unstructured and lack reliable semantic annotations, it is then not obvious how to initiate a search session; this is the "page zero problem." We propose a new statistical framework based on relevance feedback to locate an instance of a semantic category in an unstructured image database with no semantic annotation. A search session is initiated from a random sample of images. At each retrieval round, the user is asked to select one image from among a set of displayed images -- the one that is closest in his opinion to the target class. The matching is then "mental." Performance is measured by the number of iterations necessary to display an image which satisfies the user, at which point standard techniques can be employed to display other instances. We have developed a Bayesian formulation which scales to large databases. The two key components are a response model, which accounts for the user's subjective perception of similarity, and a display algorithm, which seeks to maximize the flow of information. Experiments with real users and two databases of 20,000 and 60,000 images demonstrate the efficiency of the search process.
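
The posterior update at the heart of such a session can be sketched as follows; the softmax response model and the parameter beta are illustrative assumptions standing in for the calibrated user model of the paper.

    import numpy as np

    def update_posterior(post, display, choice, dist, beta=1.0):
        # post    : (N,) posterior that each database image is the target
        # display : indices of the images shown this round
        # choice  : position within `display` of the image the user picked
        # dist    : (N, N) pairwise distances in some image feature space
        # beta    : sharpness of the assumed softmax response model
        d = dist[np.asarray(display)]          # (len(display), N)
        lik = np.exp(-beta * d)
        lik /= lik.sum(axis=0)                 # P(user picks each shown image | target)
        post = post * lik[choice]
        return post / post.sum()

A display algorithm in this spirit would then choose the next set of images to maximize the expected information carried by the user's choice under the same response model.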

  • M. Ferecatu and D. Geman, "A statistical framework for image category search from a mental picture," IEEE Trans. PAMI, 31, 1087-1101, 2009. (pdf)
  • Y. Fang and D. Geman, "Experiments in mental face retrieval," Proceedings AVBPA, Lecture Notes in Computer Science, 637-646, 2005. (Best Student Paper Award) (pdf)