CURRENT PROJECTS WITH RECENT PUBLICATIONS
IMAGE INTERPRETATION
Image interpretation, which is effortless and instantaneous for human
beings, is the grand challenge of computer vision. The dream is to
build a "description machine" which produces a rich semantic
description of the underlying scene, including the names and poses of
the objects that are present, even "recognizing" other things, such as
actions and context. Mathematical frameworks are advanced from time
to time, but none is yet widely accepted, and none clearly points the
way to closing the gap with natural vision.
Twenty Thousand Questions
We believe that efficient search and evidence integration are
indispensable for annotating cluttered scenes with instances of highly
deformable objects or from many object categories. Our approach is
inspired by two facets of human search: divide-and-conquer querying in
playing games like "twenty questions" and selective attention in
natural vision. As in "twenty questions", the enabling assumption is
that interpretations may be grouped into natural subsets with shared
features. As with selective attention, we want to shift focus
from one location to another with a fixed spatial scope and allow for
rapid and adaptive zooming. We then design algorithms which
naturally switch from monitoring the scene as a whole to local
scrutiny for fine discrimination, and back again depending on current
input and changes in target probabilities as events unfold.
Entropy Pursuit
More specifically, we are investigating a model-based framework for
determining what evidence to acquire from multiple scales and
locations, and for coherently integrating the evidence by updating
likelihoods. The model is Bayesian and is designed for efficient
search and scene processing in an information-theoretic sense. One
component is a prior distribution on a huge interpretation vector;
each bit represents a high-level scene attribute with widely varying
degrees of specificity and resolution. The other component is a
simple conditional data model for a corresponding family of learned
binary classifiers. The classifiers are implemented sequentially and
adaptively; the order of execution is determined online, during
scene parsing, and is driven by removing as much uncertainty as
possible about the overall scene interpretation given the evidence to
date.
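One greedy step of this uncertainty-driven ordering can be sketched in a few lines. The sketch below is illustrative only, not the published implementation: the tiny hypothesis space and the noiseless tests are simplifying assumptions, under which the most informative binary test is the one whose firing probability under the current posterior is closest to one half.

```python
import math

def entropy(p):
    """Shannon entropy (bits) of a Bernoulli probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def next_test(posterior, tests):
    """Pick the test whose outcome removes the most uncertainty.

    posterior: dict hypothesis -> probability.
    tests: dict test name -> set of hypotheses on which the test fires
           (tests are assumed noiseless in this sketch).
    For a noiseless binary test, the expected entropy reduction equals
    the entropy of its firing probability under the current posterior.
    """
    def gain(fires):
        p = sum(q for h, q in posterior.items() if h in fires)
        return entropy(p)
    return max(tests, key=lambda name: gain(tests[name]))

def update(posterior, fires, outcome):
    """Condition the posterior on a noiseless test outcome."""
    kept = {h: q for h, q in posterior.items() if (h in fires) == outcome}
    z = sum(kept.values())
    return {h: q / z for h, q in kept.items()}
```

For example, with posterior {'cat': 0.5, 'dog': 0.3, 'background': 0.2} and tests {'animal': {'cat', 'dog'}, 'cat': {'cat'}}, the 'cat' test is selected first because its firing probability (0.5) carries a full bit of information, while the 'animal' test (0.8) carries less.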
- F. Fleuret and D. Geman, "Stationary features and cat detection," Journal of Machine Learning Research, 9, 2549-2578, 2008. (pdf)
- S. Gangaputra and D. Geman, "A design principle for coarse-to-fine classification," Proceedings CVPR 2006, 2, 1877-1884, 2006. (pdf)
- S. Gangaputra and D. Geman, "The trace model for object detection and tracking," Towards Category-Level Object Recognition, Lecture Notes in Computer Science, 4170, 401-420, 2006. (pdf)
- H. Sahbi and D. Geman, "A hierarchy of support vector machines for pattern detection," Journal of Machine Learning Research, 7, 2087-2123, 2006. (pdf)
- Y. Amit, D. Geman and X. Fan, "A coarse-to-fine strategy for multi-class shape detection," IEEE Trans. PAMI, 28, 1606-1621, 2004. (pdf)
COMPUTATIONAL BIOLOGY
Research Agenda
The overall goal of our research program is to develop mathematical
and algorithmic foundations for computational systems medicine. We are
especially interested in providing diagnoses and treatments tailored
to the molecular profile of an individual's disease. The fundamental
assumption underlying the research is that diseased cells arise from
perturbations in biological networks due to the net effect of
interactions among multiple molecular agents, inherited and somatic
DNA variants, changes in mRNA, miRNA and protein expression, and
epigenetic factors such as DNA methylation. Gigantic amounts of data
about these perturbations are being collected by next-generation
sequencing and microarray experiments on large patient cohorts, making
it possible to discover the driving differences in
the abundance and activity of key biomolecules (e.g., mRNA, proteins
and metabolites). Analysis of these data will enable the identification
of key reporters and biomarkers of network states and uncover molecular
signatures of disease.
Molecular Prediction of Disease Phenotypes
Background: A major challenge in computational biology is to extract
knowledge about the genetic nature of disease from high-throughput
data. A prominent example is genotype-to-phenotype prediction from
gene microarray data. The corresponding mRNA counts provide a global
view of cellular activity by simultaneously recording the expression
levels of thousands of genes. In principle, statistical methods can
enhance our understanding of human health and genetic diseases, such
as cancer, by detecting the presence of disease, discriminating among
cancer sub-types, predicting clinical outcomes, and characterizing
disease progression.
Obstacles: Applications to biomedicine, specifically the implications
for clinical practice, are widely acknowledged to remain limited. An
important obstacle to both biological understanding and clinical
applications is the "black box" nature of the decision rules provided
by most machine learning approaches. The rules they generate lack the
convenience and simplicity desired for extracting underlying
biological meaning or transitioning into the clinic. Traditionally,
study validation and therapeutic development are based on a small
number of biomarkers (whose concentrations can then be assayed with
high-resolution methods such as RT-PCR) and require understanding of
the role of the genes and gene products in the context of molecular
pathways. Achieving biologically relevant results argues for a
different strategy.
Relative Expression Analysis: Motivated by these barriers to
translational research, we are investigating decision rules based on
the ordering among the expression values, searching for characteristic
perturbations in this ordering from one phenotype to another.
"Relative expression analysis" (RXA) is then based entirely upon the
expression ordering of a small number of genes. This provides simple
classification rules which involve very few genes and generate
specific hypotheses for follow-up studies. Moreover, RXA has the
potential to identify genomic "marker interactions" with plausible
biological interpretation and possible clinical utility.
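The simplest instance of RXA, in the spirit of the top-scoring-pair rule, can be sketched as follows. The data layout and function names are illustrative assumptions, not the published implementation: a classifier is built from the single gene pair whose expression ordering flips most reliably between the two phenotypes.

```python
from itertools import combinations

def frac_less(X, i, j):
    """Fraction of expression profiles in X with gene i below gene j."""
    return sum(x[i] < x[j] for x in X) / len(X)

def top_scoring_pair(X0, X1):
    """Find the gene pair whose expression ordering best separates two
    phenotypes. X0, X1: lists of per-gene expression profiles for
    phenotype 0 and phenotype 1. Returns ((i, j), score), where score
    is |P(g_i < g_j | 0) - P(g_i < g_j | 1)|."""
    n_genes = len(X0[0])
    return max(
        (((i, j), abs(frac_less(X0, i, j) - frac_less(X1, i, j)))
         for i, j in combinations(range(n_genes), 2)),
        key=lambda pair_score: pair_score[1])

def classify(x, pair, X0, X1):
    """Vote for the phenotype under which the observed ordering of the
    selected pair was more frequent in the training data."""
    i, j = pair
    p0, p1 = frac_less(X0, i, j), frac_less(X1, i, j)
    if x[i] < x[j]:
        return 0 if p0 >= p1 else 1
    return 0 if (1 - p0) >= (1 - p1) else 1
```

The resulting rule involves only two genes and a single comparison, which is precisely what makes follow-up assays and clinical translation tractable.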
- J.A. Eddy, J. Sung, D. Geman and N.D. Price, "Relative expression analysis for molecular diagnosis and prognosis," Technology in Cancer Research and Treatment, 9, 149-159, 2010. (pdf)
- L.B. Edelman, G. Goia, D. Geman, W. Zhang and N.D. Price, "Two-transcript gene expression classifiers in the diagnosis and prognosis of human diseases," BMC Genomics, 10:583, 2009. (pdf)
- X. Lin, B. Afsari, L. Marchionni, L. Cope, G. Parmigiani, D. Naiman and D. Geman, "The ordering of expression among a few genes can provide simple cancer biomarkers and signal BRCA1 mutations," BMC Bioinformatics, 10:256, 2009. (pdf)
- L. Xu, A.C. Tan, R.L. Winslow and D. Geman, "Merging microarray data from separate breast cancer studies provides a robust prognostic signature," BMC Bioinformatics, 9:125, 2008. (pdf)
- L. Xu, A.C. Tan, D. Naiman, D. Geman and R.L. Winslow, "Robust prostate cancer marker genes emerge from direct integration of inter-study microarray data," Bioinformatics, 21, 3905-3911, 2005. (pdf)
- A.C. Tan, D. Naiman, L. Xu, R.L. Winslow and D. Geman, "Simple decision rules for classifying human cancers from gene expression profiles," Bioinformatics, 21, 3896-3904, 2005. (pdf)
- D. Geman, C. d'Avignon, D. Naiman and R.L. Winslow, "Classifying gene expression profiles from pairwise mRNA comparisons," Statistical Applications in Genetics and Molecular Biology, 3, 2004. (pdf)
Protein Classification
Control of the packing of chromatin is essential to the
regulation of gene expression. Recent studies have demonstrated the
importance of covalent modifications of histone N-terminal tails in
this control. Histones undergo methylation or acetylation reactions,
which are performed by multi-protein complexes. Some of these
complexes have been shown to play important roles in various cancers:
in leukemia for methylation reactions and in prostate, gastric and
colorectal cancers for acetylation reactions. Among the protein
domains which are found in most histone-recognition complexes are the
PHD fingers, which serve as histone lysine-modification readers. With
hundreds of sequences for PHD fingers available but only twenty-odd
known substrates and structures, we performed a sequence-based
analysis of PHD fingers in order to classify them into different
families. The analysis relies on the correlated evolution of residues
within PHD fingers, which was obtained by analyzing a multiple
sequence alignment of over 900 PHD finger sequences. Twenty-seven
'hot-spots' were detected in the alignment, as position:residue type
pairs, and were used to compare and classify the sequences. Four
families were finally obtained. Comparing these families with
existing functional and structural data confirmed the grouping and
suggested that the sequence clustering could indeed indicate the
preferred substrate for each PHD finger. The classification could
thus prove useful for estimating the substrate preferences of newly
identified PHD fingers from their amino-acid sequences alone.
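Once the hot-spots are in hand, turning aligned sequences into comparable descriptors is straightforward. The encoding below is an illustrative assumption (the published analysis derives the hot-spots themselves from correlated evolution): each sequence becomes a binary vector over position:residue pairs, and a simple mismatch count serves as the dissimilarity for grouping sequences into families.

```python
def hotspot_features(aligned_seqs, hotspots):
    """Encode each row of a multiple sequence alignment as a binary
    vector: one bit per (column, residue) hot-spot, set when the
    sequence carries that residue at that alignment column.

    aligned_seqs: equal-length strings (rows of the MSA).
    hotspots:     list of (column_index, residue) pairs.
    """
    return [[int(seq[pos] == res) for pos, res in hotspots]
            for seq in aligned_seqs]

def hamming(u, v):
    """Number of hot-spot bits on which two sequences disagree; a
    simple dissimilarity for clustering sequences into families."""
    return sum(a != b for a, b in zip(u, v))
```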
- P. Slama and D. Geman, "Identification of family-determining residues in PHD fingers," Nucleic Acids Research, 1-14, 2010. (pdf)
Modeling Cell Signaling Networks
Very high-dimensional data sets are ubiquitous in
computational biology and raise serious challenges
in statistical learning. The difficulties are especially pronounced
when the objective is to uncover a complex statistical dependency
structure within a large number of variables and yet the number of
samples available for learning is relatively limited. This situation
reaches extremes in attempting to reverse engineer transcriptional and
signaling networks. When measured against
the complexity of the systems being studied, the amount of data
available for modeling is minuscule. Simply penalizing model
complexity often biases estimation towards the absence of
interactions. Consequently, discovering rich structure from small
samples is nearly impossible unless the search is severely restricted,
in which case statistical learning may become feasible due to variance
reduction.
Protein signaling networks play a central
role in transcriptional regulation and the etiology of many
diseases. Statistical methods, particularly Bayesian networks, have
been widely used to model cell signaling, mostly for model organisms
and with focus on uncovering connectivity rather than inferring
aberrations. Extensions to mammalian systems have not yielded
compelling results, due likely to greatly increased complexity and
limited proteomic measurements in vivo. We have
proposed a comprehensive statistical model that is anchored to a
predefined core topology, has a limited complexity due to parameter
sharing and uses microarray data of mRNA transcripts as the only
observable components of signaling. Specifically, we accounted for
cell heterogeneity and a multi-level process, representing signaling
as a Bayesian network at the cell level, modeling measurements as
ensemble averages at the tissue level and incorporating
patient-to-patient differences at the population level. Motivated by
the goal of identifying individual protein abnormalities as potential
therapeutic targets, we applied our method to the RAS-RAF network
using a breast cancer study with 118 patients. We demonstrated
rigorous statistical inference, established reproducibility through
simulations and the ability to recover receptor status from available
microarray data. Current work is focusing on specific applications to
cell-signaling in breast cancer.
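A toy rendering of the ensemble-averaging layer of this multi-level structure may help fix ideas. The function below is a stand-in sketch, not the published model: each cell independently draws a network state, and the tissue-level measurement records the mean of the per-cell mRNA levels plus instrument noise.

```python
import random

def tissue_measurement(sample_cell_state, n_cells=1000, noise=0.0):
    """Tissue-level expression as an ensemble average over cells:
    each cell independently draws a network state, and the measurement
    is the mean of the per-cell mRNA levels plus Gaussian noise (a toy
    version of the cell and tissue levels described above).

    sample_cell_state: function returning {gene: mRNA level} for one cell.
    """
    totals = {}
    for _ in range(n_cells):
        for gene, level in sample_cell_state().items():
            totals[gene] = totals.get(gene, 0.0) + level
    return {gene: total / n_cells + random.gauss(0.0, noise)
            for gene, total in totals.items()}
```

The point of the averaging is that cell heterogeneity is marginalized out: what the microarray sees is not any single cell's network state but a population mixture, which is what makes inference at the cell level nontrivial.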
- E. Yoruk, M. Ochs, D. Geman and L. Younes, "A comprehensive statistical model for cell signaling," IEEE/ACM Trans. Computational Biology and Bioinformatics, 592-606, 2011. (pdf)
Molecular Network Regulation
A powerful way to separate signal from noise in biology is to convert
the molecular data from individual genes or proteins into an analysis
of comparative biological network behaviors. However, most existing
analyses do not take into account the combinatorial nature of gene
interactions within the network. Together with Nathan Price, James
Eddy and Leroy Hood, I am exploring a new method for analyzing gene
interactions based entirely on relative expression values of
participating genes, that is, on the ordering of expression within
pathway profiles. Our approach provides quantitative measures of how
network rankings differ either among networks for a selected phenotype
or among phenotypes for a selected network. In examining cancer
subtypes and neurological disorders, we have identified networks that
are tightly and loosely regulated, as defined by the level of
conservation of transcript ordering, and observed a strong trend to
looser network regulation in more malignant phenotypes and later
stages of disease. We also demonstrate that variably expressed
networks represent robust differences between disease states.
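A minimal rank-conservation score in this spirit can be sketched as follows. The majority-ordering template and the agreement measure below are simplifying assumptions for illustration: each sample's profile over a network's genes is reduced to its pairwise orderings, and conservation is the mean agreement with the majority ordering across samples.

```python
from itertools import combinations

def ordering_vector(profile, genes):
    """Binary vector of pairwise expression orderings among a network's
    genes for one sample."""
    return [int(profile[i] < profile[j]) for i, j in combinations(genes, 2)]

def rank_conservation(profiles, genes):
    """Mean agreement of each sample's pairwise ordering with the
    majority ordering across samples: values near 1 indicate a tightly
    regulated network, values near 0.5 a loosely regulated one."""
    vecs = [ordering_vector(p, genes) for p in profiles]
    n = len(vecs)
    # Majority vote per gene pair defines the template ordering.
    template = [int(sum(col) * 2 > n) for col in zip(*vecs)]
    agree = [sum(int(a == b) for a, b in zip(v, template)) / len(template)
             for v in vecs]
    return sum(agree) / n
```

Comparing this score between phenotypes for a fixed network then quantifies the loosening of regulation described above.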
- J.A. Eddy, L. Hood, N.D. Price and D. Geman, "Identifying tightly regulated and variably expressed networks by differential rank conservation," PLoS Computational Biology, 2010. (pdf)
OTHER RECENT PROJECTS
TWENTY QUESTIONS THEORY
Together with Gilles Blanchard and students at the Center for Imaging
Science, I have explored the theoretical foundations of a "twenty
questions" approach to pattern recognition. The object of analysis is
the computational process itself rather than probability distributions
(generative modeling) or decision boundaries (predictive
learning). Our formulation is motivated by applications to scene
interpretation in which there are a great many possible explanations
for the data, one ("background") is statistically dominant, and it is
imperative to restrict intensive computation to genuinely ambiguous
regions. We consider sequential testing strategies in which decisions
are made iteratively, based on past outcomes, about which test to
perform next and when to stop testing.
The key structure is a hierarchy of tests, one binary test
("question") for each cell in a tree-structured, recursive
partitioning of the space of hypotheses. A central role in the
mathematical analysis is played by the ratio of the "cost" of a test,
meaning the amount of computation necessary for execution, to the
"selectivity" of the test, meaning the probability of correctly
declaring background (i.e., not hallucinating). Our main result is
that, under a natural model for total computation and certain
statistical assumptions on the joint distribution of the tests,
coarse-to-fine strategies are optimal whenever the cost-to-selectivity
ratio of the test at every cell is less than the sum of the
cost-to-selectivity ratios for the children of the cell. As might be
expected, under mild assumptions, good designs exhibit a steady
progression from broad scope coupled with low power to high power
coupled with dedication to specific explanations.
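In symbols (the notation here merely names the quantities described above: c(A) is the cost of the test attached to cell A, s(A) its selectivity, and A_1, ..., A_k the children of A), the optimality condition for coarse-to-fine search reads:

```latex
% Coarse-to-fine is optimal whenever, at every cell A of the hierarchy,
\frac{c(A)}{s(A)} \;<\; \sum_{i=1}^{k} \frac{c(A_i)}{s(A_i)}
```

Intuitively, asking the coarse question first pays off whenever it is cheap relative to how reliably it rules out background, compared with asking all of its refinements directly.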
- G. Blanchard and D. Geman, "Sequential testing designs for pattern recognition," Annals of Statistics, 33, 1155-1202, 2005. (pdf)
MENTAL IMAGE MATCHING
The standard problem in image retrieval is "query-by-visual-example":
the "query image" resides in an image database and is matched by the
system with other images. Together with Minel Ferecatu and
researchers in the IMEDIA project at INRIA, I am considering a
different scenario in which the query image or, more generally, the
target category, resides only in the mind of the user as a set of
subjective visual patterns, psychological impressions, or "mental
pictures." Since image databases available today are often
unstructured and lack reliable semantic annotations, it is not
obvious how to initiate a search session; this is the "page zero
problem." We propose a new statistical framework based on relevance
feedback to locate an instance of a semantic category in an
unstructured image database with no semantic annotation. A search
session is initiated from a random sample of images. At each retrieval
round, the user is asked to select one image from among a set of
displayed images -- the one that is closest in his opinion to the
target class. The matching is then "mental." Performance is measured
by the number of iterations necessary to display an image which
satisfies the user, at which point standard techniques can be employed
to display other instances. We have developed a Bayesian formulation
which scales to large databases. The two key components are a response
model which accounts for the user's subjective perception of
similarity and a display algorithm which seeks to maximize the flow of
information. Experiments with real users and two databases of 20,000
and 60,000 images demonstrate the efficiency of the search process.
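One round of the Bayesian update can be sketched as follows. The softmax response model below is an illustrative stand-in for the paper's user model, and the function names are assumptions: the user's click on the displayed image nearest (in their mind) to the target is treated as a noisy observation and used to reweight the posterior over candidate targets.

```python
import math

def update_posterior(posterior, displayed, chosen, dist, sigma=1.0):
    """One relevance-feedback round: the user picks, from the displayed
    set, the image closest in their mind to the target.

    posterior: dict image id -> probability that it resembles the target.
    displayed: image ids shown this round; chosen: the id clicked.
    dist: function (id, id) -> feature-space distance.
    The response model is a softmax over negative feature distances:
    the closer a displayed image is to the true target, the more likely
    the user is to choose it.
    """
    updated = {}
    for target, prob in posterior.items():
        weights = {d: math.exp(-dist(d, target) / sigma) for d in displayed}
        likelihood = weights[chosen] / sum(weights.values())
        updated[target] = prob * likelihood
    z = sum(updated.values())
    return {t: p / z for t, p in updated.items()}
```

The display algorithm then chooses the next set of images to show so that the user's answer is expected to be maximally informative, in the same spirit as the information-driven test ordering above.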
- M. Ferecatu and D. Geman, "A statistical framework for image category search from a mental picture," IEEE Trans. PAMI, 31, 1087-1101, 2009. (pdf)
- Y. Fang and D. Geman, "Experiments in mental face retrieval," Proceedings AVBPA 2005, Lecture Notes in Computer Science, 637-646, 2005. (Best Student Paper Award) (pdf)