Training and Education
Recent technical advances in large-scale sequencing and genomics methods as well as in communications have triggered a scientific revolution with immense potential for extending biological knowledge. They have also posed an immense challenge: how to make optimal use of vast quantities of biological data. Without long-term high quality mechanisms for accessing and analyzing the data, the resources used in generating the data are in danger of going to waste. My long-term goal is to discover the rules and mechanisms underlying the workings of a flowering plant (Arabidopsis thaliana) by building an infrastructure to bring all the available data together, developing computer programs that infer knowledge based on the available data, and engaging the research community to test the inferences. Towards this end, we need the following: 1) standards to code not only how much is known to what extent, but also how much is unknown; 2) a collaborative environment that allows researchers to share information and knowledge effectively; 3) systematic, multi-disciplinary approaches for generating, analyzing, and interpreting data capable of handling large-scale datasets without sacrificing data quality; and 4) multi-disciplinary approaches to develop efficient methods of inferring knowledge. This will result not only in new paradigms in plant biology but also in advancement of our knowledge to a point where we can effectively manipulate the flora to improve human health and our environment.
I have been involved in several ongoing projects that address some of the needs stated above. The projects can be grouped into three categories: biological databases, bio-ontologies, and systems approaches in biology. Biological databases include a database for all information about a single organism, a database for a specific type of information (metabolism) in many species, and a database for managing and exploring literature data for any type of system of interest. Bio-ontologies include designing and building ontologies specific to particular domains of biological knowledge, such as biological processes, molecular functions, and cellular components of all organisms, and anatomical parts and developmental stages of flowering plants. Systems approaches include two small projects in collaboration with plant biologists to address questions about specific aspects of Arabidopsis biology: deciphering the transcriptional regulatory circuit for cold acclimation in plants, and systematically determining the subcellular and tissue localization of proteins of unknown function in planta.
In addition to the projects described above, I have a personal mission to mobilize the research community to contribute to biological databases and share knowledge and expertise, to bridge the gaps of information dissemination between traditional scientific journals and biological databases, and to bridge the gap between biologists and computer scientists. I believe that the plant biology community is not taking full advantage of the recent advances in communications and technology. Through TAIR, we are creating and testing mechanisms for researchers to provide data and expertise directly to a database. I am communicating with publishers of major plant journals to share data and establish cross-references between journal websites and databases. I am also in communication with an open-access publisher to create a joint journal devoted to publishing papers that are not suitable for traditional journals such as functional genomics like microarray data, methods, and reproducible negative results. Finally, I believe that major breakthroughs in bioinformatics will come from in-depth collaborations between biology experts and computer science experts rather than from people who know a little bit of both. As an editor for Plant Physiology, I am managing the publication of bioinformatics papers in this journal in order to educate plant biologists about bioinformatics. I would be very interested in doing the converse: bringing biology papers into a computer science journal.
There are three types of biological database projects in my group: an organism-specific database (TAIR), a metabolism database (MetaCyc), and a literature curation database (PubSearch). All three projects are carried out in collaboration with other groups. Four years ago, we created TAIR (The Arabidopsis Information Resource, arabidopsis.org) in collaboration with software developers at the National Center for Genome Resources. It is a comprehensive Web-based information resource for the model plant Arabidopsis thaliana. Our primary goal was to develop a new information infrastructure containing all available genomic and genetic data and make it accessible to the public through a set of user-friendly search, browse, and visualization tools. In addition to a comprehensive database and web applications, we developed a set of standards in the semantics and syntax of the data to facilitate curation, exchange, and analysis. It is one of the most used resources for plant research today, with about 900,000 page views from about 30,000 unique IP addresses per month. Currently there are 12,752 registered users and 4,745 laboratories, making our user group one of the largest organism-based biological research communities. MetaCyc (www.metacyc.org) is a collaboration with Peter Karp's group at SRI International and aims to represent all experimentally studied metabolism information (including pathways, reactions, enzymes, compounds, and cellular locations) from microbes and plants in computer- and human-readable formats. It has tremendous potential for genomics (serving as a reference database for inferring metabolic pathway annotation using sequence similarity measures of the enzyme sequences), metabolic engineering (comparing metabolic pathways in different organisms), and biological databases (providing detailed, experimentally verified information for genes of interest).
For most biological databases, the literature is one of the main data sources, and significant resources are devoted to capturing this information. Our long-term goal is to develop a set of systematic procedures and tools for integrating knowledge from the confined context of a research article into the dynamic, broad context of a biological database. We have developed a literature curation tool called PubSearch (www.pubsearch.org), which stores literature, gene, functional annotation, and keyword data in a stand-alone database and allows curators to establish associations between these data types using a web browser. In collaboration with Simon Twigger's group at the Medical College of Wisconsin, we are extending PubSearch to include a literature fetching function (PubFetch) and work-tracking function (PubTrack) to create a comprehensive environment to manage the literature data.
Although biology is one of the complex systems where large bodies of knowledge exist, descriptions of the rules underlying that knowledge reside in a thick semantic soup. Attempts to standardize nomenclature across organisms have essentially failed, and standardization remains a difficult task even within a single organism research community. Recently, a few model organism databases have joined forces to standardize the semantics for describing the biological processes, molecular functions, and cellular components of all organisms (Gene Ontology (GO) Consortium, www.geneontology.org), and my group has been an integral part of this effort since 2000. Although the use of GO is becoming a standard, it has some limitations. For example, it does not accommodate anatomical parts or developmental stages of a multicellular organism. Furthermore, it does not attempt to describe traits or phenotypes. In order to accommodate the description of genes and gene products in Arabidopsis, we developed orthogonal vocabulary systems for anatomical parts and developmental stages. In addition, we have established a collaboration with other plant model organism databases such as MaizeDB, Gramene, and IRRI, in a project called the Plant Ontology Consortium (www.plantontology.org), to develop shared anatomy and developmental stage ontologies for flowering plants. The establishment and usage of these shared, controlled vocabularies will allow researchers to query across all organisms for knowledge and begin to address correlations between structure and function in explicit, systematic ways.
If we could obtain all the necessary facts about a biological system in computer- and human-comprehensible ways, we could start to ask new questions about biology. I have two recently started projects in the systems approaches category. One project aims to decipher the transcriptional regulatory network involved in cold acclimation in plants (http://aztec.stanford.edu/cold/), and the other attempts to identify the subcellular location of several hundred proteins of unknown function (http://aztec.stanford.edu/gfp/). The cold-acclimation regulatory circuit project is a collaboration with Mike Thomashow and colleagues at Michigan State U. and Oregon State U. We are using a combination of microarray analysis, promoter analysis, phylogenetic analysis, and reverse genetics approaches in cold-acclimating plants such as Arabidopsis and barley and in non-acclimating plants such as rice and tomato to ask which genes are involved specifically in cold acclimation and how those genes are transcriptionally regulated.
In an effort to systematically characterize Arabidopsis proteins with unknown function, we are collaborating with four cell biology labs (David Jackson at Cold Spring Harbor Laboratory, David Ehrhardt at Carnegie Institution, Vitaly Citovsky at SUNY Stony Brook, and Natasha Raikhel at UC Riverside) to identify, in planta (using real-time images of live cells in intact plants), the subcellular localization of the proteins encoded by approximately 800 genes that have no known function. In addition to discovering localization patterns of these novel proteins, we are already identifying potential novel organelles and suborganelles.
FUTURE PLANS (NEXT FIVE YEARS)
In the next five years, I would like to continue the three categories of the projects (biological databases, bio-ontologies, and systems approaches) but make a transition from developing infrastructure and tools to creating applications that use the infrastructure to infer new information or identify patterns. However, I value the critical importance of maintaining and updating the resources, which will be done by professional curators and software developers. Personally, I would like to develop programs that can, for example, predict function based on the knowledge and information embedded in TAIR. Also, I am interested in analyzing the bio-ontologies and their annotations to identify any novel patterns, both regular and irregular. In addition to continuing the existing projects, I intend to initiate a couple of new projects, one on building an infrastructure for metabolomics and the other on analyzing the correlation between networking and scientific success in collaboration with social scientists.
I would like to transform TAIR into a discovery environment for all plant researchers, educators, and students. The proposed work will include a comprehensive annotation of the genome, transcriptome and proteome, including regulation and phenotype information. TAIR will provide access to all public data resulting from large-scale 'omics' research and traditional 'hypothesis-driven' research in intuitive, powerful, and highly integrated views capable of facilitating new discoveries about plant development and physiology. The project will continue to develop controlled vocabularies and standardized data exchange mechanisms for maximal interoperability with other biological databases and will provide data in explicitly defined and structured formats to facilitate programmatic data retrieval. TAIR's strong support within the plant research community will be utilized to create networks of information connecting TAIR to other plant databases, web resources for specific types of Arabidopsis information, and traditional scientific journals. In addition, TAIR's role as an essential resource in the plant research community requires that a mechanism for long-term support of the project be established. To that end, several potential ways to generate revenues will be explored.
In the next five years, we will focus on completing the plant metabolism information in MetaCyc to a gold standard such that it can effectively replace the textbooks. Towards that end, we will actively solicit collaboration from classical biochemists and other colleagues from the Society of Phytochemistry, in addition to curating data from the primary literature, reviews, and textbooks and results from functional genomics and proteomics experiments. Once the known information is complete and updated in the database, we can start to ask questions about missing information (e.g. enzymes, compounds, and pathways that are missing in one organism as compared to another). In addition, we should be able to ask questions about the differences and similarities between strategies taken by different organisms.
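The kind of "missing information" question described above can be sketched as a simple set comparison. This is only a minimal illustration under stated assumptions: the pathway names, reaction identifiers, and organism names are hypothetical placeholders, and the data structures are not the MetaCyc schema or API.

```python
from typing import Dict, List, Set

# Hypothetical reference pathways: pathway name -> reactions it requires.
PATHWAYS: Dict[str, Set[str]] = {
    "pathway_1": {"rxn_a", "rxn_b", "rxn_c"},
    "pathway_2": {"rxn_d", "rxn_e"},
}

# Hypothetical per-organism annotations: organism -> reactions with a known enzyme.
ORGANISM_REACTIONS: Dict[str, Set[str]] = {
    "organism_x": {"rxn_a", "rxn_b", "rxn_c", "rxn_d"},
    "organism_y": {"rxn_a", "rxn_c", "rxn_d", "rxn_e"},
}


def pathway_holes(pathways: Dict[str, Set[str]],
                  reactions: Set[str]) -> Dict[str, List[str]]:
    """Return, for each incomplete pathway, the reactions with no annotated enzyme."""
    holes = {}
    for name, required in pathways.items():
        missing = required - reactions
        if missing:
            holes[name] = sorted(missing)
    return holes


def missing_relative_to(target: str, reference: str,
                        org_reactions: Dict[str, Set[str]]) -> Set[str]:
    """Reactions annotated in the reference organism but absent from the target."""
    return org_reactions[reference] - org_reactions[target]


if __name__ == "__main__":
    for organism, reactions in ORGANISM_REACTIONS.items():
        print(organism, pathway_holes(PATHWAYS, reactions))
    print(missing_relative_to("organism_x", "organism_y", ORGANISM_REACTIONS))
```

Such a comparison is only meaningful once the reference pathways are close to complete, which is why the curation effort comes first.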
For PubSearch, I am interested in collaborating with computer scientists to incorporate methods such as Natural Language Processing for more automated literature curation. Our experience of manual and semi-manual extraction of knowledge from literature would provide a good baseline for such a collaborative project.
One of the immediate applications of bio-ontologies is in associating biological objects such as genes. This allows quantitative comparison of genes and can facilitate interoperability (querying of one database by another) if multiple databases use the same ontologies to annotate data objects. I am interested in analyzing the ontologies and their annotated data objects in TAIR, GOC, and POC databases to determine the global organization patterns of the ontologies and the genes using graph theoretical calculations. I am also interested in creating and using ontologies for complex information and will focus on describing phenotype information using multiple ontologies.
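One simple way to make the quantitative comparison of genes concrete is to compare the sets of ontology terms annotated to each gene, after extending each set with all ancestor terms in the ontology graph. The sketch below assumes a toy ontology fragment and hypothetical gene names; it uses Jaccard overlap, which is only one of many possible measures and is not necessarily the calculation intended here.

```python
from typing import Dict, Iterable, Set

# Toy ontology fragment (illustrative only): term -> its direct parent terms.
PARENTS: Dict[str, Set[str]] = {
    "response to cold": {"response to temperature stimulus"},
    "response to heat": {"response to temperature stimulus"},
    "response to temperature stimulus": {"response to stimulus"},
    "response to stimulus": {"biological process"},
    "photosynthesis": {"biological process"},
    "biological process": set(),
}

# Hypothetical gene annotations: gene -> directly annotated terms.
ANNOTATIONS: Dict[str, Set[str]] = {
    "geneA": {"response to cold"},
    "geneB": {"response to heat"},
    "geneC": {"photosynthesis"},
}


def with_ancestors(terms: Iterable[str], parents: Dict[str, Set[str]]) -> Set[str]:
    """Return the given terms plus every ancestor reachable in the ontology DAG."""
    closed: Set[str] = set()
    stack = list(terms)
    while stack:
        term = stack.pop()
        if term not in closed:
            closed.add(term)
            stack.extend(parents.get(term, ()))
    return closed


def similarity(terms_a: Set[str], terms_b: Set[str],
               parents: Dict[str, Set[str]]) -> float:
    """Jaccard overlap of the two ancestor-closed annotation sets (0.0 to 1.0)."""
    closed_a = with_ancestors(terms_a, parents)
    closed_b = with_ancestors(terms_b, parents)
    union = closed_a | closed_b
    return len(closed_a & closed_b) / len(union) if union else 0.0


if __name__ == "__main__":
    print(similarity(ANNOTATIONS["geneA"], ANNOTATIONS["geneB"], PARENTS))
    print(similarity(ANNOTATIONS["geneA"], ANNOTATIONS["geneC"], PARENTS))
```

Because the ancestor closure is taken before comparing, two genes annotated to sibling terms score higher than two genes that share only the root term, which is the behavior a shared ontology is meant to provide.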
I am particularly interested in the preliminary results from the projects in the systems approaches category. From the cold acclimation project, we found that transcripts are turned on in a series of waves as a function of time in cold-treated Arabidopsis. In order to group the genes into more discrete regulons (genes that are regulated by the same transcription factor(s)), I feel that we need to learn more about the potential promoter regions. Towards that end, we have gathered and curated all the experimentally verified cis-elements from biological databases. Using PubSearch, we can efficiently extract all other experimentally verified cis-elements from the literature. We will use this dataset to map the non-coding sequences of genes and intergenic regions and ask whether there are any high-level patterns of cis-element composition in the non-coding genic and intergenic sequences. If we can define promoters more precisely, then we should be able to develop algorithms to compare promoters. Using better-defined promoter information, we want to reanalyze the microarray data. In addition, we intend to curate all of the known cis-element/transcription factor relationships. I would like to collaborate with computer scientists interested in developing heuristic algorithms that could predict transcription factor/cis-element relationships based on the curated dataset.
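As a rough sketch of what mapping curated cis-elements onto non-coding sequence could look like, the example below scans a sequence on both strands for exact matches to a small motif table. The motif table and the test sequence are illustrative assumptions: CCGAC is used as a CRT/DRE-like core and ACGTG as an ABRE-like core purely as placeholders for a curated dataset, and exact string matching stands in for whatever matching rules the curated elements would actually require.

```python
import re
from collections import Counter
from typing import Dict, List, Tuple

# Illustrative motif table (placeholder for a curated cis-element dataset).
MOTIFS: Dict[str, str] = {
    "CRT/DRE-like": "CCGAC",
    "ABRE-like": "ACGTG",
}

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_elements(seq: str, motifs: Dict[str, str]) -> List[Tuple[int, str, str]]:
    """Return sorted (0-based start, strand, motif name) hits on both strands."""
    seq = seq.upper()
    hits: List[Tuple[int, str, str]] = []
    for name, motif in motifs.items():
        for strand, pattern in (("+", motif), ("-", reverse_complement(motif))):
            if strand == "-" and pattern == motif:
                continue  # palindromic motif: avoid reporting each site twice
            # A zero-width lookahead reports overlapping occurrences as well.
            for match in re.finditer(f"(?={re.escape(pattern)})", seq):
                hits.append((match.start(), strand, name))
    return sorted(hits)


def composition(hits: List[Tuple[int, str, str]]) -> Counter:
    """Count hits per motif name, a crude summary of cis-element composition."""
    return Counter(name for _, _, name in hits)


if __name__ == "__main__":
    upstream = "TTCCGACAAGTCGGTTACGTG"  # hypothetical upstream sequence
    hits = find_elements(upstream, MOTIFS)
    print(hits)
    print(composition(hits))
```

Per-gene composition counts of this kind could then serve as features for grouping genes into candidate regulons, though real cis-elements are often degenerate and would need position weight matrices or similar models rather than exact strings.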
The unknown protein localization project also has several avenues we want to pursue. First, we want to use the experimental data as a training set to determine whether we can identify any new targeting/localization signals and motifs, either by using existing algorithms (e.g. TargetP) or by developing new algorithms in collaboration with computer scientists. In addition, this project is just starting to produce results, and the first 1% of the unknown proteins not only revealed interesting localization patterns such as cell-type and tissue specificity, but also uncovered some novel localization patterns. Some of these novel patterns may represent previously undetected organelles or suborganelles. We are set to capture localization images of 800 genes this year and have submitted a renewal to do another 4000 genes. Even within the current grant period, we will produce about 8000 images. In order to group the novel patterns into categories and analyze all of the images efficiently, we need to be able to perform content-based image searching and to cluster the images. I would be very interested in collaborating with computer scientists to develop such programs and use them to analyze these localization patterns.
TRAINING AND EDUCATION
While I have not been involved in teaching students in a classroom setting, I have trained research scientists in the use of biological databases and bioinformatics through more than 12 workshops conducted at Stanford, Berkeley, and international meetings. I have also trained 10 students, ranging from high-school students to PhD candidates, who have come to my lab as visiting students, interns, and curator assistants. Finally, I have trained 4 postdoctoral researchers in my lab.
If I were given an opportunity to design courses for teaching students in a classroom setting, I would be interested in teaching biological concepts to computer science students. I believe each of the three categories of my research interests, biological databases, bio-ontologies, and systems approaches in biology, could easily be made into courses. In addition, I think a course on 'modern biological approaches' with a historical and futuristic perspective and examples of key successes and failures might be interesting. Also, a course on 'unsolved mysteries in biology today' that delineates some of the unexplained phenomena in biology might be fun.
Rhee SY, Beavis W, Berardini TZ, Chen G, Dixon D, Doyle A, Garcia-Hernandez M, Huala E, Lander G, Montoya M, Miller N, Mueller LA, Mundodi S, Reiser L, Tacklind J, Weems DC, Wu Y, Xu I, Yoo D, Yoon J, Zhang P. (2003) The Arabidopsis Information Resource (TAIR): a model organism database providing a centralized, curated gateway to Arabidopsis biology, research materials and community. Nucleic Acids Research 31(1):224-228.
Mueller LA, Zhang P, Rhee SY. (2003) AraCyc: a biochemical pathway database for Arabidopsis. Plant Physiology 132(2):453-460.
Krieger CJ, Zhang P, Mueller L, Wang A, Paley S, Arnaud M, Pick J, Rhee SY, Karp P. (2004) MetaCyc: recent enhancements to a database of metabolic pathways and enzymes in microorganisms and plants. Nucleic Acids Research 32 Database issue:D438-442.
Rhee SY. (2004) Carpe diem. Retooling the publish or perish model into the share and survive model. Plant Physiology 134(2):543-547.
Bard JL, Rhee SY. (2004) Ontologies in biology: design, applications and future challenges. Nature Reviews Genetics 5(3):213-222.
Harris et al. (2004) The Gene Ontology (GO) database and informatics resource. Nucleic Acids Research 32 Database issue:D258-261.