RESEARCH GOALS AND ACCOMPLISHMENTS
My goal is to build an infrastructure that allows researchers to share information and knowledge in order to identify new insights and facilitate the process of generating new paradigms in biology. A long-term goal is to systematically delineate what is known and unknown in order to mobilize the research community to uncover the rules underlying the workings of an organism.
One of the most efficient ways of solving problems in biology is to use model organisms or systems, in which basic rules are uncovered and then applied to more diverse sets of organisms and problems. For higher plants, Arabidopsis thaliana has been adopted as a model organism due to its small genome size, self-compatibility, and short generation time. Since its adoption as a model organism, many tools have been developed for this plant, including facile and efficient methods of transformation, a complete genome sequence, and high-density genetic maps. Capturing and representing biological knowledge from studies using Arabidopsis thaliana is the subject of my research. More specifically, my group has developed a computer-based infrastructure to capture research community information and the knowledge generated in the research literature, and has developed a query/analysis/visualization system that allows researchers to identify correlations in the information. In the future, we would like to develop a knowledge-capture system that brings research findings directly into the computer infrastructure, and a simulation system that can accurately predict the outcome of any scenario that may occur in the plant.
The Arabidopsis Information Resource
PubSearch: A Comprehensive Literature Extraction and Curation System
Gene Ontology Consortium and Plant Ontology Consortium
I. The Arabidopsis Information Resource (TAIR): A Comprehensive Infrastructure for Arabidopsis Biology Information
Most knowledge resides in the minds of individual researchers and their laboratories. Some of this knowledge is refined in the form of publications. With approximately 11,000 researchers and 4,000 laboratories around the world, the Arabidopsis research community is arguably the largest model organism research community to date, with the possible exception of the human biology research community. Drosophila melanogaster, an insect that has been the subject of genetic research for almost 100 years (a history more than five times as long as that of Arabidopsis), has a community about half the size of the Arabidopsis community, at about 5,000 researchers.
In order to capture the knowledge from this large research community, we need to develop an infrastructure that allows researchers to find and share the information and knowledge generated. Advances in computer science and communications technology have established the internet as the most efficient medium for exchanging knowledge. In addition, advances in high-throughput technologies such as sequencing and microarray methods have allowed biologists to produce large quantities of data. Developing an infrastructure to house and make accessible these large quantities of data has been a problem for many research communities. In collaboration with information technology scientists at the National Center for Genome Resources in Santa Fe, New Mexico, my group has been engaged in developing an infrastructure to house the vast quantities of information for Arabidopsis. The infrastructure is called the Arabidopsis Information Resource (TAIR, http://arabidopsis.org), which is accessible via commonly used web browsers and can be searched and downloaded in a number of ways. For example, researchers can identify genes or proteins of interest based on many parameters (e.g. subcellular localization, expression patterns, or mutant phenotypes) from the text-based search forms, sequence analysis tools, or bulk query forms. SeqViewer (http://arabidopsis.org/servlets/sv) allows visualization of these genes on the genome decorated with clones, transcripts, genetic markers and polymorphisms. The SeqViewer interactively displays the genome from the whole chromosome down to 10 kb of nucleotide sequence. Alternatively, researchers can visualize these genes mapped on metabolic pathways from the whole cell level down to individual reactions along with metabolic compound structures using AraCyc (http://arabidopsis.org/tools/aracyc).
Upon finding relevant information about genes, researchers can order associated DNA or seed stocks from the Arabidopsis Biological Resource Center (ABRC, http://arabidopsis.org/arbrc). Detailed and up-to-date information about the database content as well as its usage statistics can be found online (http://arabidopsis.org/about).
TAIR uses an object-oriented approach to data representation and software architecture. The underlying database is implemented in a relational database management system (Sybase version 11.9.2). The data are organized in a hierarchical structure in which a parent table groups a set of child tables with similar attributes, and each node can be linked to other nodes and tables. At the top of the data hierarchy is the TairObject class, which is linked to other top parent classes such as Attribution (source of the data), Reference (experimental evidence source), and Annotation (descriptive information). Thus, the Attribution, Reference and Annotation classes constitute the metadata of all TAIR objects. This design makes it easy to add new data types, offers flexibility, and minimizes the number of linking tables. More detailed information about the database schemas and documentation can be found online (http://arabidopsis.org/search/schemas.html).
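The hierarchy described above can be illustrated with a minimal Python sketch. Only the class names TairObject, Attribution, Reference, and Annotation come from the text; the fields, the Gene child type, and the helper function are hypothetical and do not reflect the actual TAIR schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Attribution:
    """Source of the data (the field name is an assumption)."""
    source: str


@dataclass
class Reference:
    """Experimental evidence source (the field name is an assumption)."""
    citation: str


@dataclass
class Annotation:
    """Descriptive information (the field name is an assumption)."""
    text: str


@dataclass
class TairObject:
    """Top of the hierarchy: every data type carries the same metadata links."""
    name: str
    attributions: List[Attribution] = field(default_factory=list)
    references: List[Reference] = field(default_factory=list)
    annotations: List[Annotation] = field(default_factory=list)


@dataclass
class Gene(TairObject):
    """Hypothetical child type: adds one attribute and inherits the metadata."""
    chromosome: int = 0


def metadata_summary(obj: TairObject) -> Dict[str, int]:
    """Count the metadata records attached to any TairObject subtype."""
    return {
        "attributions": len(obj.attributions),
        "references": len(obj.references),
        "annotations": len(obj.annotations),
    }
```

Because the metadata links live on the parent, a new data type in this sketch needs only a subclass with its own attributes, which mirrors the stated advantage of easy expansion with few linking tables.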
TAIR software is developed in a client-server mode using Java Servlet technology. All applications are accessible to users through common web browsers to accommodate the widest possible range of user platforms and operating systems. Software for accessing the database is developed using an object-oriented architecture. A set of Java classes called TAIR Foundation Classes provides a number of functions to the front-end applications, which use Java Server Pages. Documentation of the TAIR Application Program Interface can be found in the ‘About TAIR’ section of the home page. A set of bulk download tools based on flat files uses CGI scripts written in Perl. Finally, a number of static HTML pages, updated weekly, serve relevant Arabidopsis and external links information to the community.
This project, in its third year, is accessed by about 20,000 unique internet addresses per month. Approximately 2.5 million hits and 500,000 web pages are accessed by researchers around the globe every month. TAIR is currently the most visible Arabidopsis project. For example, when searching for the word ‘Arabidopsis’ on Google (http://google.com), TAIR is at the top of the list.
II. PubSearch: A Comprehensive Literature Extraction and Curation System
Peer-reviewed research articles remain the best medium for representing and disseminating the refinement of scientific knowledge. For any model organism database (MOD), the literature is one of the main data sources, and significant resources are devoted to capturing this information. Our long-term goal is to develop a set of systematic procedures and tools for integrating knowledge from the confined context of a research article into the dynamic, broad context of a model organism database.
We have developed a literature curation tool called PubSearch, which stores literature, gene, functional annotation, and keyword data in a stand-alone database and allows curators to establish associations between these data types using a web browser. In PubSearch, first-pass associations between terms (gene names and keywords) and articles are made automatically by a string matching program that indexes terms to articles. Commonly occurring words such as AND, THE, IF (stop words) are filtered out to prevent meaningless associations from being stored. For terms with a higher signal-to-noise ratio, curators verify the matches via the web browser user interface.
PubSearch uses a simple database schema in a MySQL database management system (DBMS) (version 3.21), which can be queried and updated over the internet from a web browser through a password-protected login. The middleware is written in Java (version 1.3) and uses Java Servlet and Java Server Page (JSP) technology. The system is currently running on a Red Hat Linux 7.2 system with Tomcat (version 4.0) as the servlet engine. A demo of the current version of this tool and its documentation can be accessed from:
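The first-pass indexing step can be sketched roughly as follows in Python. This is an illustration only, not the PubSearch implementation (which is written in Java): the function name, the whole-word matching rule, and the sample data in the usage note are assumptions, and the stop-word list contains only the three examples named above.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, Set

# Only the examples given in the text; a working list would be much longer.
STOP_WORDS = {"AND", "THE", "IF"}


def index_terms(articles: Dict[str, str], terms: Iterable[str]) -> Dict[str, Set[str]]:
    """Map each term to the IDs of the articles whose text contains it."""
    hits: Dict[str, Set[str]] = defaultdict(set)
    for term in terms:
        if term.upper() in STOP_WORDS:
            continue  # stop words are skipped, so no association is stored
        # Assumption: match whole words only, ignoring case.
        pattern = re.compile(r"(?<!\w)" + re.escape(term) + r"(?!\w)", re.IGNORECASE)
        for article_id, text in articles.items():
            if pattern.search(text):
                hits[term].add(article_id)
    return dict(hits)
```

For instance, with two made-up abstracts keyed "a1" and "a2", calling index_terms(abstracts, ["XYZ1", "THE", "root hair"]) would return associations only for the made-up gene name and the keyword; the stop word never enters the index. Curators would then review these candidate matches.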
Username: demo Password: demo
The tool has been used and refined for the past 6 months by 7 curators at TAIR and 5 Arabidopsis curators at The Institute for Genomic Research (TIGR) to curate over 12,000 articles. The tool is much more convenient and user-friendly than our old system involving flat files, and our curation work has become much more efficient as a result.
In addition to providing curators with a sophisticated tool to facilitate literature curation, this project significantly impacts three segments of the research community. First, the Arabidopsis research community benefits from access to accurate and consistent annotations of data objects from the literature, which are produced in a fast, efficient manner. Second, researchers engaged in high-throughput genomic projects benefit by having access to reliable, high-quality annotations that can be used to enhance automated annotations. Often sequence comparison is used to predict the potential function of genes and gene products in a newly sequenced organism; accurate and detailed descriptions of a model genome and its complements will improve the accuracy of the newly sequenced organism’s annotation. Third, members of the computer science research community can use the rules, methods and curated data to develop more sophisticated and accurate algorithms to extract and analyze data from the literature. The set of human-curated data, along with the explicit rules used for the annotations, will provide much-needed test data sets for developing and improving algorithms based on methods such as natural language processing and machine learning. This final application of the tool raises the possibility that manual curation of the literature can be greatly reduced, giving our curation teams the freedom to use their scientific training to explore and question the data collected in MODs, leading to new hypotheses and potential discoveries.
III. Gene Ontology Consortium and Plant Ontology Consortium: Establishing systematic ways of describing biology for all organisms in both human and machine-readable forms
Although biology is one of the complex systems where large bodies of knowledge exist, descriptions of the rules underlying that knowledge reside in a thick semantic soup. Attempts to standardize nomenclature across organisms have essentially failed, and standardization remains a difficult task even within a single organism research community. Recently, a few model organism databases (yeast, mouse, and Drosophila) have joined forces to standardize the semantics with which to describe the roles of genes and gene products (Gene Ontology (GO) Consortium, http://www.geneontology.org), and my group has been an integral part of this effort since 2000. GO attempts to describe the roles of genes and gene products in three large aspects: molecular function, biological process, and cellular component. Controlled vocabularies within each of these three aspects are structured as directed acyclic graphs (DAGs), which allow each term to have multiple parents. Two types of parent-child relationship, ‘is a’ and ‘part of’, currently exist in GO. Since joining this group, we have added over 500 terms relevant for plants and restructured about 400 terms within the ontologies to better reflect plant biology. Collectively, the consortium has developed over 12,000 terms. This project has been well received by the biology community and is currently used by over 10 large databases around the world, including SWISS-PROT and TIGR, and is being implemented into MEDLINE.
Although the use of GO is becoming a standard, it has some limitations. For example, it does not accommodate anatomical parts or developmental stages of a multicellular organism. Furthermore, it does not attempt to describe traits or phenotypes. In order to accommodate the description of genes and gene products in Arabidopsis, we developed orthogonal vocabulary systems for anatomical parts and developmental stages, in collaboration with Jonathan Clarke at the John Innes Centre, UK. In addition, we have established a collaboration with other plant model organism databases such as MaizeDB, Gramene, and IRRI, in a project called the Plant Ontology Consortium, to develop shared anatomy and developmental stage ontologies. In this project, Arabidopsis vocabularies have been used as the baseline onto which terms from other plants have been added and the structures modified, with the goal of accommodating the description of all plant genes and gene products.
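A directed acyclic graph of this kind can be sketched in a few lines of Python. The terms and links below are a toy example chosen for readability, not actual GO content, and the relationship labels is_a and part_of are simply the two types named above written as identifiers.

```python
from typing import Dict, List, Set, Tuple

# child term -> list of (relationship, parent term); a term may have several parents.
# Toy data for illustration only, not taken from the Gene Ontology itself.
ONTOLOGY: Dict[str, List[Tuple[str, str]]] = {
    "thylakoid": [("part_of", "chloroplast")],
    "chloroplast": [("is_a", "plastid")],
    "plastid": [("is_a", "organelle"), ("part_of", "cell")],
    "organelle": [],
    "cell": [],
}


def ancestors(term: str, graph: Dict[str, List[Tuple[str, str]]]) -> Set[str]:
    """Collect every term reachable by following parent links upward."""
    seen: Set[str] = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for _relationship, parent in graph.get(current, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def parents(term: str, graph: Dict[str, List[Tuple[str, str]]], relationship: str) -> List[str]:
    """Return the direct parents linked to a term by one relationship type."""
    return [parent for rel, parent in graph.get(term, []) if rel == relationship]
```

Because a term can have more than one parent, a gene annotated to a specific term in this toy graph is also found by a query on any of its ancestors, which is the practical benefit of the DAG structure over a strict tree.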
The establishment and usage of these shared, controlled vocabularies will allow researchers to query across all organisms for knowledge and begin to address correlations between structure and function in explicit, systematic ways.
FUTURE PLANS IN THE NEXT FEW YEARS
I. Enhancement of TAIR schema and content
Currently the information in TAIR is heavily focused on the finished genome and its gene complements. In the next few years, we would like to enhance the structure of the TAIR database to represent more information about gene products. This includes genetic, physical, and regulatory relationships among genes and gene products. In addition, the relationship between genotype (a polymorphism in a sequence) and phenotype (of a germplasm harboring the polymorphism(s)) will be established. Finally, more derived relationships of genes and gene products will be stored; these include gene family information based on phylogenetic analysis, expression clusters based on microarray data analysis, and metabolic pathway groupings based on enzymatic assays.
II. Enhancement of TAIR’s query and data input systems
Most of the initial efforts on the TAIR project went into developing a database structure to store complex data types and relationships to represent Arabidopsis biology. In addition, a set of sophisticated query and data retrieval software has been implemented. However, the current set of query tools does not reflect the underlying complexity of the database structure. In the next few years, we will focus on developing a comprehensive set of query tools that allow researchers to retrieve any combination or correlation of the data stored in TAIR. In effect, we will be developing a user interface with which researchers can design and execute Structured Query Language (SQL) queries against the TAIR database.
In addition, we will develop a set of data entry and update tools to allow researchers to add and update any information in the database. Currently, we have an interactive data entry system only for person or organization profile information. We plan on expanding this to allow researchers to add information about genetic markers, genes, proteins, microarray experiments, etc. In addition, we will implement a system to allow a researcher to attach his or her own comments to any information at TAIR. Our long-term goal is to establish TAIR as an essential communication and research tool, the first place a researcher goes to find out about any aspect of Arabidopsis biology. Some aspect of in-house curation will always be essential, but we hope to disperse some of the curation responsibilities to the researchers who have generated the data and thus create a co-operative resource.
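One way such an interface could translate a researcher's form selections into SQL is sketched below in Python. This is only an illustration under stated assumptions: the gene table, its column names, and both function names are hypothetical and are not part of the TAIR schema, and Python's built-in sqlite3 module stands in for the Sybase database.

```python
import sqlite3
from typing import Dict, List, Tuple

# Hypothetical whitelist of searchable columns; not the actual TAIR schema.
ALLOWED_COLUMNS = {"name", "localization", "phenotype"}


def build_query(filters: Dict[str, str]) -> Tuple[str, List[str]]:
    """Turn form selections into a parameterized SELECT statement."""
    clauses: List[str] = []
    params: List[str] = []
    for column, value in sorted(filters.items()):
        if column not in ALLOWED_COLUMNS:
            raise ValueError(f"unsupported search field: {column}")
        clauses.append(f"{column} = ?")  # values are bound, never pasted into the SQL
        params.append(value)
    sql = "SELECT name FROM gene"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    return sql + " ORDER BY name", params


def run_query(conn: sqlite3.Connection, filters: Dict[str, str]) -> List[str]:
    """Execute the generated statement and return the matching gene names."""
    sql, params = build_query(filters)
    return [row[0] for row in conn.execute(sql, params)]
```

Restricting column names to a whitelist and binding values as parameters lets the interface expose arbitrary combinations of search fields without allowing arbitrary SQL to reach the database.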
III. Expansion of TAIR for plant researchers
Because the value of Arabidopsis derives from its utility in understanding other plants, our goal is to build an infrastructure that permits facile, high-resolution linking of specific information about Arabidopsis to similar information in all other plants (and vice versa).
Ultimately, our goal is to provide the common vocabulary, visualization tools, and information retrieval mechanisms that permit integration of all knowledge about Arabidopsis into a seamless whole that can be queried from any perspective. Of equal importance for plant biologists, the ideal TAIR will permit a user to use information about one organism to develop hypotheses about less well-studied organisms. In the next few years, we hope to develop user-friendly tools that permit an individual working outside this model species to formulate a query based on their organism of interest, have that query directed to the relevant knowledge in Arabidopsis, and present the information in a way that can be understood by any plant biologist. We will be making efforts to cross-link information in TAIR with information about other plants and organisms in other databases. In addition, we will develop a more comprehensive help system to allow researchers not familiar with Arabidopsis to use the information in TAIR more effectively.
IV. Dissecting the unknown in Arabidopsis
Sequencing the genome revealed the extent of the gaps in our knowledge about Arabidopsis. Approximately 27,000 genes (and 2,000 pseudogenes) have been predicted based on gene prediction programs and sequence comparisons. Of these, approximately 30% have evidence of transcription (e.g. ESTs available) but are not similar to any genes of known function. About 10-15% of the genes do not even have any evidence of transcription (termed ‘hypothetical’). In addition, only approximately 1% of the genes have experimental evidence for subcellular location.
In an effort to systematically characterize the unknown, we are collaborating with four cell biology labs (David Jackson at Cold Spring Harbor Laboratory, David Ehrhardt at Carnegie Institution, Vitaly Citovsky at SUNY Stony Brook, and Natasha Raikhel at UC Riverside) to identify the subcellular localization of approximately 800 genes that have no known function, are not similar to any known genes, and have no localization information. The selected genes with their 5’ and 3’ intergenic regions will be PCR-amplified and fused to GFP, and the transgenic plants harboring the clones will be examined for subcellular localization. Our role will be to develop a Laboratory Information Management System (LIMS) to store and prioritize the candidate genes for cloning based on a number of criteria (including annotation downloaded from TAIR, existence of full-length cDNA, etc.), track the status of the cloning, upload the preliminary results for internal discussions, and export the data to TAIR and other public repositories. In addition, the experimental results from this study will be used to identify potential novel signal peptides and improve subcellular localization prediction algorithms.
V. Education and outreach to scientists, educators, and general public
We plan on expanding the resources at TAIR for education and outreach. First, we will provide educational resources for high school and undergraduate-level teachers (e.g. curricula, protocols, professional development materials) engaged and interested in teaching plant courses and laboratories. In addition to gathering these materials ourselves, we will implement an online submission form for teachers and scientists to submit useful, classroom-tested protocols. Second, we will establish a community of teachers and scientists by setting up a mailing list and actively recruiting members from the scientific community to be involved as advisors for the teachers. Third, we are developing a set of extensive help pages, a glossary, and tutorials for the resources available at TAIR, to help high school and undergraduate-level teachers and students use TAIR for their projects. This aspect of the project will be enhanced by collaborations with teachers who are interested in developing courses that use TAIR. We are currently in discussion with a couple of local high school and community college teachers.