Developing the infrastructure for accessing and analyzing data in the post-genomic era of biological research
The ways in which data are generated and published in biology are changing rapidly. The Internet and Web browsers now give researchers direct access to information held in databases. Likewise, the technology for obtaining and analyzing data is increasing in throughput: the complete genome sequence of a multicellular organism can now be obtained in a matter of a few months, and the expression profile of an entire genome can be measured using microarray technologies. These developments will no doubt accelerate progress in other areas of biological research and change some of the ways in which we do research. They will need to be accompanied by similar advances in several complementary areas if the biological research community is to move into the next generation of formulating questions and solving problems. These areas include: 1) development of the infrastructure to house and analyze these data so that comprehensive analyses from any perspective are facile; 2) advancement in systematic ways of conducting experiments; and 3) development of tools to iterate efficiently through data analysis, hypothesis generation, and systematic experimentation.
I am working on a project, in collaboration with colleagues at the Department of Plant Biology and the National Center for Genome Resources in Santa Fe, NM, to develop the infrastructure to house and analyze comprehensive data for Arabidopsis thaliana. This project is called The Arabidopsis Information Resource (TAIR). Arabidopsis is the first flowering plant whose genome was sequenced (completed at the end of 2000), and an estimated 7000 researchers around the world now work on this plant. In addition, there are a number of collaborative efforts to systematically identify the functions of all genes in Arabidopsis. The rationale for intensive investigation of Arabidopsis is that it is an excellent model for the 250,000 species of higher plants. To maximize the use of the knowledge gathered for this plant, there is a need for a comprehensive database, information retrieval, and analysis system that provides user-friendly access to all information about the organism.
At TAIR, we have taken the initial steps toward realizing these goals by developing a database structure and ways to access, visualize, and analyze the data. The basic structure of the database includes relationships among data objects (clones, genes, sequences, genetic markers, polymorphisms, transcripts, etc.), annotation of those objects (function, map position, expression, etc.), and their attribution (source of data, update history, and references). The underlying database uses an industry-standard relational database management system (Sybase). The basic data access functions are searching, browsing, updating, and downloading the data. In addition, we have developed tools, built with Java Servlet technology, to graphically visualize comprehensive map information for Arabidopsis (TAIR MapViewer) and sequence annotation of the completed genome (TAIR SequenceViewer). The major thrusts in the upcoming year are to: 1) add richness to the database by populating it with curated data, which currently exist in widely varying formats and come from many sources; 2) iterate on database and user interface development to include the new types of data coming from functional genomics projects; and 3) improve ways of accessing, visualizing, and analyzing increasingly complex data types in more integrated ways. Toward these ends, we are developing guidelines for nomenclature and controlled vocabulary to name and describe data objects.
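The three-part structure described above (data objects and their relationships, annotation, and attribution) can be illustrated with a minimal relational sketch. It is an illustration only and not the actual TAIR schema: it assumes Python's built-in sqlite3 module in place of Sybase, and every table name, column name, and record (for example GENE_A and "placeholder lab") is hypothetical. The CHECK constraints stand in for a very small controlled vocabulary.

```python
import sqlite3

# Hypothetical schema: none of these names come from TAIR itself.
SCHEMA = """
CREATE TABLE data_object (
    id   INTEGER PRIMARY KEY,
    kind TEXT NOT NULL CHECK (kind IN ('clone', 'gene', 'sequence',
         'genetic_marker', 'polymorphism', 'transcript')),
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE relationship (
    subject_id INTEGER NOT NULL REFERENCES data_object(id),
    predicate  TEXT NOT NULL,
    object_id  INTEGER NOT NULL REFERENCES data_object(id)
);
CREATE TABLE annotation (
    object_id INTEGER NOT NULL REFERENCES data_object(id),
    category  TEXT NOT NULL CHECK (category IN
              ('function', 'map_position', 'expression')),
    value     TEXT NOT NULL
);
CREATE TABLE attribution (
    object_id INTEGER NOT NULL REFERENCES data_object(id),
    source    TEXT NOT NULL,
    updated   TEXT NOT NULL,
    reference TEXT
);
"""


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database holding a few placeholder records."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO data_object (id, kind, name) VALUES (?, ?, ?)",
        [(1, "gene", "GENE_A"), (2, "transcript", "GENE_A.1")],
    )
    # Relationship: the transcript is linked to the gene it derives from.
    conn.execute(
        "INSERT INTO relationship VALUES (?, ?, ?)",
        (2, "transcribed_from", 1),
    )
    conn.execute(
        "INSERT INTO annotation VALUES (?, ?, ?)",
        (1, "function", "placeholder function"),
    )
    # The source and date below are invented placeholders, not real data.
    conn.execute(
        "INSERT INTO attribution VALUES (?, ?, ?, ?)",
        (1, "placeholder lab", "2001-01-01", None),
    )
    return conn


def describe(conn: sqlite3.Connection, name: str):
    """Gather kind, annotations, sources, and related objects for one name."""
    row = conn.execute(
        "SELECT id, kind FROM data_object WHERE name = ?", (name,)
    ).fetchone()
    if row is None:
        return None
    obj_id, kind = row
    annotations = conn.execute(
        "SELECT category, value FROM annotation "
        "WHERE object_id = ? ORDER BY category",
        (obj_id,),
    ).fetchall()
    sources = [
        r[0]
        for r in conn.execute(
            "SELECT source FROM attribution WHERE object_id = ?", (obj_id,)
        )
    ]
    # Objects that point at this one through the relationship table.
    related = conn.execute(
        "SELECT d.name, r.predicate FROM relationship r "
        "JOIN data_object d ON d.id = r.subject_id WHERE r.object_id = ?",
        (obj_id,),
    ).fetchall()
    return {
        "kind": kind,
        "annotations": annotations,
        "sources": sources,
        "related": related,
    }


if __name__ == "__main__":
    print(describe(build_demo_db(), "GENE_A"))
```

Keeping annotation and attribution in tables separate from the data objects means a new data type from a functional genomics project could, under these assumptions, be added as a new kind or category without restructuring the existing records.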