EVERYTHING'S COMING up roses! Or more precisely, Arabidopsis thaliana, a somewhat nondescript white flower selected by biologists as the model in the plant kingdom for genetic research. Related to broccoli and cauliflower, A. thaliana is the most studied plant in human history; every week new papers are published about its properties. In an astounding leap forward, the journal Nature just announced that the entire genome of A. thaliana has been sequenced -- a first for the vegetal world. As plants go, it was a relatively easy task; the complete genome runs to 120 megabytes of information, compared to 1.6 gigabytes for wheat and a hefty 3 gigabytes for humanity.

What makes this discovery just a bit different from the ever-increasing flow of genetic revelations is that, in another first, Nature has announced that all genomic information presented in its pages -- and on its Web site -- will be published in GEML, or Gene Expression Markup Language, a lingua franca defining a common standard for the bits of life.

Why the need for standards? We need only look back at the roots of the current information age to understand the power of a common standard. Back in 1990, a young and naive Tim Berners-Lee went to a hypertext conference at Versailles, hoping to garner support for a still-nascent set of standards for hypertext information interchange. He found a community -- if you could call it that -- of squabbling companies, each with its own "correct" approach to hypertext, none of them able to work with the others. Hypertext had been around for nearly thirty years -- since Ted Nelson began to work on Project Xanadu -- but had gone nowhere, because this "insanely great" idea had inspired only competitiveness, avarice, and arrogance. Berners-Lee left the conference disappointed, but he succeeded in convincing the powers-that-be at CERN, the European particle physics laboratory, to release HTML and HTTP freely, as open standards.
(Tony Parisi and I took a similar approach with our VRML standard in the mid-nineties.) Thus was the World Wide Web born. It succeeded because it provided a common platform to answer the unprecedented, built-up demand to use computers and their ever-expanding networks as shared resources. Such is the power of open standards.

WHAT IS GEML exactly? It's a DTD (Document Type Definition) for the common expression of genetic information. Those of you who have done any Web design are likely familiar with another markup language defined by a DTD -- HTML -- and its "tags," those little bits of formatting information enclosed by the "< >" symbols. In HTML there are tags such as "TITLE" (which gives a page its title), "B" (for bold), "IMG" (for images) and so forth. GEML has its own tags, which define the kinds of data that interest geneticists. Here's a bit of example GEML:

While all of this is fairly unreadable -- even by geneticists -- it is easily read by a computer, and it might even look vaguely familiar if you've taken a peek at raw HTML. The "reporter" tag records that a sequence of bases (drawn from the four nucleotides that make up DNA) -- TACAGTGTCAGAATTAACTGTAGTC -- has been identified in a particular section of a gene, that it is "feature" 6879 of that gene, at a specific position, and that the gene's "name" is "T89593". GEML can also reference the database of genomic data from which this gene has been extracted, identify the species (in this case, Homo sapiens), and even the "company" that lays claim to the gene.
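To make "easily read by a computer" concrete, here is a minimal Python sketch that reads a reporter-style record like the one described above. The fragment is hypothetical: the element and attribute layout (`reporter`, `feature`, `sequence`, `name`, `species`, `number`) is assumed for illustration and is not copied from the GEML DTD. Only the values T89593, 6879, Homo sapiens and the base sequence come from the description above, and the function name `parse_reporter` is likewise made up.

```python
import xml.etree.ElementTree as ET

# Hypothetical GEML-like fragment. The tag and attribute names are assumptions
# made for this sketch; they are not taken from the real GEML DTD.
SAMPLE = """\
<reporter name="T89593" species="Homo sapiens">
  <feature number="6879"/>
  <sequence>TACAGTGTCAGAATTAACTGTAGTC</sequence>
</reporter>
"""


def parse_reporter(xml_text):
    """Parse one reporter-style record into a plain dictionary."""
    root = ET.fromstring(xml_text)
    if root.tag != "reporter":
        raise ValueError("expected a <reporter> element")

    # Normalize the sequence and check it uses only the four DNA bases.
    sequence = (root.findtext("sequence") or "").strip().upper()
    if not sequence or set(sequence) - set("ACGT"):
        raise ValueError("sequence must contain only A, C, G and T")

    feature = root.find("feature")
    return {
        "name": root.get("name"),
        "species": root.get("species"),
        "feature": int(feature.get("number")) if feature is not None else None,
        "sequence": sequence,
        "length": len(sequence),
    }


if __name__ == "__main__":
    record = parse_reporter(SAMPLE)
    for key, value in record.items():
        print(f"{key}: {value}")
```

The point of the sketch is the one the column makes: once everyone agrees on the tags, a few lines of off-the-shelf parsing code can pull out the gene name, the feature number and the sequence, whoever published the file.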