The primary product of the National Center for Biotechnology Information (http://www.ncbi.nlm.nih.gov) is the GenBank DNA sequence database, which is made available over the Internet free of charge.  GenBank began in 1982 with a database of 606 DNA sequences.  When NCBI took responsibility for the maintenance of GenBank in 1988, it contained 20,579 sequences.  GenBank has continued on its exponential growth trajectory for the past decade and shows no signs of slowing down.
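
As a rough illustration of that growth, the two sequence counts quoted above can be turned into an average yearly growth factor and a doubling time. This is only a sketch: it assumes a constant exponential rate between 1982 and 1988, which two data points cannot confirm, and the function names are invented for the example.

```python
import math

def annual_growth_factor(n_start, n_end, years):
    """Average multiplicative growth per year, assuming steady exponential growth."""
    return (n_end / n_start) ** (1.0 / years)

def doubling_time_years(n_start, n_end, years):
    """Years needed for the count to double at that same steady rate."""
    return years * math.log(2) / math.log(n_end / n_start)

# Figures from the text: 606 sequences in 1982, 20579 sequences in 1988.
years_elapsed = 1988 - 1982
factor = annual_growth_factor(606, 20579, years_elapsed)
doubling = doubling_time_years(606, 20579, years_elapsed)

print(f"average growth factor per year: {factor:.2f}")
print(f"doubling time in years: {doubling:.2f}")
```

Under that assumption the database grew roughly 1.8-fold per year over this period, which corresponds to a doubling time of a little over one year.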
 
 
 
 
 
 


How to Ship the Human Genome Over the Internet
 
 
 


 

The human genome consists of 22 autosomes and 2 sex chromosomes, representing a total of about 3 billion bases of raw sequence.

Here are about 2040 bytes of raw human sequence:
        1 ggtggcgcga gcttctgaaa ctaggcggca gaggcggagc cgctgtggca ctgctgcgcc
       61 tctgctgcgc ctcgggtgtc ttttgcggcg gtgggtcgcc gccgggagaa gcgtgagggg
      121 acagatttgt gaccggcgcg gtttttgtca gcttactccg gccaaaaaag aactgcacct
      181 ctggagcgga cttatttacc aagcattgga ggaatatcgt aggtaaaaat gcctattgga
      241 tccaaagaga ggccaacatt ttttgaaatt tttaagacac gctgcaacaa agcagattta
      301 ggaccaataa gtcttaattg gtttgaagaa ctttcttcag aagctccacc ctataattct
      361 gaacctgcag aagaatctga acataaaaac aacaattacg aaccaaacct atttaaaact
      421 ccacaaagga aaccatctta taatcagctg gcttcaactc caataatatt caaagagcaa
      481 gggctgactc tgccgctgta ccaatctcct gtaaaagaat tagataaatt caaattagac
      541 ttaggaagga atgttcccaa tagtagacat aaaagtcttc gcacagtgaa aactaaaatg
      601 gatcaagcag atgatgtttc ctgtccactt ctaaattctt gtcttagtga aagtcctgtt
      661 gttctacaat gtacacatgt aacaccacaa agagataagt cagtggtatg tgggagtttg
      721 tttcatacac caaagtttgt gaagggtcgt cagacaccaa aacatatttc tgaaagtcta
      781 ggagctgagg tggatcctga tatgtcttgg tcaagttctt tagctacacc acccaccctt
      841 agttctactg tgctcatagt cagaaatgaa gaagcatctg aaactgtatt tcctcatgat
      901 actactgcta atgtgaaaag ctatttttcc aatcatgatg aaagtctgaa gaaaaatgat
      961 agatttatcg cttctgtgac agacagtgaa aacacaaatc aaagagaagc tgcaagtcat
     1021 ggatttggaa aaacatcagg gaattcattt aaagtaaata gctgcaaaga ccacattgga
     1081 aagtcaatgc caaatgtcct agaagatgaa gtatatgaaa cagttgtaga tacctctgaa
     1141 gaagatagtt tttcattatg tttttctaaa tgtagaacaa aaaatctaca aaaagtaaga
     1201 actagcaaga ctaggaaaaa aattttccat gaagcaaacg ctgatgaatg tgaaaaatct
     1261 aaaaaccaag tgaaagaaaa atactcattt gtatctgaag tggaaccaaa tgatactgat
     1321 ccattagatt caaatgtagc acatcagaag ccctttgaga gtggaagtga caaaatctcc
     1381 aaggaagttg taccgtcttt ggcctgtgaa tggtctcaac taaccctttc aggtctaaat
     1441 ggagcccaga tggagaaaat acccctattg catatttctt catgtgacca aaatatttca
     1501 gaaaaagacc tattagacac agagaacaaa agaaagaaag attttcttac ttcagagaat
     1561 tctttgccac gtatttctag cctaccaaaa tcagagaagc cattaaatga ggaaacagtg
     1621 gtaaataaga gagatgaaga gcagcatctt gaatctcata cagactgcat tcttgcagta
     1681 aagcaggcaa tatctggaac ttctccagtg gcttcttcat ttcagggtat caaaaagtct
     1741 atattcagaa taagagaatc acctaaagag actttcaatg caagtttttc aggtcatatg
     1801 actgatccaa actttaaaaa agaaactgaa gcctctgaaa gtggactgga aatacatact
     1861 gtttgctcac agaaggagga ctccttatgt ccaaatttaa ttgataatgg aagctggcca
     1921 gccaccacca cacagaattc tgtagctttg aagaatgcag gtttaatatc cactttgaaa
     1981 aagaaaacaa ataagtttat ttatgctata catgatgaaa cattttataa aggaaaaaaa

At this font size, over 1.4 million web pages such as this one would be needed to display the whole genome.  Moreover, the resulting display would appear meaningless without accompanying annotation and a mapping framework in which to orient oneself.  For instance, one of the 30,000 genes in the human genome starts somewhere in this sequence, but where?  The mystery will be solved later on.
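
The page estimate can be checked with a few lines of arithmetic. The sketch below counts the bases on one line of the sequence listing, multiplies by the 34 lines shown, and divides the genome size by the result. It assumes one byte per base and that every line is a full line like the sample; the helper name is invented for the example.

```python
def bases_in_line(line):
    """Count bases in one line of the listing: a position number, then blocks of letters."""
    fields = line.split()
    # fields[0] is the position number; the remaining fields are blocks of bases.
    return sum(len(block) for block in fields[1:])

# First line of the sequence listing shown above.
sample_line = "1 ggtggcgcga gcttctgaaa ctaggcggca gaggcggagc cgctgtggca ctgctgcgcc"

LINES_SHOWN = 34               # the listing runs from position 1 to position 1981 in steps of 60
GENOME_BASES = 3_000_000_000   # "about 3 billion bases"

bases_per_page = bases_in_line(sample_line) * LINES_SHOWN
pages_needed = GENOME_BASES / bases_per_page

print(f"bases shown on this page: {bases_per_page}")
print(f"pages needed for the whole genome: {pages_needed:,.0f}")
```

With these inputs the page holds 2040 bases, and the whole genome would need a little under 1.5 million such pages, in line with the figure quoted above.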
 

The human genome may be interpreted using a set of maps and annotations.  These maps and annotations, which include the primary sequence data, describe our understanding of the human genome.  To devise an efficient strategy for shipping the human genome, we must take a look at the volumes of information that need to be transferred.
 

Where is the burden of data?

Most of it is in the various maps of the human genome.  Within the Maps category, most of the data is found in the sequence maps.

Maps

   Map Type                  Megabytes of Data Involved

   Sequence                             5674
        Clone                            350
        Contig                          2870
        GenBank                         4430
        Gene_Sequence                    280
        STS                               27
        Variation                        560

   Cytogenetic                            64
        Genes_Cytogenetic                  0.840
        Morbid                            63

   Genetic Linkage                         0.180
        Genethon                           0.060
        Marshfield                         0.120

   Radiation Hybrid                        2.77
        GeneMap99-G3                       0.185
        GeneMap99-GB4                      1.26
        NCBI RH                            0.720
        Stanford G3                        0.185
        Whitehead-RH                       0.420
 

Annotations of Genes                      84
        LocusLink (a catalog of genes)    63
        RefSeq (reference sequences)      21
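
To make the distribution concrete, the short sketch below computes each category's share of the total from the top-level figures in the table. It uses only the category totals as listed (the sub-rows are not re-added), and the dictionary keys are labels chosen for the example.

```python
# Top-level totals from the table above, in megabytes.
sizes_mb = {
    "Sequence maps": 5674,
    "Cytogenetic maps": 64,
    "Genetic linkage maps": 0.180,
    "Radiation hybrid maps": 2.77,
    "Annotations of genes": 84,
}

total_mb = sum(sizes_mb.values())
shares = {name: mb / total_mb for name, mb in sizes_mb.items()}

# Print the categories from largest to smallest share.
for name, share in sorted(shares.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:<24}{sizes_mb[name]:>10.2f} MB {share:>8.2%}")
print(f"{'Total':<24}{total_mb:>10.2f} MB")
```

The total comes to roughly 5825 megabytes, of which the sequence maps account for more than 97 percent. That total is close to the 5800 megabyte scope quoted for the initial Map Viewer search screen.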
 
 

Where is the information?
 

The bulk of the data represented by the human genome is found as sequence data, yet this is the data type that is least comprehensible to a researcher and is relatively information-poor.  On the other hand, the data found in the non-sequence maps and annotations is readily comprehensible and information-rich.  This latter data is, of course, derived from the human genome sequence data, but has been "compressed" by analysis.  A good strategy for shipping the human genome would be to maximize the transfer of the most compact, comprehensible data and minimize the transfer of the least comprehensible, least compact data.  This is the philosophy behind the Human Genome Map Viewer.
 
 

And yet, some want it all no matter the cost!

The Human Genome is our favorite, but there are hundreds of others!
 
 
 
 
 
 
 
 
 
 

GenBank Release 122 Statistics
 

80,000 Species

10 million DNA sequences

12 Billion bases, or characters of information

43 Gigabytes of sequence and annotations
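
Two averages follow directly from these release statistics. The sketch below derives the mean sequence length and the number of bytes stored per base. It assumes the rounded figures above are exact and that one gigabyte is 10^9 bytes, neither of which the release statistics state.

```python
SEQUENCES = 10_000_000      # "10 million DNA sequences"
BASES = 12_000_000_000      # "12 Billion bases"
TOTAL_BYTES = 43 * 10**9    # "43 Gigabytes", assuming 1 gigabyte = 10^9 bytes

mean_length = BASES / SEQUENCES       # average bases per sequence
bytes_per_base = TOTAL_BYTES / BASES  # storage used per base, annotations included

print(f"mean sequence length: {mean_length:.0f} bases")
print(f"bytes stored per base: {bytes_per_base:.2f}")
```

On these assumptions the average record is about 1200 bases long and occupies between three and four bytes per base. Since a base needs only one character, most of the stored volume would be annotation and formatting rather than raw sequence.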
 
 

NCBI has been connected to the Internet2 system for over a year; 36% of our web traffic now travels via Internet2.

Most of this 36% represents FTP downloads of the GenBank DNA sequence database.

Download Activity

270 Gigabytes/day via FTP
163 Gigabytes/day via Web

400 Gigabytes/day shortly after a new GenBank release
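
To put these daily volumes in network terms, the sketch below converts gigabytes per day into an average sustained rate in megabits per second. It assumes one gigabyte is 10^9 bytes and that traffic is spread evenly over 24 hours, so real peak rates would be higher. The function name is invented for the example.

```python
SECONDS_PER_DAY = 24 * 60 * 60

def average_mbit_per_s(gigabytes_per_day):
    """Average rate in megabits per second for a given daily volume in gigabytes."""
    bits_per_day = gigabytes_per_day * 10**9 * 8
    return bits_per_day / SECONDS_PER_DAY / 10**6

ftp_gb = 270   # gigabytes per day via FTP
web_gb = 163   # gigabytes per day via Web
total_gb = ftp_gb + web_gb

for label, volume in [("FTP", ftp_gb), ("Web", web_gb), ("FTP + Web", total_gb)]:
    print(f"{label:<10}{volume:>5} GB/day  {average_mbit_per_s(volume):>6.1f} Mbit/s")
```

Averaged over a full day, the FTP traffic alone corresponds to about 25 megabits per second, and FTP and Web traffic together to about 40 megabits per second.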

Mirror GenBank FTP site at the San Diego Supercomputer Center: ftp://genbank.sdsc.edu/
 

Initial Map Viewer Search Screen

HTML Size: 0.008 Megabytes

Scope of Data: 5800 Megabytes
 
 

Initially, the scope of the data is huge, but the amount transferred is negligible.
 
 
 
 
 
 
 
 
 
 

Initial View of the BRCA2 Gene involved in Breast Cancer
 

HTML Size: 0.060 Megabytes

Scope of Data: 3.0 Megabytes
 
 

The scope of the data is much greater than the amount transferred.
 
 
 
 
 
 
 
 
 

View with additional Maps Added
 

HTML Size: 0.139 Megabytes

Scope of Data: 3.001 Megabytes
 
 

The scope of the data still exceeds the amount transferred.
 
 
 
 
 
 
 
 
 
 
 
 

View Showing 100K of Sequence
 
 

HTML Size: 0.139 Megabytes

Scope of Data: 0.100 Megabytes

At the level of a single gene, the scope of the data displayed is roughly equal to the data transferred over the internet.
 
 
 

 
 
 
 
 
 
 
 

Finally, the Sequence Itself!

HTML Size: 0.03 Megabytes

Scope of Data: 0.002 Megabytes
 

Now, at the sequence level, the transfer of information becomes inefficient since graphics are used to depict a rather small amount of textual information.
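
The trend across the five views can be summarized as a ratio of data scope to data transferred. The sketch below tabulates that ratio from the HTML sizes and scopes quoted for each view. The ratio is simply an illustrative measure chosen here; it is not a metric defined by the Map Viewer.

```python
# (view, HTML size in megabytes, scope of data in megabytes), as quoted above.
views = [
    ("Initial Map Viewer search screen", 0.008, 5800),
    ("Initial view of the BRCA2 gene", 0.060, 3.0),
    ("View with additional maps added", 0.139, 3.001),
    ("View showing 100K of sequence", 0.139, 0.100),
    ("The sequence itself", 0.03, 0.002),
]

# Megabytes of data represented per megabyte of HTML transferred.
ratios = [scope / html for _, html, scope in views]

print(f"{'View':<34}{'HTML MB':>8}{'Scope MB':>10}{'Scope/HTML':>14}")
for (name, html, scope), ratio in zip(views, ratios):
    print(f"{name:<34}{html:>8.3f}{scope:>10.3f}{ratio:>14.2f}")
```

The ratio falls at every step, from several hundred thousand on the search screen to below one once raw sequence is displayed. A ratio below one means more bytes are transferred than the data they depict.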