How to Ship the Human Genome Over the Internet
The human genome consists of 22 autosomes and 2 sex chromosomes, representing a total of about 3 billion bases of raw sequence.
Here are about 2040 bytes of raw human sequence:
   1 ggtggcgcga gcttctgaaa ctaggcggca gaggcggagc cgctgtggca ctgctgcgcc
  61 tctgctgcgc ctcgggtgtc ttttgcggcg gtgggtcgcc gccgggagaa gcgtgagggg
 121 acagatttgt gaccggcgcg gtttttgtca gcttactccg gccaaaaaag aactgcacct
 181 ctggagcgga cttatttacc aagcattgga ggaatatcgt aggtaaaaat gcctattgga
 241 tccaaagaga ggccaacatt ttttgaaatt tttaagacac gctgcaacaa agcagattta
 301 ggaccaataa gtcttaattg gtttgaagaa ctttcttcag aagctccacc ctataattct
 361 gaacctgcag aagaatctga acataaaaac aacaattacg aaccaaacct atttaaaact
 421 ccacaaagga aaccatctta taatcagctg gcttcaactc caataatatt caaagagcaa
 481 gggctgactc tgccgctgta ccaatctcct gtaaaagaat tagataaatt caaattagac
 541 ttaggaagga atgttcccaa tagtagacat aaaagtcttc gcacagtgaa aactaaaatg
 601 gatcaagcag atgatgtttc ctgtccactt ctaaattctt gtcttagtga aagtcctgtt
 661 gttctacaat gtacacatgt aacaccacaa agagataagt cagtggtatg tgggagtttg
 721 tttcatacac caaagtttgt gaagggtcgt cagacaccaa aacatatttc tgaaagtcta
 781 ggagctgagg tggatcctga tatgtcttgg tcaagttctt tagctacacc acccaccctt
 841 agttctactg tgctcatagt cagaaatgaa gaagcatctg aaactgtatt tcctcatgat
 901 actactgcta atgtgaaaag ctatttttcc aatcatgatg aaagtctgaa gaaaaatgat
 961 agatttatcg cttctgtgac agacagtgaa aacacaaatc aaagagaagc tgcaagtcat
1021 ggatttggaa aaacatcagg gaattcattt aaagtaaata gctgcaaaga ccacattgga
1081 aagtcaatgc caaatgtcct agaagatgaa gtatatgaaa cagttgtaga tacctctgaa
1141 gaagatagtt tttcattatg tttttctaaa tgtagaacaa aaaatctaca aaaagtaaga
1201 actagcaaga ctaggaaaaa aattttccat gaagcaaacg ctgatgaatg tgaaaaatct
1261 aaaaaccaag tgaaagaaaa atactcattt gtatctgaag tggaaccaaa tgatactgat
1321 ccattagatt caaatgtagc acatcagaag ccctttgaga gtggaagtga caaaatctcc
1381 aaggaagttg taccgtcttt ggcctgtgaa tggtctcaac taaccctttc aggtctaaat
1441 ggagcccaga tggagaaaat acccctattg catatttctt catgtgacca aaatatttca
1501 gaaaaagacc tattagacac agagaacaaa agaaagaaag attttcttac ttcagagaat
1561 tctttgccac gtatttctag cctaccaaaa tcagagaagc cattaaatga ggaaacagtg
1621 gtaaataaga gagatgaaga gcagcatctt gaatctcata cagactgcat tcttgcagta
1681 aagcaggcaa tatctggaac ttctccagtg gcttcttcat ttcagggtat caaaaagtct
1741 atattcagaa taagagaatc acctaaagag actttcaatg caagtttttc aggtcatatg
1801 actgatccaa actttaaaaa agaaactgaa gcctctgaaa gtggactgga aatacatact
1861 gtttgctcac agaaggagga ctccttatgt ccaaatttaa ttgataatgg aagctggcca
1921 gccaccacca cacagaattc tgtagctttg aagaatgcag gtttaatatc cactttgaaa
1981 aagaaaacaa ataagtttat ttatgctata catgatgaaa cattttataa aggaaaaaaa
At this font size, over 1.4 million web pages such as this would be needed to display the whole genome.
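The page count quoted above can be checked with a short calculation. This is a minimal sketch that assumes one page holds exactly the 2040 bases of the sample and takes the genome as a round 3 billion bases; the names below are illustrative only.

```python
# Figures taken from the text: "about 3 billion bases" and a 2040-base sample.
GENOME_BASES = 3_000_000_000
BASES_PER_PAGE = 2040

def pages_needed(total_bases: int, bases_per_page: int) -> int:
    """Return the number of pages required, counting a partial last page."""
    # Ceiling division using integers only.
    return -(-total_bases // bases_per_page)

pages = pages_needed(GENOME_BASES, BASES_PER_PAGE)
print(f"{pages:,} pages of {BASES_PER_PAGE} bases each")
```

Under these assumptions the result is a little under 1.5 million pages, consistent with the "over 1.4 million" figure.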
Moreover, the resulting display would appear meaningless without accompanying
annotation and a mapping framework in which to orient oneself. For instance,
one of the 30,000 genes in the human genome starts somewhere in this sequence, but where?
The mystery will be solved later on.
The human genome may be interpreted using a set of maps and annotations.
These maps and annotations, which include the primary sequence data, describe
our understanding of the human genome. To devise an efficient strategy
for shipping the human genome, we must take a look at the volumes of information
that need to be transferred.
Most of it is in the various maps of the human genome. Within the Maps category, most of the data is found in the sequence maps.
Maps

Map Type                          Megabytes of Data Involved
Sequence                          5674
  Clone                           350
  Contig                          2870
  GenBank                         4430
  Gene_Sequence                   280
  STS                             27
  Variation                       560
Cytogenetic                       64
  Genes_Cytogenetic               0.840
  Morbid                          63
Genetic Linkage                   0.180
  Genethon                        0.060
  Marshfield                      0.120
Radiation Hybrid                  2.77
  GeneMap99-G3                    0.185
  GeneMap99-GB4                   1.26
  NCBI RH                         0.720
  Stanford G3                     0.185
  Whitehead-RH                    0.420

Annotations of Genes              84
  LocusLink; a catalog of genes   63
  RefSeq; reference sequences     21
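To see how strongly sequence data dominates, the figures above can be totaled. The sketch below assumes that Sequence, Cytogenetic, Genetic Linkage, Radiation Hybrid, and Annotations of Genes are the top-level categories and that the remaining entries are their components; that grouping is a reading of the table, not something the source states outright.

```python
# Megabytes per top-level category, as listed above.
# Assumption: these five rows are category totals and the other rows are
# components of them.
CATEGORY_MB = {
    "Sequence": 5674,
    "Cytogenetic": 64,
    "Genetic Linkage": 0.180,
    "Radiation Hybrid": 2.77,
    "Annotations of Genes": 84,
}

total_mb = sum(CATEGORY_MB.values())
sequence_share = CATEGORY_MB["Sequence"] / total_mb

print(f"Total: {total_mb:.2f} MB")
print(f"Sequence maps: {sequence_share:.1%} of the total")
print(f"Everything else: {1 - sequence_share:.1%} of the total")
```

On this reading, the sequence maps account for roughly 97% of the data, and all the non-sequence maps and annotations together for only a few percent.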
The bulk of the data represented by the human genome is found as sequence data,
yet this is the data type which is least comprehensible to a researcher and is
relatively information-poor. On the other hand, the data found in the
non-sequence maps and annotations is readily comprehensible and information-rich.
This latter data is, of course, derived from the human genome sequence data, but
has been "compressed" by analysis. A good strategy for shipping the human genome
would be to maximize the transfer of the most compact, comprehensible data and
minimize the transfer of the least comprehensible, least compact data. This is
the philosophy behind the Human Genome Map Viewer.
And yet, some want it all no matter the cost!
The Human Genome is our favorite,
but there are hundreds of others!
GenBank Release 122 Statistics
80,000 species
10 million DNA sequences
12 billion bases, or characters of information
43 Gigabytes of sequence and annotations
NCBI has been connected to the Internet2 system for over a year: 36% of our web traffic is now via Internet2.
Most of this 36% represents FTP downloads of the GenBank DNA sequence database.
Download Activity
270 Gigabytes/day via FTP
163 Gigabytes/day via Web
400 Gigabytes/day shortly after a new GenBank release
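For a sense of scale, the daily volumes above can be converted into an average sustained transfer rate. This is a rough sketch that assumes 1 Gigabyte = 10^9 bytes and traffic spread evenly over 24 hours; real traffic is burstier, so peak rates would be higher.

```python
SECONDS_PER_DAY = 24 * 60 * 60

def gb_per_day_to_mbit_per_s(gb_per_day: float) -> float:
    """Convert Gigabytes/day into an average rate in megabits per second."""
    # Assumption: decimal units, 1 Gigabyte = 10**9 bytes, 1 Mbit = 10**6 bits.
    bits_per_day = gb_per_day * 10**9 * 8
    return bits_per_day / SECONDS_PER_DAY / 10**6

# Daily volumes quoted in the text.
DAILY_GB = {
    "FTP": 270,
    "Web": 163,
    "After a new GenBank release": 400,
}

rates = {label: gb_per_day_to_mbit_per_s(gb) for label, gb in DAILY_GB.items()}
for label, rate in rates.items():
    print(f"{label}: about {rate:.1f} Mbit/s sustained")
```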
Mirror GenBank FTP site at the San Diego Supercomputer Center: ftp://genbank.sdsc.edu/
Initial Map Viewer Search Screen
HTML Size: 0.008 Megabytes
Scope of data: 5800 Megabytes
Initially, the scope of the data is huge but the amount transferred is negligible.
Initial View of the BRCA2 Gene involved in Breast Cancer
HTML Size: 0.060 Megabytes
Scope of data: 3.0 Megabytes
The scope of the data is much
greater than the amount transferred.
View with Additional Maps Added
HTML Size: 0.139 Megabytes
Scope of Data: 3.001 Megabytes
The scope of the data still exceeds
the amount transferred.
HTML Size: 0.139 Megabytes
Scope of Data: 0.100 Megabytes
At the level of a single gene,
the scope of the data displayed is roughly equal to the data transferred
over the internet.
HTML Size: 0.03 Megabytes
Scope of Data: 0.002 Megabytes
Now, at the sequence level, the
transfer of information becomes inefficient since graphics are used to
depict a rather small amount of textual information.
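The trend across the five views above can be summarized as the ratio of data scope to HTML transferred. The sketch below uses only the sizes quoted in the text; the labels for the last two views are descriptive placeholders, since those views carry no title here.

```python
# (label, HTML size in MB, scope of data in MB), in the order presented.
# The last two labels are placeholders; the source gives no titles for them.
VIEWS = [
    ("Initial Map Viewer search screen", 0.008, 5800),
    ("Initial view of the BRCA2 gene", 0.060, 3.0),
    ("View with additional maps added", 0.139, 3.001),
    ("Single-gene level (untitled)", 0.139, 0.100),
    ("Sequence level (untitled)", 0.03, 0.002),
]

def scope_per_mb_transferred(html_mb: float, scope_mb: float) -> float:
    """Megabytes of underlying data represented per megabyte of HTML sent."""
    return scope_mb / html_mb

ratios = [scope_per_mb_transferred(html, scope) for _, html, scope in VIEWS]

for (label, _, _), ratio in zip(VIEWS, ratios):
    print(f"{label}: {ratio:,.2f} MB of data per MB transferred")
```

A ratio above 1 means the page stands in for more data than it costs to send; a ratio below 1, as in the last two views, means more bytes are sent than the underlying data contains.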