Data Organization and Format
The data for the working draft are organized hierarchically by chromosome and by the
sequenced-clone contigs within each chromosome. At the top level there are 25 folders; 22 of these
are for the numbered chromosomes (autosomes), folders X and Y are for the sex chromosomes, and Un
is for clone contigs that cannot be placed confidently on a chromosome. Each of the 25 chromosomal
folders contains a separate clone contig folder for each of the clone contigs for that
chromosome.
There are two primary files in each clone contig folder; these have suffixes .fa and .agp
respectively. The .fa file gives the working draft sequence for the clone contig. The format is
fasta format:
>NT_077768
GAATTCTCTGTAACACTAAGCTCTCTTCCTCAAAACCAGAGGTAGATAGA
ATGTGTAATAATTTACAGAATTTCTAGACTTCAACGATCTGATTTTTTAA
ATTTATTTTTATTTTTTCAGGTTGAGACTGAGCTAAAGTTAATCTGTGGC
...
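As a rough illustration of reading such a file, the following Python sketch collects each record's sequence lines into one string. The function name read_fasta and the choice to keep only the first word of the header line are assumptions made for this example, not part of the data distribution.

```python
from typing import Dict, Iterable, List


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Return a mapping from record name to its full sequence."""
    parts: Dict[str, List[str]] = {}
    name = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # Header line: keep the first word after ">" as the name
            # (an assumption; headers may carry extra description).
            words = line[1:].split()
            name = words[0] if words else ""
            parts[name] = []
        elif name is not None:
            parts[name].append(line)
    return {n: "".join(chunks) for n, chunks in parts.items()}


# The first two sequence lines of the sample shown above.
example = [
    ">NT_077768",
    "GAATTCTCTGTAACACTAAGCTCTCTTCCTCAAAACCAGAGGTAGATAGA",
    "ATGTGTAATAATTTACAGAATTTCTAGACTTCAACGATCTGATTTTTTAA",
]
contigs = read_fasta(example)
```

For a real file, passing an open file handle (for example `read_fasta(open(path))`, with a path of your choosing) should work the same way, since the function only iterates over lines.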
The .agp file is a kind of index that tells how the .fa file is built. It looks like this:
17/NT_077768 1 6538 1 D AC021317.18 122280 128817 -
17/NT_077768 6539 56206 2 D AC021317.18 128918 178585 -
17/NT_077768 56207 56306 3 N 100 fragment yes
17/NT_077768 56307 117971 4 D AC021317.18 47188 108852 -
17/NT_077768 117972 170563 5 F AC115992.13 23659 76250 +
17/NT_077768 170564 274979 6 D AC124789.11 1 104416 -
...
Each line represents either an actual sequence record or a gap, unless it begins with "#",
in which case it is a comment. If the line represents an actual sequence record then it has the
form:
<chromosome/ctg> <start-in-ctg> <end-in-ctg> <number> <type> <accession>.<version> <start> <end> <orientation>
If it represents a gap it has the form:
<chromosome/ctg> <start-in-ctg> <end-in-ctg> <number> N <number-of-Ns> <kind> <bridged?>
The positions <start-in-ctg> and <end-in-ctg> are the start and end positions where the
sequence is to be put in the .fa file. For a sequence record, the positions <start> and
<end> are the start and end positions of where the sequence came from in the GenBank record
<accession>.<version>. The field <orientation> tells whether the sequence is used as is ("+") or
must be reverse complemented ("-") before it is inserted into its place in the .fa file. For example, the
records above mean that to build the .fa file for clone contig NT_077768 from chromosome 17 you
take:
AC021317 version 18, residues 122280 to 128817, reverse complemented, followed by
AC021317 version 18, residues 128918 to 178585, reverse complemented, followed by
a gap of 100 Ns, followed by
AC021317 version 18, residues 47188 to 108852, reverse complemented, followed by
AC115992 version 13, residues 23659 to 76250, followed by
AC124789 version 11, residues 1 to 104416, reverse complemented, followed by
...
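The assembly procedure just described can be sketched in Python as follows. This is only an illustration under stated assumptions: coordinates are taken to be 1-based and inclusive (consistent with "residues 1 to 104416" covering 104416 bases), the dictionary-based record layout is invented for the example, and the accession TOY0001.1 with its ten-base sequence is toy data, not a real GenBank record.

```python
from typing import Dict, List

# Translation table for complementing bases; N stays N.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Complement every base, then reverse the string."""
    return seq.translate(_COMPLEMENT)[::-1]


def build_contig(records: List[dict], sources: Dict[str, str]) -> str:
    """Concatenate the pieces named by the records, in order."""
    pieces = []
    for rec in records:
        if rec["type"] == "N":
            # Gap record: a run of Ns of the stated size.
            pieces.append("N" * rec["gap_len"])
        else:
            src = sources[rec["accession"]]
            # 1-based inclusive coordinates -> Python slice.
            piece = src[rec["start"] - 1 : rec["end"]]
            if rec["orientation"] == "-":
                piece = reverse_complement(piece)
            pieces.append(piece)
    return "".join(pieces)


# Toy data (hypothetical accession and sequence).
sources = {"TOY0001.1": "AACCGGTTAC"}
records = [
    {"type": "D", "accession": "TOY0001.1", "start": 1, "end": 4, "orientation": "-"},
    {"type": "N", "gap_len": 3},
    {"type": "F", "accession": "TOY0001.1", "start": 7, "end": 10, "orientation": "+"},
]
contig = build_contig(records, sources)
print(contig)  # GGTTNNNTTAC
```

Because each piece is appended directly after the previous one, the output has no padding or overlap between pieces.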
The joins perfectly abut. In a sequence record, <type> can have the values:
- F - Finished
- A - in Active finishing
- D - Draft
- P - PreDraft
- O - Other sequence
In a gap record the type is always N. The <number> field sequentially numbers the records.
In a gap record, <number-of-Ns> is the size of the gap and <kind> is:
- fragment - a gap between two sequence contigs (also called a "sequence gap")
- split_finished - a special sized gap between two finished sequence contigs
- clone - a gap between two clones that do not overlap
- contig - a gap between clone contigs in the genome layout (also called a "layout gap")
- centromere - a gap inserted for the centromere
- short_arm - a gap inserted at the start of an acrocentric chromosome
- heterochromatin - a gap inserted for an especially large region of heterochromatin (may include
the centromere as well)
- telomere - a gap inserted for a telomere
The <bridged?> value is "yes" if there is a cDNA, BAC end pair, or plasmid end
pair that spans the gap; otherwise it is "no".
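To make the two record layouts concrete, here is a minimal Python sketch that parses .agp lines into sequence and gap records and checks that consecutive records abut and that each record's span in the contig matches its size. The class and function names (SequenceRecord, GapRecord, parse_agp_line, check_records) are hypothetical, and splitting on any whitespace is an assumption about the field separator. The data lines are the sample records shown earlier; the leading "#" line is added only to exercise comment handling.

```python
from dataclasses import dataclass
from typing import List, Optional, Union


@dataclass
class SequenceRecord:
    contig: str       # <chromosome/ctg>
    ctg_start: int
    ctg_end: int
    number: int
    type: str         # F, A, D, P or O
    accession: str
    version: int
    start: int
    end: int
    orientation: str  # "+" or "-"


@dataclass
class GapRecord:
    contig: str
    ctg_start: int
    ctg_end: int
    number: int
    length: int       # <number-of-Ns>
    kind: str         # fragment, clone, contig, ...
    bridged: bool     # True when <bridged?> is "yes"


Record = Union[SequenceRecord, GapRecord]


def parse_agp_line(line: str) -> Optional[Record]:
    """Parse one line; return None for comments and blank lines."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    f = line.split()
    contig, ctg_start, ctg_end, number = f[0], int(f[1]), int(f[2]), int(f[3])
    if f[4] == "N":
        return GapRecord(contig, ctg_start, ctg_end, number,
                         int(f[5]), f[6], f[7] == "yes")
    accession, version = f[5].rsplit(".", 1)
    return SequenceRecord(contig, ctg_start, ctg_end, number, f[4],
                          accession, int(version), int(f[6]), int(f[7]), f[8])


def check_records(records: List[Record]) -> List[str]:
    """Report records whose contig span or placement looks inconsistent."""
    problems = []
    previous_end = None
    for rec in records:
        span = rec.ctg_end - rec.ctg_start + 1
        size = rec.length if isinstance(rec, GapRecord) else rec.end - rec.start + 1
        if span != size:
            problems.append(f"record {rec.number}: span {span} != size {size}")
        if previous_end is not None and rec.ctg_start != previous_end + 1:
            problems.append(f"record {rec.number}: does not abut previous record")
        previous_end = rec.ctg_end
    return problems


sample = """\
# comment lines are skipped
17/NT_077768 1 6538 1 D AC021317.18 122280 128817 -
17/NT_077768 6539 56206 2 D AC021317.18 128918 178585 -
17/NT_077768 56207 56306 3 N 100 fragment yes
17/NT_077768 56307 117971 4 D AC021317.18 47188 108852 -
17/NT_077768 117972 170563 5 F AC115992.13 23659 76250 +
17/NT_077768 170564 274979 6 D AC124789.11 1 104416 -
"""
records = [r for r in map(parse_agp_line, sample.splitlines()) if r is not None]
problems = check_records(records)
```

Keeping sequence and gap records as separate types means code that consumes them cannot accidentally read an accession from a gap; whether that separation is worth it for a given script is a matter of taste.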
We provide three ways you can download these .fa and .agp files:
- full data set: the entire hierarchy in a zipped format
- by chromosome: one zipped file for each chromosome containing all the sequence ordered along that
chromosome
- by individual clone contig: separate files, not zipped, for each clone contig