Materials and methods
How is the database constructed?
Most of the data in the MitoAge database is computed offline by a series of automated scripts.
For a given set of RefSeq sequences, these scripts pre-compute compositional features for the total mitochondrial genome,
as well as for the individual coding and non-coding mitochondrial regions. The generated CSV outputs are then checked for errors,
potential inconsistencies with the NCBI RefSeq database, and other issues. This is done in the first instance using validation
scripts; if flags are raised, the data is then analyzed manually. If inconsistencies in the original data are found, they are
fixed and reported to the authors of the relevant database (NCBI, ITIS, AnAge, etc.). Upon successful validation, the data is uploaded
into the MitoAge database and other statistical metrics are computed.
Data is analyzed as follows:
For each protein-coding gene and for the total protein-coding region, both base composition and codon usage are computed.
For total mtDNA, the D-loop region, total tRNA-coding regions, total rRNA-coding regions, and each of the rRNA genes, only base composition is analyzed.
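
The two kinds of analysis named above can be illustrated with a minimal Python sketch. It is not the MitoAge scripts themselves: the function names `base_counts`, `base_fractions` and `codon_usage` are hypothetical, and the sketch assumes the input is already the sequence to be analyzed, written 5' to 3', with codons read in frame from the first base.

```python
from collections import Counter


def base_counts(seq):
    """Count A, C, G and T in a DNA sequence, ignoring any other symbols (e.g. N)."""
    return Counter(base for base in seq.upper() if base in "ACGT")


def base_fractions(seq):
    """Return the fraction of each base among the A/C/G/T positions."""
    counts = base_counts(seq)
    total = sum(counts.values())
    return {base: (counts[base] / total if total else 0.0) for base in "ACGT"}


def codon_usage(seq):
    """Count codons in frame from the first base, after replacing T with U.

    Assumption: a trailing partial codon (1 or 2 bases) is ignored.
    """
    rna = seq.upper().replace("T", "U")
    usable = len(rna) - len(rna) % 3
    return Counter(rna[i:i + 3] for i in range(0, usable, 3))
```

For example, `codon_usage("ATGAAAATGTA")` counts `AUG` twice and `AAA` once, and drops the trailing `TA`.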
How is the data calculated?
- The sequences for all the species in the database are taken from the RefSeq database at NCBI.
- The maximum lifespan records are taken from the AnAge database.
- Full taxonomy data is taken from the ITIS database.
- The mtDNA sequence is taken from the Heavy strand.
- D-loop is taken from the Heavy strand when it is labeled as D-loop in NCBI.
- For computing base composition, for each gene (protein-/rRNA-/tRNA-coding gene), we use the heavy strand (according to
NCBI GenBank, if the gene position says "Complement", it is on the light strand; else, it is on the heavy strand).
- For computing codon usage, data is taken from the complement of the coding strand and T is replaced with U (i.e. the mRNA sequence).
- When combining multiple genes (e.g. when computing the total protein-coding region), we append one gene to another. Where genes overlap,
the overlapping sequence is counted only once for the computation of base composition, and twice for the computation of codon usage.
- Sequences are always analyzed from 5' to 3' (both for DNA and RNA).
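
The overlap rule for combined genes can be sketched as follows. This is an illustrative sketch under stated assumptions, not the actual MitoAge implementation: gene positions are given as 1-based inclusive (start, end) pairs in the style of GenBank coordinates, all genes are assumed to lie on the same strand as the supplied sequence, and the names `merge_intervals`, `combined_base_counts` and `combined_codon_counts` are hypothetical.

```python
from collections import Counter


def merge_intervals(intervals):
    """Merge overlapping 1-based inclusive (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps the previous interval: extend it instead of adding a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def combined_base_counts(genome, intervals):
    """Base composition over several genes; overlapping positions count once."""
    counts = Counter()
    for start, end in merge_intervals(intervals):
        counts.update(genome[start - 1:end])
    return counts


def combined_codon_counts(genome, intervals):
    """Codon usage over several genes; each gene is read in its own frame,
    so bases shared by two genes contribute to the codons of both."""
    counts = Counter()
    for start, end in intervals:
        gene = genome[start - 1:end].replace("T", "U")
        usable = len(gene) - len(gene) % 3  # ignore a trailing partial codon
        counts.update(gene[i:i + 3] for i in range(0, usable, 3))
    return counts
```

With a toy sequence `"ATGAAATAGCCC"` and genes at (1, 9) and (7, 12), positions 7 to 9 are shared: they enter the base counts once, while the codon `UAG` is counted for both genes.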