Detecting the molecular scars of evolution in the Mycobacterium tuberculosis complex by analyzing interrupted coding sequences


Computer-assisted analyses have shown that all bacterial genomes contain a small percentage of open reading frames with a frameshift or in-frame stop codon. We report here a comparative analysis of these interrupted coding sequences (ICDSs) in six isolates of M. tuberculosis, two of M. bovis and one of M. africanum and question their phenotypic impact and evolutionary significance.


ICDSs were classified as "common to all strains" or "strain-specific". Common ICDSs are believed to result from mutations acquired before the divergence of the species, whereas strain-specific ICDSs were acquired after this divergence. Comparative analyses of these ICDSs therefore define the molecular signature of a particular strain, phylogenetic lineage or species, which may be useful for inferring phenotypic traits such as virulence and molecular relationships. For instance, in silico analysis shows that the W-Beijing lineage of M. tuberculosis, an emergent family involved in several outbreaks, is readily distinguishable from other phyla by its smaller number of common ICDSs, including at least one known to be associated with virulence. Our observation was confirmed through the sequencing analysis of ICDSs in a panel of 21 clinical M. tuberculosis strains. This analysis further illustrates the divergence of the W-Beijing lineage from other phyla in terms of the number of full-length ORFs not containing a frameshift. We further show that ICDS formation is not associated with the presence of a mutated promoter, and suggest that promoter extinction is not the main cause of pseudogene formation.


The correlation between ICDSs, function and phenotypes could have important evolutionary implications. This study provides population geneticists with a list of targets, which could undergo selective pressure and thus alter relationships between the various lineages of M. tuberculosis strains and their host. This approach could be applied to any closely related bacterial strains or species for which several genome sequences are available.


Recent in silico surveys showed that most bacterial genomes contain interrupted coding sequences (ICDSs) [1-3]. These ICDSs generally result from the insertion or deletion of nucleotides, affecting the reading frame and splitting the original coding sequence into two or more smaller open reading frames. These mutations may also result in a shift in reading frame, thereby altering the carboxy-terminus of the protein. ICDSs may be present in genes with known or unknown functions, or in hypothetical open reading frames [4]. Reported prokaryotic genomes have a mean of 74 ICDSs per genome, corresponding to 1 to 5% of the genes present, irrespective of genome size or GC content [2, 3]. One of the few exceptions is the genome of M. leprae, which contains about 30% ICDSs, frequently described as pseudogenes [2, 5]. The accumulation of mutations in this species is thought to be due to the loss of the proofreading activity of the DnaQ subunit of DNA polymerase III [6]. A similar sort of reductive evolution is also observed in the case of M. ulcerans [7] or for species of the order Rickettsiales [8]. ICDSs may correspond to authentic mutations, generally resulting in a loss of function, but may in some cases reflect sequencing errors. These sequencing errors are misleading when conducting genomic analysis, but have been shown to account for only some of the detected ICDSs [4, 9-12]. Most ICDSs correspond to authentic mutations and can therefore be compared between strains, making it possible to explore conserved and unique mutation events.

The availability of complete genome sequences for genetically related organisms has facilitated comparative analyses of ICDSs. This simple concept, which has not been reported before, makes it possible to investigate evolutionary relationships between isolates or species. In this study, we took the finished genomes of two mycobacterial species as a model: M. tuberculosis, which causes tuberculosis in humans, and M. bovis, which principally causes tuberculosis in ruminants. We also studied six phylogenetically distinct isolates of M. tuberculosis (H37Rv, CDC1551, Haarlem, F11, C [13], and 210, a representative of the W-Beijing family) and M. africanum, a species of the M. tuberculosis complex for which the genome sequence is still at the assembly step. These isolates differ from each other as they belong to distinct evolutionary branches of the M. tuberculosis species, sensu stricto (s.s), yet are more closely related to each other than to the more distantly related members of the M. tuberculosis complex (M. africanum, M. bovis, M. microti and M. pinnipedii) [14]. The W-Beijing family is a clonal group of highly successful M. tuberculosis strains associated with multiple outbreaks [15]. This family is one of the oldest lineages to diverge, as determined by single nucleotide polymorphism (SNP) and region of deletion analysis [14]. In contrast, H37Rv, the first M. tuberculosis strain to be completely sequenced, is believed to be one of the most recent (youngest) lineages of M. tuberculosis [14, 16]. Strain CDC1551 belongs to a lineage that branched between the W-Beijing and the H37Rv isolates. Overall, these three isolates represent three different genetic groups of the species [14-17]. These isolates have been studied in detail and display differences in genotype [14, 18], phenotype and virulence properties [19, 20].
By comparing the open reading frames containing frameshifts in these organisms, we showed that ICDSs could be classified as "common to all strains" or "lineage- or strain-specific". The common ICDSs probably correspond to mutations occurring before the divergence of the isolates, whereas lineage- or strain-specific ICDSs correspond to more recently acquired mutations. Thus, ICDS investigation can be used to characterize the molecular scars of evolutionary relationships between organisms and may well provide a unique molecular signature for a particular strain or species, complementary to single nucleotide polymorphism (SNP) and other molecular marker analyses for the characterization of strain variation [18, 21]. We also show that ICDS formation is not associated with mutation in the promoter region. The present data suggest that promoter extinction is not a major event in the "pseudogenization" process. To demonstrate experimentally that ICDS comparison is a powerful phylogenomic tool, we analyzed 21 clinical M. tuberculosis isolates for their ICDS content. We showed that the W-Beijing lineage differs from the other TB phyla by a lower number of common ICDSs, confirming its early divergence from M. tuberculosis s.s strains. ICDS characterization, in addition to phylogenetic investigations or typing, can be used to select strains or phenotypes for studies of particular phenotypic characters, such as virulence. Indeed, as frameshift acquisition may lead to a loss of function, researchers should consider the possible presence of ICDSs before choosing a strain or species for investigating a particular phenotype.


Detecting the molecular scars of evolution in M. tuberculosis and in M. bovis

Comparative analyses of frameshift-containing genes require the complete genome sequences of closely related organisms. The TB complex, which includes two recently sequenced species and at least 6 accessible strains, is therefore a highly suitable model. We investigated ICDSs in M. tuberculosis and in M. bovis. The genome sequence of M. tuberculosis H37Rv has been available since 1998 and has recently been re-annotated [22, 23]. The genome sequences of M. tuberculosis strain CDC1551 and M. bovis have been characterized independently [18, 24]. The great advantage of studying this model system is that the evolution of these two species and the phylogenetic links between them are well documented [25]. The M. tuberculosis genomes (CDC1551 and H37Rv) have nucleotide sequences more than 99.95% identical to that of M. bovis [18, 24]. The three genomes were screened for the presence of ICDSs. To this end, the genomic sequences of each predicted ICDS [3] were extracted for each strain or species and compared between them. Each common or specific ICDS was then analyzed manually to characterize the molecular event leading to the detected frameshift. The genome of H37Rv contains 113 ICDSs, whereas CDC1551 has 137 ICDSs and M. bovis has 134 ICDSs, corresponding to about 2% of the total coding sequences [3]. These organisms have similar numbers of ICDSs, but the alterations do not always affect the same genes. We therefore investigated whether some of these ICDSs were common to all three organisms. We compared the nucleotide and deduced amino-acid sequences of each frameshift-containing open reading frame in the three organisms. We found that 81 of the frameshift-containing genes were common to all three strains (Figure 1A, Table 1), and were identical at the molecular level. The proteins affected by these frameshifts included proteins of unknown function as well as annotated and/or characterized proteins (Table 1).
The fact that these three mycobacterial genomes were sequenced and assembled independently suggests that these 81 common ICDSs correspond to authentic frameshift-containing genes rather than sequencing errors. These results indicate that these 81 ICDSs correspond to frameshifts acquired before the splitting of the M. tuberculosis and M. bovis species (Table 1). Alternatively, the same 81 genetic mutations may result from convergent evolution and hence have occurred independently in all three genomes, a highly unlikely scenario.

Figure 1

A- Schematic representation of the ICDSs common to M. tuberculosis H37Rv, CDC1551 and M. bovis AF2122/97 or specific to one of these strains. The total number of ICDSs is indicated. B- Schematic representation of the ICDSs of M. bovis BCG 1173P2 compared to the other analyzed strains.

Table 1 List of the 81 ICDSs common to M. tuberculosis H37Rv, CDC1551, M. bovis AF2122/97 and M. africanum GM041182.

The two M. tuberculosis s.s strains were found to have 19 additional common ICDSs, raising their total number to 100 (Figure 1A, Table 2). This suggests that the 19 additional mutations common to these two strains but not to M. bovis were acquired after the divergence of M. tuberculosis and M. bovis. One ICDS in M. bovis (ICDS0046, Mb1789c-Mb1790c) was present in M. tuberculosis CDC1551 (ICDS0057, MT1807) but not in M. tuberculosis H37Rv (Rv1759c). This mutation (deletion of one G) was identical in the M. bovis and M. tuberculosis CDC1551 strains, but an additional mutation was present close to this mutation in the M. bovis genome. One ICDS in M. bovis (ICDS0128, Mb3813-Mb3814) was also present in M. tuberculosis H37Rv (ICDS0118, Rv3784-Rv3785) but not in M. tuberculosis CDC1551 (MT3893) (Table 2).

Table 2 List of the 19 ICDSs common to M. tuberculosis H37Rv and CDC1551, the ICDSs common to M. tuberculosis H37Rv and M. bovis AF2122/97 and the ICDSs common to M. tuberculosis CDC1551 and M. bovis AF2122/97

The availability of genomic resources for M. tuberculosis is increasing exponentially. This enabled us to investigate the presence or absence of these shared ICDSs in the Haarlem, F11, and C strains, the genomic sequences of which are currently at the assembly stage at the Broad Institute [26]. As the sequencing of these genomes is in progress, the total number of frameshift-containing genes in these genomes cannot yet be accurately determined; nonetheless, it is possible to check whether the 81 ICDSs present in M. bovis and in other M. tuberculosis strains are present in these strains. All 81 ICDSs common to all three strains previously tested were also present in the Haarlem and F11 strains, while 79 were present in the C strain (the ORFs corresponding to H37Rv ICDS0103 and ICDS0105 were full-length in this strain) (see Additional file 1). Noteworthy was the identification of additional mutations in the vicinity (≤ 200 bp) of the original frameshift (see Additional file 1). We next investigated whether the 19 ICDSs common to all M. tuberculosis s.s strains were present in the other clinical isolates. In each case, the ICDSs were also present in the three strains (Haarlem, F11, and C), but accompanied, in some cases, by additional mutations in the flanking region (see Additional file 1). Thus, 98 frameshift-containing genes were found to be conserved in all five M. tuberculosis strains analyzed.

The recently published M. bovis BCG genome sequence is of particular interest in this respect [27]. This strain, which is currently used for vaccination in humans, was derived from M. bovis after 13 years of repetitive passages in vitro [28]. A number of genetic differences, such as deletions and duplications, had already been identified in the BCG strain [29, 30], but large amounts of additional information have now been obtained from its genome sequence. According to our investigation, M. bovis BCG 1173P2 contains 127 ICDSs in total, 9 of which are strain-specific (Figure 1B). The 81 ICDSs common to the 3 other isolates are also present in this strain (Table 1) and 35 ICDSs are common to the M. bovis strain. We detected frameshift-containing genes in M. bovis AF2122/97 that corresponded to full-length ORFs in M. bovis BCG 1173P2, suggesting that this M. bovis strain is not the direct progenitor of the BCG vaccine (see Additional file 2).

Strain-specific ICDSs reflect newly acquired mutations and are a useful phylogenetic tool

Eighty-one ICDSs were common to all three strains, but some were specific to one strain only: 12 for M. tuberculosis H37Rv (see Additional file 3), 36 for CDC1551 (see Additional file 4) and 51 for M. bovis (see Additional file 2, Figure 1A). The proportion of ICDSs that were strain-specific was highly variable. These ICDSs accounted for 10% of all ICDSs in H37Rv, 26% in CDC1551 and 38% in M. bovis. The much larger proportion of strain-specific ICDSs in CDC1551 than in H37Rv is surprising, and we currently have no definitive explanation for this phenomenon. A plausible hypothesis is that the genome of CDC1551, unlike that of H37Rv, has not been re-sequenced [22, 28]. Strain-specific frameshift-containing genes most likely correspond to mutations acquired after the divergence of these strains. Like the common ICDSs, these events affected genes from several classes, including "unknown or hypothetical ORFs", "intermediary metabolism" and "cell wall, process" (Additional files 2, 3 and 4). As stated above, a few of these strain-specific ICDSs may correspond to errors introduced during the sequencing procedure [4, 11], but such errors would nonetheless have only a slight effect on the overall outcome of the comparative analysis.
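The counts reported so far can be cross-checked with simple arithmetic. The sketch below uses only numbers stated in the text (81 common ICDSs, 19 shared by the two M. tuberculosis strains only, one ICDS shared by each M. tuberculosis strain with M. bovis only, and the strain-specific counts) to rebuild each genome's total and its strain-specific proportion. The variable names and data layout are illustrative assumptions, not part of the original analysis.

```python
# Counts taken from the text; the data layout is an illustrative assumption.
COMMON_ALL = 81         # ICDSs shared by H37Rv, CDC1551 and M. bovis
SHARED_MTB = 19         # shared by H37Rv and CDC1551 but not M. bovis
SHARED_WITH_BOVIS = 1   # each M. tuberculosis strain shares one ICDS with M. bovis only

strain_specific = {"H37Rv": 12, "CDC1551": 36, "M. bovis": 51}
reported_total = {"H37Rv": 113, "CDC1551": 137, "M. bovis": 134}


def reconstructed_total(strain: str) -> int:
    """Sum the Venn-diagram compartments that contribute to one genome."""
    if strain == "M. bovis":
        # M. bovis shares one ICDS with H37Rv only and one with CDC1551 only.
        pairwise = 2 * SHARED_WITH_BOVIS
    else:
        pairwise = SHARED_MTB + SHARED_WITH_BOVIS
    return COMMON_ALL + pairwise + strain_specific[strain]


def specific_fraction(strain: str) -> float:
    """Fraction of a genome's ICDSs that are strain-specific."""
    return strain_specific[strain] / reported_total[strain]


for name in reported_total:
    print(name, reconstructed_total(name), f"{100 * specific_fraction(name):.1f}%")
```

Under these assumptions, the compartments add up to the reported totals of 113, 137 and 134 ICDSs, which is a quick way to confirm that the figures quoted in the text are mutually consistent.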

This study shows that the genome sequence of M. tuberculosis contains ICDSs that have been acquired during the evolution of this species. The pool of ICDSs can be classified into ICDSs common to a set of strains or species and ICDSs specific to a particular strain-lineage or strain, revealing genetic differences between strains or species.
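The classification described here amounts to set operations on per-strain ICDS lists. The following minimal sketch assumes that ICDSs have already been matched across strains and given a shared identifier (in the study this matching relied on sequence comparison and manual inspection); the function name and the toy identifiers are hypothetical.

```python
def classify_icdss(icdss_by_strain: dict[str, set[str]]) -> dict[str, set[str]]:
    """Split ICDS identifiers into a 'common' set and per-strain specific sets.

    An ICDS is 'common' if it occurs in every strain, and strain-specific if it
    occurs in exactly one strain. ICDSs shared by only some strains (lineage
    level) fall into neither category here.
    """
    strains = list(icdss_by_strain)
    result = {"common": set.intersection(*icdss_by_strain.values())}
    for strain in strains:
        others = set().union(*(icdss_by_strain[o] for o in strains if o != strain))
        result[strain] = icdss_by_strain[strain] - others
    return result


# Toy data with hypothetical identifiers, not real ICDS names.
toy = {
    "H37Rv": {"g1", "g2", "g3"},
    "CDC1551": {"g1", "g2", "g4"},
    "M. bovis": {"g1", "g5"},
}
groups = classify_icdss(toy)
print(groups)
```

In this toy example, g2 is shared by two of the three strains only, so it would count as a lineage-level ICDS rather than as common or strain-specific.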

Using ICDS comparisons to type W-Beijing strains and other M. tuberculosis lineages

W-Beijing is a lineage of M. tuberculosis that has attracted considerable attention. Indeed, strains of this lineage have been implicated in severe outbreaks and have been shown to have different genetic and phenotypic properties [20, 21, 31]. The genome of a strain of the W-Beijing family (strain 210) is currently being sequenced but is not yet fully assembled; nevertheless, it can be consulted in homology searches. Consequently, the total number of frameshift-containing genes in this strain and the full characterization of its specific ICDSs remain elusive. It is, however, possible to screen for the presence of ICDSs in this strain.

We first investigated whether the 81 frameshift-containing genes common to all strains were also present in the genome of strain 210. All 81 of these genes also contained the same frameshift in strain 210, in agreement with the data described above. This suggests that these 81 frameshift mutations were acquired before the divergence of strain 210 from these other strains. We then investigated the 19 genes containing frameshifts common to the five strains of M. tuberculosis (H37Rv, CDC1551, Haarlem, F11, C) but not to M. bovis. We found that eight of these 19 genes contain no frameshift in strain 210, and hence corresponded to full-length ORFs (Table 2). Three genes contained frameshifts corresponding to those observed in strains CDC1551, H37Rv, Haarlem, F11 and C, but also contained additional mutations in the corresponding flanks (≤ 200 bp) of the original frameshift (Table 2). The remaining 11 ICDSs corresponded to frameshift-containing genes common to all six TB strains examined (CDC1551, H37Rv, Haarlem, F11, C, 210) and the events were identical at the molecular level. Thus, the 19 frameshift-containing genes in the two TB strains (CDC1551 and H37Rv) displayed polymorphism in strain 210 and 11 of these identified ICDSs were common to all six TB strains examined. Some of these ICDSs display no further mutation (the gene contains the frameshift alone), whereas others have acquired additional mutations, contributing to the "pseudogenization" process (data not shown).

We then investigated the eight ICDSs showing polymorphism in M. tuberculosis in 21 strains of the W-Beijing lineage from several phylogenetic groups (Table 3). The eight loci were amplified by PCR and sequenced, and the nucleotide sequences were compared with those of strains 210 and H37Rv. In all W-Beijing strains tested, the eight genes were full-length, with sequences 100% identical to those in strain 210, except for ICDS0085, where a non-disruptive SNP is present in the region. The W-Beijing lineage is therefore a genetically homogeneous group with fewer ICDSs in common with other TB strains.

Table 3 Analysis in 21 W-Beijing isolates of the 8 ICDSs of H37Rv strain corresponding to full-length ORFs in W-Beijing strain 210.

To extend our analysis, we investigated the M. africanum strain, which is currently being sequenced at the Sanger centre. Like that of M. tuberculosis strain 210, the M. africanum genome is still at the assembly step, but can nevertheless be consulted online. We investigated whether the 81 frameshift-containing genes common to all strains tested were also present in the M. africanum strain (Table 1). All 81 of these genes also contained a frameshift in M. africanum, which suggests that these mutations were acquired before the divergence of the M. tuberculosis complex. We then investigated the 19 genes containing frameshifts common to the 5 M. tuberculosis strains (CDC1551, H37Rv, Haarlem, F11, C). We found that 15 of these 19 genes lacked the frameshift in M. africanum and corresponded to full-length ORFs in this strain (Table 2). Eight of these 15 genes match the wild-type ORFs identified in M. tuberculosis strain 210 and other strains of the W-Beijing lineage. In conclusion, the genome of M. africanum has fewer ICDSs in common with the other TB isolates (CDC1551, H37Rv, Haarlem, F11, C) than the W-Beijing strain does, and seems genetically closer to this lineage.

ICDS formation is not correlated with mutation in the promoter region

It has been suggested that pseudogene formation is associated with mutations in the upstream untranslated region, abolishing pseudogene expression to prevent a loss of metabolic function [32]. Once turned off, the gene continues to accumulate mutations, leading to complete pseudogene formation. ICDSs are not pseudogenes in the strict sense of the word. Indeed, the ORF is split into only two or three unframed fragments and can, in theory, revert to a wild-type allele. ICDSs are therefore considered to be ORFs undergoing "pseudogenization" rather than pseudogenes per se. Strain-specific ICDSs are, by definition, genes that are mutated in one strain, but not in another. We therefore investigated whether ICDS formation was correlated with mutation in the promoter region. All the intergenic regions (99) located upstream from strain-specific ICDSs of M. tuberculosis H37Rv, CDC1551 and M. bovis were compared with the corresponding region in the two strains having a wild-type gene. As a control, we used the promoter regions of randomly selected genes that are full-length in these 3 strains. We compared the level of differences observed in the promoter regions of full-length genes and of genes containing a frameshift. Nucleotide differences were observed in 27% of the upstream regions of genes containing a frameshift (see Additional file 5A) and in 20% of those of full-length genes (see Additional file 5B), a difference that is not statistically significant by the chi-square test. In all but 6 cases for ICDSs and 2 cases for full-length genes, the difference in the upstream region was limited to one or two SNPs.
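The comparison of the two proportions can be illustrated with a chi-square test on a 2 × 2 table. The sketch below is only an illustration: the text gives 99 upstream regions for ICDSs (27% of which is about 27 regions), but the number of control genes is not stated, so the control sample size of 100 used here is a hypothetical value, and the exact test variant used in the study (for example, with or without continuity correction) is likewise an assumption.

```python
import math


def chi_square_2x2(a: int, b: int, c: int, d: int) -> tuple[float, float]:
    """Pearson chi-square test (no continuity correction) for [[a, b], [c, d]].

    Returns (statistic, p_value) for 1 degree of freedom.
    """
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # With 1 degree of freedom, the chi-square survival function
    # equals erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# About 27 of the 99 ICDS upstream regions differ (27%, from the text).
icds_diff, icds_same = 27, 72
# HYPOTHETICAL control sample: 100 full-length genes, 20% with differences.
ctrl_diff, ctrl_same = 20, 80

stat, p_value = chi_square_2x2(icds_diff, icds_same, ctrl_diff, ctrl_same)
print(f"chi2 = {stat:.2f}, p = {p_value:.2f}")
```

With these assumed counts the p-value is well above 0.05, in line with the conclusion that the difference between 27% and 20% is not statistically significant; a different control sample size would change the exact value.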

We therefore conclude that ICDS formation is not correlated with mutation in the untranslated upstream region and suggest that either promoter mutations do not play a major role in pseudogene formation in the M. tuberculosis complex or that "pseudogenization" is recent.


The presence of frameshift-containing genes in bacterial genomes is well documented [1-3, 33]. A few species can bypass such frameshifts, but most do not, generally resulting in a loss of function.

We show here that ICDSs can be classified as "common to all strains" or "strain-specific". The ICDSs common to all strains probably correspond to mutations acquired before the divergence of the strains, whereas strain-specific ICDSs correspond to those acquired subsequently (Figure 2). Mutations acquired after the speciation of M. tuberculosis from M. bovis were also detected. We identified 19 ICDSs common to the five M. tuberculosis strains (H37Rv, CDC1551, Haarlem, F11 and C) but not to M. bovis, about one-fifth of the ICDSs common to all strains. Comparative analyses of ICDSs help to characterize the phylogenetic relationships between highly related strains and species (Figure 2) and could be applied to any bacterial species for which several genome sequences are available. In a few cases, ICDSs may correspond to fusion/fission of orthologous genes in other genomes. The detection of this kind of event is a consequence of the ICDS identification method, but remains a minor inconvenience [3]. It is, however, possible that a low percentage of specific ICDSs correspond to sequencing errors, thus inducing artifactual phylogenetic relationships. Researchers should resequence these regions before assuming that an ICDS corresponds to a frameshift acquisition. Several studies have compared the genome sequences of M. tuberculosis CDC1551 and H37Rv, using high-resolution genomics techniques [18]. This has led to the definition of regions containing large-sequence polymorphisms (LSPs, greater than 10 bp) and single nucleotide polymorphisms (SNPs). The SNPs have been investigated in more detail in various clinical isolates, to draw up a global phylogeny of M. tuberculosis [17]. Other molecular methods, such as analyses of the deleted regions (deligotyping), variable numbers of tandem repeats (VNTR), mycobacterial interspersed repetitive units (MIRU) and spoligotyping, have helped to unravel global genomic sequence diversity in this species [34-36].
These techniques are highly useful for epidemiological studies, but so far provide little information pertaining to genetic differences in terms of putative function. In contrast, studies of regions of deletion (RD) have proved useful for both global phylogeny and the study of loss of phenotype in both M. tuberculosis and M. ulcerans [25, 30, 37].

Figure 2

Hypothetical phylogenetic links assessed by comparative analyses of ICDSs. In this schematic representation, the common ancestor gave rise to several branches of strains of the TB complex. Eighty-one frameshifts were acquired during the common evolution of M. bovis and M. tuberculosis. Since the separation of these species, M. bovis has acquired 51 frameshifts, while the branch leading to M. tuberculosis isolates has acquired 19 new frameshifts. Since separation of the isolates, M. tuberculosis H37Rv has acquired 12 new frameshifts and CDC1551 36 new frameshifts. Common and unique ICDSs are shown in dark and light gray, respectively. "*" these 8 ICDSs correspond to full-length ORF in M. tuberculosis 210 and in M. africanum GM041182. "**" 7 out of these 11 ICDSs correspond to full-length ORF in M. africanum GM041182 (Table 2).

Frameshift acquisition generally leads to a loss of function, as shown in a number of published studies. Loss of function associated with the presence of a frameshift has been reported in both M. tuberculosis and M. bovis. For instance, ICDS0066 in M. tuberculosis H37Rv corresponds to a frameshift-containing gene encoding a polyketide synthase (pks1). This pks1 gene also contains a frameshift in M. tuberculosis CDC1551, resulting in two different ORFs: pks1 and pks15. In contrast, M. bovis and M. leprae carry a full-length functional pks1 gene [38]. The pks15/1 gene is now frequently used as a marker in epidemiological studies [39, 40] and, interestingly, the pks gene contains no frameshift in the W-Beijing strains of M. tuberculosis [40], resulting in phenolglycolipid production in most cases [41]. Our analysis shows that the pks gene of M. africanum is also full-length, suggesting that this species produces PGL. This observation suggests that these early strains are more closely related to M. bovis or to the last ancestor than other M. tuberculosis strains. Similarly, ICDS0067 in M. bovis corresponds to a putative frameshift-containing glycosyltransferase gene. The ortholog of this gene has no frameshift in the two strains of M. tuberculosis (Rv2958c and MT3034). Functional complementation of M. bovis BCG with the Rv2958c gene from M. tuberculosis leads to the accumulation of a new metabolite, the diglycosylated phenolglycolipid [42]. Some frameshift-containing genes have been studied experimentally in M. tuberculosis, without considering the possibility that these ORFs may well contain a frameshift [43, 44]. Mutation by homologous recombination has been achieved at the mntH and mmpL13 loci. In both cases, no detectable phenotype was associated with the mutation. Our data indicate that MmpL13 function should be investigated in a W-Beijing strain or in M. africanum. Another example that has not yet been studied is the pks3 and pks4 genes of M. tuberculosis H37Rv, which constitute a single ORF in CDC1551 and in M. bovis. This suggests that, like the pks1 and pks15 genes, which are pseudogenes in M. tuberculosis, the pks3 and pks4 genes are probably not functional in the H37Rv strain. It would therefore be pointless to investigate function in the H37Rv strain by creating mutants in the pks3 and pks4 genes or by expressing constructs encoding the corresponding polypeptides. These examples from previous publications illustrate the major biological impact of frameshift acquisition. They demonstrate the importance of choosing the right strain or species for investigations of the function of a particular gene. However, it is not always possible to infer from the position of the frameshift whether the protein's activity will be affected. For instance, GlnA3, a glutamine synthetase generated from a frameshift-containing gene (Table 1), has been purified and shown to retain some activity [45]. It would be interesting to reframe these ORFs to test the impact of the frameshift on protein function. On the other hand, it has been shown in silico that protein-coding sequences can be tolerant of frameshift translation events and thus that frameshift acquisition is an important reservoir for creating novel proteins [46]. Several of the truncated ORFs described here have also been detected in other studies, based on different analyses [17, 18, 40, 47, 48]. However, we present here a comprehensive comparative analysis of three related mycobacterial species and nine strains at the ICDS level.

We found no association between ICDS formation and mutation in the promoter region of the corresponding ORF. This suggests that promoter mutation and inactivation of gene expression are not the principal source of ICDS formation and hence of pseudogene accumulation in the M. tuberculosis complex. It may also suggest that ICDS formation in these species is a recent process. We favor the hypothesis that ORFs are first split into two or three parts, inactivating their function, and are then subject to secondary mutation (in both the ICDS and the untranslated region), leading to irreversible pseudogene fixation. Consistent with this hypothesis, we have observed additional mutations in the vicinity of the original frameshift in some strains.

We have shown that ICDS investigation can be used to infer the evolutionary relationships between strains and species. We provide here a list of more than 150 ICDSs that may be useful for characterizing TB strains and inferring phylogenetic relationships. The genome sequences of more than 10 TB strains will be released in the near future [26], and will no doubt reveal some new common and strain-specific ICDSs. Strain typing should clearly combine various markers, such as SNPs, MIRU, LSPs, RD, PE polymorphism [49] and ICDSs, in a matrix-based comparison from which the global phylogeny of TB isolates may be deduced. The polymorphism associated with these mutations is complementary to other methods [17, 34, 36, 37, 50], and hence can be used to explore genetic diversity within a given species. Interestingly, in strain 210, from the W-Beijing family, eight of the 19 ICDSs common to the five M. tuberculosis strains tested (H37Rv, CDC1551, Haarlem, F11, C) corresponded to full-length ORFs, illustrating its earlier divergence. Some of these genes may be involved in virulence, as they concern functions such as host cell invasion (ICDS0011 of H37Rv), lipid biosynthesis (ICDS0066 and ICDS0031 of H37Rv) and intermediary metabolism (ICDS0085 of H37Rv). To test whether this trait was a particularity of the 210 strain or applied more generally to the W-Beijing phylum, we sequenced the eight ORFs that were full-length in this strain in 21 other clinical isolates of the W-Beijing lineage (Table 3). In all cases, the ORFs were full-length rather than ICDSs, demonstrating that these strains are genetically homogeneous. The analysis performed using a strain of M. africanum showed that this species shares an even smaller number of ICDSs with M. tuberculosis H37Rv and CDC1551 than the W-Beijing strains do.
More genome sequences of various strains and species are required for characterization of the genetic differences between the W-Beijing strains and other species of the M. tuberculosis complex. The alkA gene has been shown to contain a frameshift in both M. bovis and some M. tuberculosis isolates from the Central African Republic [48]. The presence of SNPs in the region adjacent to the nonsense mutation led the authors to propose convergent evolution. Although this probably varies from gene to gene, we instead favor the hypothesis that the nonsense mutation was acquired by the ancestor and spread to the progeny, with subsequent mutations acquired in the adjacent region. Epidemiologists should bear in mind that a small percentage of ICDSs may correspond to sequencing errors [4, 11], generating artifactual genetic differences. Our analysis did not allow for the detection of mutations in which the frame of the coding sequence was conserved (synonymous mutation, in-frame deletion), decreasing the total level of diversity observed. However, comparative ICDS analysis presents the major advantage of making it possible to associate the frameshift with a putative function and, possibly, with a particular phenotype. In conclusion, more attention should be paid to ICDS detection and comparison, particularly at the genomic scale.


We report here a comparative analysis of ICDSs in six isolates of M. tuberculosis, two of M. bovis and one of M. africanum. We show that these ICDSs can be classified as "common to all strains" or "strain-specific". Common ICDSs result from mutations acquired before the divergence of the species, whereas strain-specific ICDSs were acquired after this divergence. Comparative analyses of these ICDSs make it possible to define the molecular signature of a particular strain, phylogenetic lineage or species. We further show that ICDS formation is not correlated with the presence of a mutated promoter, and suggest that promoter extinction is not the main cause of pseudogene formation. The correlation between ICDSs, function and phenotypes could have important evolutionary implications. This study provides population geneticists with a list of targets that could undergo selective pressure and thus alter relationships between the various lineages of M. tuberculosis strains and their host.



The genome sequences of M. tuberculosis H37Rv and CDC1551 and M. bovis AF2122/97 were taken from the TIGR website [51]. The genome sequences of M. tuberculosis strains 210, F11, C and Haarlem were consulted on the TIGR or Broad Institute websites [52]. The genome sequence of M. bovis BCG 1173P2 was taken from the National Center for Biotechnology Information (NCBI) website (accession number AM408590). The genome sequence of M. africanum GM041182 was consulted online at the Sanger Centre [53].

Detection of common ICDS

The genomic sequences of M. tuberculosis CDC1551, M. tuberculosis H37Rv, M. bovis AF2122/97 and M. bovis BCG 1173P2 were scanned for pairs of adjacent coding sequences that exhibit common homologs after translation. Such a pair of coding sequences is considered an ICDS if no paralogy relationship exists between the two coding sequences. ICDS detection is described in detail in [3]. The ICDSs detected in each strain were then cross-compared by all-against-all blastn searches. For each ICDS, the best hits (E < 10^-65) detected in the different strains were manually analysed to discriminate between common and strain-specific ICDSs.
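The cross-comparison step can be illustrated with a short sketch. It assumes that the blastn best hits have already been parsed into tuples of (ICDS identifier, query strain, subject strain, E-value). The tuple layout, the function names and the "shared by a subset" label are hypothetical, and the sketch does not replace the manual inspection described above.

```python
from collections import defaultdict

E_VALUE_CUTOFF = 1e-65  # threshold quoted in the text (E < 10^-65)


def strains_sharing_icds(hits):
    """Map each ICDS to the set of strains in which it is detected.

    hits: iterable of (icds_id, query_strain, subject_strain, e_value)
    tuples, one per best hit. This input layout is an assumption.
    """
    present = defaultdict(set)
    for icds_id, query_strain, subject_strain, e_value in hits:
        present[icds_id].add(query_strain)  # an ICDS exists in its own strain
        if e_value < E_VALUE_CUTOFF:
            present[icds_id].add(subject_strain)
    return dict(present)


def classify(present_in, all_strains):
    """Label an ICDS from the set of strains that carry it."""
    if present_in == set(all_strains):
        return "common"
    if len(present_in) == 1:
        return "strain-specific"
    return "shared by a subset"  # hypothetical third label


# Hypothetical hits using placeholder ICDS identifiers.
strains = ["H37Rv", "CDC1551", "AF2122/97"]
hits = [
    ("icds_x", "H37Rv", "CDC1551", 1e-80),
    ("icds_x", "H37Rv", "AF2122/97", 1e-70),
    ("icds_y", "H37Rv", "CDC1551", 1e-10),
]
shared = strains_sharing_icds(hits)
labels = {icds: classify(found, strains) for icds, found in shared.items()}
```

In this toy input, `icds_x` passes the cutoff in every strain, whereas the only hit for `icds_y` is too weak to count.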

Sequencing analysis

Chromosomal DNA of M. tuberculosis isolates from various lineages (Table 3) was used as a template for PCR amplification of the selected loci. The primers used for amplification and sequencing were designed as previously described [3], using an optimized version of CADO4MI [54]. The nucleotide and deduced amino-acid sequences were analyzed with DNA Strider [55].

Promoter analysis

A region of 200 bp upstream of the initiation codon was extracted for each of the 99 ICDSs specific to M. tuberculosis H37Rv, CDC1551 and M. bovis AF2122/97 (Additional files 2, 3 and 4). As a control group, the 200 bp upstream of the initiation codon were extracted for 99 full-length genes randomly selected from M. tuberculosis H37Rv; these 99 genes are full-length in M. tuberculosis H37Rv, CDC1551 and M. bovis AF2122/97. In each case (test group and control group), the promoter regions of the three strains were aligned using ClustalW [56] and the sequence variation was recorded. The numbers of differences observed in the upstream regions were compared statistically using the chi-square test.
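Recording the sequence variation in an aligned upstream region can be sketched as below. The sketch assumes that the three aligned sequences are available as equal-length strings with "-" for gaps, and that any column in which the strains disagree counts as one difference. Both the counting rule and the function name are assumptions, not the procedure used in the study.

```python
def count_variable_columns(aligned_seqs):
    """Count alignment columns in which the sequences are not all identical.

    aligned_seqs: list of aligned sequences of equal length, gaps as '-'.
    A gap facing a nucleotide counts as a difference (an assumption).
    """
    seqs = [s.upper() for s in aligned_seqs]
    if len({len(s) for s in seqs}) != 1:
        raise ValueError("aligned sequences must all have the same length")
    return sum(1 for column in zip(*seqs) if len(set(column)) > 1)


# Toy alignment of three short hypothetical sequences (not real promoters).
toy_alignment = ["ACGT-A", "ACGTTA", "ACCTTA"]
toy_differences = count_variable_columns(toy_alignment)
```

Each upstream region can then be scored as polymorphic or not, which gives the counts needed for a contingency table.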

Statistical analysis

The statistical significance of the difference in the frequency of sequence polymorphism between the regions upstream of ICDSs and the regions upstream of full-length genes was tested using a chi-square (χ²) test, which assesses whether two distributions differ. The values obtained were χ² = 1.367, df = 1, P = 0.2423; the difference between the two groups is therefore not statistically significant at α = 0.05.
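The relationship between the reported statistic and its P value can be checked with a few lines of code. The sketch assumes a 2 x 2 contingency table (group by polymorphic or not), which is consistent with df = 1 but is not stated explicitly in the text, and it applies no continuity correction. The function names are hypothetical and the example counts are placeholders.

```python
import math


def chi2_2x2(a, b, c, d):
    """Pearson chi-square statistic for the 2 x 2 table [[a, b], [c, d]].

    No continuity correction is applied (an assumption).
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        raise ValueError("every row and column total must be non-zero")
    return n * (a * d - b * c) ** 2 / denom


def p_value_df1(chi2):
    """Upper-tail P value of a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2.0))


# P value recomputed from the statistic reported in the text.
reported_p = p_value_df1(1.367)

# Placeholder counts for illustration only, not data from the study.
toy_chi2 = chi2_2x2(20, 10, 10, 20)
```

With one degree of freedom the statistic is the square of a standard normal deviate, which is why the complementary error function is sufficient here and no statistics library is needed.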



ICDS, Interrupted CoDing Sequence. ORF, Open Reading Frame.


  1. Cruveiller S, Le Saux J, Vallenet D, Lajus A, Bocs S, Medigue C: MICheck: a web tool for fast checking of syntactic annotations of bacterial genomes. Nucleic Acids Res. 2005, 33 (Web Server issue): W471-479. 10.1093/nar/gki498.

  2. Liu Y, Harrison PM, Kunin V, Gerstein M: Comprehensive analysis of pseudogenes in prokaryotes: widespread gene decay and failure of putative horizontally transferred genes. Genome Biol. 2004, 5 (9): R64. 10.1186/gb-2004-5-9-r64.

  3. Perrodou E, Deshayes C, Muller J, Schaeffer C, Van Dorsselaer A, Ripp R, Poch O, Reyrat JM, Lecompte O: ICDS database: interrupted CoDing sequences in prokaryotic genomes. Nucleic Acids Res. 2006, 34 (Database issue): D338-343. 10.1093/nar/gkj060.

  4. Deshayes C, Perrodou E, Gallien S, Euphrasie D, Schaeffer C, Van-Dorsselaer A, Poch O, Lecompte O, Reyrat JM: Interrupted coding sequences in Mycobacterium smegmatis: authentic mutations or sequencing errors? Genome Biol. 2007, 8 (2): R20. 10.1186/gb-2007-8-2-r20.

  5. Gomez-Valero L, Rocha EP, Latorre A, Silva FJ: Reconstructing the ancestor of Mycobacterium leprae: the dynamics of gene loss and genome reduction. Genome Res. 2007, 17 (8): 1178-1185. 10.1101/gr.6360207.

  6. Cole ST, Eiglmeier K, Parkhill J, James KD, Thomson NR, Wheeler PR, Honore N, Garnier T, Churcher C, Harris D: Massive gene decay in the leprosy bacillus. Nature. 2001, 409 (6823): 1007-1011. 10.1038/35059006.

  7. Stinear TP, Mve-Obiang A, Small PL, Frigui W, Pryor MJ, Brosch R, Jenkin GA, Johnson PD, Davies JK, Lee RE: Giant plasmid-encoded polyketide synthases produce the macrolide toxin of Mycobacterium ulcerans. Proc Natl Acad Sci USA. 2004, 101 (5): 1345-1349. 10.1073/pnas.0305877101.

  8. Darby AC, Cho NH, Fuxelius HH, Westberg J, Andersson SG: Intracellular pathogens go extreme: genome evolution in the Rickettsiales. Trends Genet. 2007, 23 (10): 511-520. 10.1016/j.tig.2007.08.002.

  9. Guan X, Uberbacher EC: Alignments of DNA and protein sequences containing frameshift errors. Comput Appl Biosci. 1996, 12 (1): 31-40.

  10. Hayashi K, Morooka N, Yamamoto Y, Fujita K, Isono K, Choi S, Ohtsubo E, Baba T, Wanner BL, Mori H: Highly accurate genome sequences of Escherichia coli K-12 strains MG1655 and W3110. Mol Syst Biol. 2006, 2: 2006.0007. 10.1038/msb4100049.

  11. Medigue C, Rose M, Viari A, Danchin A: Detecting and analyzing DNA sequencing errors: toward a higher quality of the Bacillus subtilis genome sequence. Genome Res. 1999, 9 (11): 1116-1127. 10.1101/gr.9.11.1116.

  12. Xu Y, Mural RJ, Uberbacher EC: Correcting sequencing errors in DNA coding regions using a dynamic programming approach. Comput Appl Biosci. 1995, 11 (2): 117-124.

  13. Friedman CR, Quinn GC, Kreiswirth BN, Perlman DC, Salomon N, Schluger N, Lutfey M, Berger J, Poltoratskaia N, Riley LW: Widespread dissemination of a drug-susceptible strain of Mycobacterium tuberculosis. J Infect Dis. 1997, 176 (2): 478-484.

  14. Mathema B, Kurepina NE, Bifani PJ, Kreiswirth BN: Molecular epidemiology of tuberculosis: current insights. Clin Microbiol Rev. 2006, 19 (4): 658-685. 10.1128/CMR.00061-05.

  15. Gutacker MM, Mathema B, Soini H, Shashkina E, Kreiswirth BN, Graviss EA, Musser JM: Single-nucleotide polymorphism-based population genetic analysis of Mycobacterium tuberculosis strains from 4 geographic sites. J Infect Dis. 2006, 193 (1): 121-128. 10.1086/498574.

  16. Sreevatsan S, Pan X, Stockbauer KE, Connell ND, Kreiswirth BN, Whittam TS, Musser JM: Restricted structural gene polymorphism in the Mycobacterium tuberculosis complex indicates evolutionarily recent global dissemination. Proc Natl Acad Sci USA. 1997, 94 (18): 9869-9874. 10.1073/pnas.94.18.9869.

  17. Filliol I, Motiwala AS, Cavatore M, Qi W, Hazbon MH, Bobadilla del Valle M, Fyfe J, Garcia-Garcia L, Rastogi N, Sola C: Global phylogeny of Mycobacterium tuberculosis based on single nucleotide polymorphism (SNP) analysis: insights into tuberculosis evolution, phylogenetic accuracy of other DNA fingerprinting systems, and recommendations for a minimal standard SNP set. J Bacteriol. 2006, 188 (2): 759-772. 10.1128/JB.188.2.759-772.2006.

  18. Fleischmann RD, Alland D, Eisen JA, Carpenter L, White O, Peterson J, DeBoy R, Dodson R, Gwinn M, Haft D: Whole-genome comparison of Mycobacterium tuberculosis clinical and laboratory strains. J Bacteriol. 2002, 184 (19): 5479-5490. 10.1128/JB.184.19.5479-5490.2002.

  19. Manca C, Tsenova L, Barry CE, Bergtold A, Freeman S, Haslett PA, Musser JM, Freedman VH, Kaplan G: Mycobacterium tuberculosis CDC1551 induces a more vigorous host response in vivo and in vitro, but is not more virulent than other clinical isolates. J Immunol. 1999, 162 (11): 6740-6746.

  20. Manca C, Tsenova L, Bergtold A, Freeman S, Tovey M, Musser JM, Barry CE, Freedman VH, Kaplan G: Virulence of a Mycobacterium tuberculosis clinical isolate in mice is determined by failure to induce Th1 type immunity and is associated with induction of IFN-alpha/beta. Proc Natl Acad Sci USA. 2001, 98 (10): 5752-5757. 10.1073/pnas.091096998.

  21. Reed MB, Domenech P, Manca C, Su H, Barczak AK, Kreiswirth BN, Kaplan G, Barry CE: A glycolipid of hypervirulent tuberculosis strains that inhibits the innate immune response. Nature. 2004, 431 (7004): 84-87. 10.1038/nature02837.

  22. Camus JC, Pryor MJ, Medigue C, Cole ST: Re-annotation of the genome sequence of Mycobacterium tuberculosis H37Rv. Microbiology. 2002, 148 (Pt 10): 2967-2973.

  23. Cole ST, Brosch R, Parkhill J, Garnier T, Churcher C, Harris D, Gordon SV, Eiglmeier K, Gas S, Barry CE: Deciphering the biology of Mycobacterium tuberculosis from the complete genome sequence. Nature. 1998, 393 (6685): 537-544. 10.1038/31159.

  24. Garnier T, Eiglmeier K, Camus JC, Medina N, Mansoor H, Pryor M, Duthoy S, Grondin S, Lacroix C, Monsempe C: The complete genome sequence of Mycobacterium bovis. Proc Natl Acad Sci USA. 2003, 100 (13): 7877-7882. 10.1073/pnas.1130426100.

  25. Brosch R, Gordon SV, Marmiesse M, Brodin P, Buchrieser C, Eiglmeier K, Garnier T, Gutierrez C, Hewinson G, Kremer K: A new evolutionary scenario for the Mycobacterium tuberculosis complex. Proc Natl Acad Sci USA. 2002, 99 (6): 3684-3689. 10.1073/pnas.052548299.

  26. Bernal A, Ear U, Kyrpides N: Genomes OnLine Database (GOLD): a monitor of genome projects world-wide. Nucleic Acids Res. 2001, 29 (1): 126-127. 10.1093/nar/29.1.126.

  27. Brosch R, Gordon SV, Garnier T, Eiglmeier K, Frigui W, Valenti P, Dos Santos S, Duthoy S, Lacroix C, Garcia-Pelayo C: Genome plasticity of BCG and impact on vaccine efficacy. Proc Natl Acad Sci USA. 2007, 104 (13): 5596-5601. 10.1073/pnas.0700869104.

  28. Oettinger T, Jorgensen M, Ladefoged A, Haslov K, Andersen P: Development of the Mycobacterium bovis BCG vaccine: review of the historical and biochemical evidence for a genealogical tree. Tuber Lung Dis. 1999, 79 (4): 243-250. 10.1054/tuld.1999.0206.

  29. Brosch R, Gordon SV, Buchrieser C, Pym AS, Garnier T, Cole ST: Comparative genomics uncovers large tandem chromosomal duplications in Mycobacterium bovis BCG Pasteur. Yeast. 2000, 17 (2): 111-123. 10.1002/1097-0061(20000630)17:2<111::AID-YEA17>3.0.CO;2-G.

  30. Pym AS, Brodin P, Brosch R, Huerre M, Cole ST: Loss of RD1 contributed to the attenuation of the live tuberculosis vaccines Mycobacterium bovis BCG and Mycobacterium microti. Mol Microbiol. 2002, 46 (3): 709-717. 10.1046/j.1365-2958.2002.03237.x.

  31. Lopez B, Aguilar D, Orozco H, Burger M, Espitia C, Ritacco V, Barrera L, Kremer K, Hernandez-Pando R, Huygen K: A marked difference in pathogenesis and immune response induced by different Mycobacterium tuberculosis genotypes. Clin Exp Immunol. 2003, 133 (1): 30-37. 10.1046/j.1365-2249.2003.02171.x.

  32. Mira A, Pushker R: The silencing of pseudogenes. Mol Biol Evol. 2005, 22 (11): 2135-2138. 10.1093/molbev/msi209.

  33. Bocs S, Danchin A, Medigue C: Re-annotation of genome microbial coding-sequences: finding new genes and inaccurately annotated genes. BMC Bioinformatics. 2002, 3: 5. 10.1186/1471-2105-3-5.

  34. Goguet de la Salmoniere YO, Li HM, Torrea G, Bunschoten A, van Embden J, Gicquel B: Evaluation of spoligotyping in a study of the transmission of Mycobacterium tuberculosis. J Clin Microbiol. 1997, 35 (9): 2210-2214.

  35. Kamerbeek J, Schouls L, Kolk A, van Agterveld M, van Soolingen D, Kuijper S, Bunschoten A, Molhuizen H, Shaw R, Goyal M: Simultaneous detection and strain differentiation of Mycobacterium tuberculosis for diagnosis and epidemiology. J Clin Microbiol. 1997, 35 (4): 907-914.

  36. Mazars E, Lesjean S, Banuls AL, Gilbert M, Vincent V, Gicquel B, Tibayrenc M, Locht C, Supply P: High-resolution minisatellite-based typing as a portable approach to global analysis of Mycobacterium tuberculosis molecular epidemiology. Proc Natl Acad Sci USA. 2001, 98 (4): 1901-1906. 10.1073/pnas.98.4.1901.

  37. Kaser M, Rondini S, Naegeli M, Stinear T, Portaels F, Certa U, Pluschke G: Evolution of two distinct phylogenetic lineages of the emerging human pathogen Mycobacterium ulcerans. BMC Evol Biol. 2007, 7 (1): 177. 10.1186/1471-2148-7-177.

  38. Constant P, Perez E, Malaga W, Laneelle MA, Saurel O, Daffe M, Guilhot C: Role of the pks15/1 gene in the biosynthesis of phenolglycolipids in the Mycobacterium tuberculosis complex. Evidence that all strains synthesize glycosylated p-hydroxybenzoic methyl esters and that strains devoid of phenolglycolipids harbor a frameshift mutation in the pks15/1 gene. J Biol Chem. 2002, 277 (41): 38148-38158. 10.1074/jbc.M206538200.

  39. Gagneux S, DeRiemer K, Van T, Kato-Maeda M, de Jong BC, Narayanan S, Nicol M, Niemann S, Kremer K, Gutierrez MC: Variable host-pathogen compatibility in Mycobacterium tuberculosis. Proc Natl Acad Sci USA. 2006, 103 (8): 2869-2873. 10.1073/pnas.0511240103.

  40. Tsolaki AG, Gagneux S, Pym AS, Goguet de la Salmoniere YO, Kreiswirth BN, Van Soolingen D, Small PM: Genomic deletions classify the Beijing/W strains as a distinct genetic lineage of Mycobacterium tuberculosis. J Clin Microbiol. 2005, 43 (7): 3185-3191. 10.1128/JCM.43.7.3185-3191.2005.

  41. Reed MB, Gagneux S, Deriemer K, Small PM, Barry CE: The W-Beijing lineage of Mycobacterium tuberculosis overproduces triglycerides and has the DosR dormancy regulon constitutively upregulated. J Bacteriol. 2007, 189 (7): 2583-2589. 10.1128/JB.01670-06.

  42. Perez E, Constant P, Lemassu A, Laval F, Daffe M, Guilhot C: Characterization of three glycosyltransferases involved in the biosynthesis of the phenolic glycolipid antigens from the Mycobacterium tuberculosis complex. J Biol Chem. 2004, 279 (41): 42574-42583. 10.1074/jbc.M406246200.

  43. Boechat N, Lagier-Roger B, Petit S, Bordat Y, Rauzier J, Hance AJ, Gicquel B, Reyrat JM: Disruption of the gene homologous to mammalian Nramp1 in Mycobacterium tuberculosis does not affect virulence in mice. Infect Immun. 2002, 70 (8): 4124-4131. 10.1128/IAI.70.8.4124-4131.2002.

  44. Domenech P, Reed MB, Barry CE: Contribution of the Mycobacterium tuberculosis MmpL protein family to virulence and drug resistance. Infect Immun. 2005, 73 (6): 3492-3501. 10.1128/IAI.73.6.3492-3501.2005.

  45. Harth G, Maslesa-Galic S, Tullius MV, Horwitz MA: All four Mycobacterium tuberculosis glnA genes encode glutamine synthetase activities but only GlnA1 is abundantly expressed and essential for bacterial homeostasis. Mol Microbiol. 2005, 58 (4): 1157-1172. 10.1111/j.1365-2958.2005.04899.x.

  46. Okamura K, Feuk L, Marques-Bonet T, Navarro A, Scherer SW: Frequent appearance of novel protein-coding sequences by frameshift translation. Genomics. 2006, 88 (6): 690-697. 10.1016/j.ygeno.2006.06.009.

  47. Marri PR, Bannantine JP, Golding GB: Comparative genomics of metabolic pathways in Mycobacterium species: gene duplication, gene decay and lateral gene transfer. FEMS Microbiol Rev. 2006, 30 (6): 906-925. 10.1111/j.1574-6976.2006.00041.x.

  48. Nouvel LX, Dos Vultos T, Kassa-Kelembho E, Rauzier J, Gicquel B: A non-sense mutation in the putative anti-mutator gene ada/alkA of Mycobacterium tuberculosis and M. bovis isolates suggests convergent evolution. BMC Microbiol. 2007, 7: 39. 10.1186/1471-2180-7-39.

  49. Karboul A, Gey van Pittius NC, Namouchi A, Vincent V, Sola C, Rastogi N, Suffys P, Fabre M, Cataldi A, Huard RC: Insights into the evolutionary history of tubercle bacilli as disclosed by genetic rearrangements within a PE_PGRS duplicated gene pair. BMC Evol Biol. 2006, 6: 107. 10.1186/1471-2148-6-107.

  50. Read TD, Salzberg SL, Pop M, Shumway M, Umayam L, Jiang L, Holtzapple E, Busch JD, Smith KL, Schupp JM: Comparative genome sequencing for discovery of novel polymorphisms in Bacillus anthracis. Science. 2002, 296 (5575): 2028-2033. 10.1126/science.1071837.

  51. J. Craig Venter Institute. []

  52. The BROAD Institute. []

  53. Wellcome Trust Sanger Institute. Mycobacterium africanum. []

  54. Computer assisted Design of Oligonucleotide for Microarray. []

  55. Marck C: 'DNA Strider': a 'C' program for the fast analysis of DNA and protein sequences on the Apple Macintosh family of computers. Nucleic Acids Res. 1988, 16 (5): 1829-1836. 10.1093/nar/16.5.1829.

  56. Thompson JD, Higgins DG, Gibson TJ: CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 1994, 22 (22): 4673-4680. 10.1093/nar/22.22.4673.



We thank R. Brosch and J. Belisle for the kind gift of chromosomal DNA. We thank F. Tekaïa for help with the statistical analysis. We thank S. Gagneux, P. M. Small, B. Gicquel, T. Dos Vultos and C. Sola for stimulating discussions and useful suggestions. We thank INSERM for funding this project under the Avenir program, through a grant to JMR, Chargé de Recherches at INSERM. This work was also funded by an RNG (Réseau National de Génopoles) grant to the Strasbourg Bioinformatics Platform infrastructures and EVIGENORET (LSHG-CT-2005-512036). CD is funded by a PhD grant from the Fondation pour la Recherche Médicale (FRM).

Author information

Correspondence to Jean-Marc Reyrat.

Additional information

Authors' contributions

CD helped to carry out the bioinformatic studies, analysed the TB strains by sequencing and drafted the manuscript. EP carried out the bioinformatic studies and helped to draft the manuscript. DE analysed the TB strains by sequencing. EF helped to analyze the promoter regions. OP helped to draft the manuscript. PB participated in the analysis of the W-Beijing strains and helped to write the manuscript. OL participated in the design of the study, carried out the bioinformatic studies and drafted the manuscript. JMR conceived the study, participated in its design and coordination and in finalizing the manuscript. All authors read and approved the final manuscript.


About this article

Cite this article

Deshayes, C., Perrodou, E., Euphrasie, D. et al. Detecting the molecular scars of evolution in the Mycobacterium tuberculosis complex by analyzing interrupted coding sequences. BMC Evol Biol 8, 78 (2008) doi:10.1186/1471-2148-8-78



  • Tuberculosis
  • Authentic Mutation
  • Additional Mutation
  • Tuberculosis H37Rv
  • Pks4 Gene