- Research article
- Open Access
The metazoan history of the COE transcription factors. Selection of a variant HLH motif by mandatory inclusion of a duplicated exon in vertebrates
BMC Evolutionary Biology volume 8, Article number: 131 (2008)
The increasing number of available genomic sequences now makes it possible to study the evolutionary history of specific genes or gene families. Transcription factors (TFs) involved in regulation of gene-specific expression are key players in the evolution of metazoan development. The low-complexity COE (Collier/Olfactory-1/Early B-Cell Factor) family of transcription factors constitutes a well-suited paradigm for studying the evolution of TF structure and function, including the specific question of protein modularity. Here, we compare the structure of coe genes within the metazoan kingdom and report on the mechanism behind a vertebrate-specific exon duplication.
COE proteins display a modular organisation, with three highly conserved domains: a COE-specific DNA-binding domain (DBD), an Immunoglobulin/Plexin/transcription (IPT) domain and an atypical Helix-Loop-Helix (HLH) motif. Comparison of the splice structure of coe genes between cnidarians and bilaterians shows that the ancestral COE DBD was built from 7 separate exons, with no evidence for exon shuffling with other metazoan gene families. It also confirms the presence of an ancestral H1LH2 motif in all COE proteins, which partly overlaps the repeated H2d-H2a motif first identified in rodent EBF. Electrophoretic Mobility Shift Assays show that formation of COE dimers is mediated by this ancestral motif. The H2d-H2a α-helical repetition appears to be a vertebrate characteristic that originated from a tandem exon duplication that took place prior to the split between gnathostomes and cyclostomes. We put forward a two-step model for the inclusion of this exon in the vertebrate transcripts.
Three main features in the history of the coe gene family can be inferred from these analyses: (i) each conserved domain of the ancestral coe gene was built from multiple exons and the same scattered structure has been maintained throughout metazoan evolution. (ii) There exists a single coe gene copy per metazoan genome except in vertebrates. The H2a-H2d duplication that is specific to vertebrate proteins provides an example of a novel vertebrate characteristic, which may have been fixed early in the gnathostome lineage. (iii) This duplication provides an interesting example of counter-selection of alternative splicing.
Thanks to the increasing number of available genomic sequences, it has become possible to study the evolutionary history of specific genes or gene families in relation to their co-option in innovations that have punctuated the evolutionary diversification of metazoans. Transcription factors (TFs) involved in regulation of gene-specific expression are key players in the evolution of development. The COE family of transcription factors takes its name from the founding members of the family, Collier (Col) and Olfactory-1/Early-B-Cell Factor (Olf-1/EBF), isolated from Drosophila and rodents, respectively [1–3]. While there was no evidence for coe genes in fungi, plants, or any of the various phyla of protozoans, identification of a cnidarian coe gene, Nvcoe, in the anthozoan sea anemone Nematostella vectensis suggested that COE proteins appeared with metazoa, a conclusion strengthened by the identification of COE members both in other cnidaria and in porifera. While up to 4 ebf paralogs have been identified in vertebrates [6–8], a single coe member has been identified in all the other animals for which genome sequences have become available, suggesting that expansion of the coe gene family only occurred at the origin of vertebrates.
Expression profiles of coe genes in embryos from various protostomes and deuterostomes and from N. vectensis have revealed a common feature, namely expression in subsets of sensory neurons [4, 9–14]. This feature raised the possibility that one ancestral role of COE proteins was to participate in the specification of specialised sensory cells and the ontogeny of an elaborate nervous system. However, genetic analyses performed in mice and, more recently, Drosophila raised the possibility that another ancestral function of COE proteins could have been in the development of cellular immunity [15, 16]. The diversity of COE protein functions contrasts strikingly with the high degree of primary sequence conservation and the lack of expansion of this family of TFs throughout metazoan evolution. Owing to its low complexity, the COE family constitutes a well-suited paradigm for studying the evolution of TF structure and function, including the specific question of protein modularity.
Pioneering analysis of EBF identified three functional domains [1, 17]: i) an amino-terminal DBD, about 210 amino acids long, which is the signature of COE proteins; ii) a Helix-Helix dimerisation motif made of two tandemly arranged α-helical repeats showing limited sequence similarity to the HLH motif described in basic helix-loop-helix (b-HLH) proteins; iii) a transcription-activating domain without a marked specific signature. The presence of an Ig-like/Plexin/Transcription Factor (IPT) domain between the DBD and HLH domains was also noticed, but the function of this domain remains unknown [18, 19]. Comparison between Col and EBF showed that the DBD, IPT and HLH domains have been particularly well conserved during evolution. However, one of the tandemly arranged α-helices noted in EBF/Olf-1 was missing in Col. This, and further examination of the Col and EBF primary sequences, led us to postulate the existence, in all COE proteins, of an HLH motif distinct from and partly overlapping the motif initially identified in EBF and Olf-1 [2, 11]. This motif is designated below as H1LH2, while the vertebrate-specific motif is designated as H2d-H2a, with H2d and H2a (d for duplicated, a for ancestral) corresponding to the duplication of the single H2 helix found in Drosophila.
To gain more insight into the evolutionary history of coe genes, we compared in detail their genomic structure between various metazoan phyla. This comparison shows that the metazoan ancestor COE DBD was built from at least 7 separate exons, with no evidence for exon shuffling with other gene families. Detailed analysis of various chordate genomes and ESTs indicated that the H2d duplication took place in the vertebrate lineage prior to the two rounds of whole genome duplication characterising the origin of this taxon. It thus provides an example of a novel vertebrate characteristic, which may have been fixed early in the gnathostome lineage. It also revealed that the vertebrate-specific H2 duplication originated from a two-step tandem exon duplication. Careful inspection of the intron phases leads us to put forward an original scenario that involves the selection of a new splice donor site, resulting in the formation of a "cassette" H2d exon. We show here that, like EBF, Col binds to DNA as a homodimer and that the ancestral H1LH2 motif mediates formation of Col/EBF homodimers and heterodimers. Incorporation of H2d in all four mammalian EBF proteins reveals an interesting example of compulsory counter-selection of alternative splicing following exon duplication.
Search for COE/EBF related sequences in genomic and EST databases
Systematic searches for COE/EBF proteins were conducted in available databases using the BLAST algorithm and mouse COE sequences (Additional file 1) as query. The databases analysed included current versions of the genomes of Monosiga brevicollis, Nematostella vectensis, Capitella capitata, Lottia gigantea, Branchiostoma floridae and Strongylocentrotus purpuratus, as well as the Ensembl databases of predicted proteins of Ciona intestinalis, Homo sapiens, Mus musculus, Monodelphis domestica, Ornithorhynchus anatinus, Gallus gallus, Xenopus tropicalis and Danio rerio. In Petromyzon marinus, the genomic scaffolds containing COE/EBF coding sequences were retrieved from the pre-Ensembl genome version available at . Coding sequences were identified in these scaffolds with GeneWise and assembled using homologous sequences (>95% identity) identified from Lampetra fluviatilis ESTs as templates. A survey of ESTs annotated in the databanks for alternative transcript variants included the use of AceView. The sponge Amphimedon queenslandica COE sequence was taken from .
Molecular phylogenetic analysis
The alignment of COE/EBF protein sequences was obtained using MUSCLE and checked by hand in BioEdit. Only full-length sequences and unambiguously aligned segments were retained for the phylogenetic analysis (see Additional file 2). Neighbor-Joining (NJ), Maximum Likelihood (ML) and Bayesian (BI) phylogenetic reconstructions were conducted using the MEGA 3.1 software, PhyML and MrBayes 3.0, respectively. In each case, we used the JTT model of sequence evolution with invariant+gamma distribution rates. Bootstrap proportions (BP) were calculated by analysis of 1000 replicates for NJ and by the RELL method on the 2000 top-ranking trees for ML analyses. In the BI analysis, four chains were run for 2 million iterations with default heating parameters and sampled every 500 iterations; the first 2000 trees were discarded as burn-in.
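The sampling arithmetic implied by these Bayesian settings can be checked with a short sketch. The three constants come from the text; the function name is hypothetical, and the count assumes that the starting tree (iteration 0) is not included among the sampled trees.

```python
# Sampling arithmetic for the Bayesian run described above.
# Assumption: the tree at iteration 0 is not counted as a sample.
TOTAL_ITERATIONS = 2_000_000  # from the text
SAMPLE_EVERY = 500            # from the text
BURN_IN_TREES = 2000          # from the text


def mcmc_sampling_summary(total_iterations, sample_every, burn_in_trees):
    """Return how many trees are sampled, discarded and retained."""
    sampled = total_iterations // sample_every
    return {
        "sampled_trees": sampled,
        "retained_trees": sampled - burn_in_trees,
        "burn_in_fraction": burn_in_trees / sampled,
        "burn_in_iterations": burn_in_trees * sample_every,
    }


summary = mcmc_sampling_summary(TOTAL_ITERATIONS, SAMPLE_EVERY, BURN_IN_TREES)
print(summary)
```

Under these assumptions the burn-in corresponds to the first half of the run, so the posterior probabilities would rest on the trees sampled during the second million iterations.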
In vitro translation and Electrophoretic Mobility Shift Assays
The pEThBF1, pET15bHis-mEBF1 (a gift from J. Hagman) and pET17bHis Col plasmids and deletions therein were used for in vitro transcription/translation of EBF, EBF*, EBFΔH1, EBFΔH2, Col, Col*, ColΔH1, ColΔH2 and ColΔH2L. To generate internal deletions corresponding to the H1 and H2 helices, we used the four-oligonucleotide PCR method. In vitro transcription and translation using rabbit reticulocyte lysate was performed as described by the manufacturer (kit L1170, Promega). For each protein synthesised, the efficiency of translation was assessed by SDS-PAGE of parallel translation reactions performed in the presence of 35S-methionine. Electrophoretic mobility shift assays (EMSA) were performed under the conditions described previously, using either a 125 bp DNA fragment containing mb-1 promoter sequences (from -250 to -115), which includes the EBF binding site 5'-AGACTCaaGGGAAT-3', or the PAL probe, which contains the palindromic site 5'-ATTCCCaaGGGAAT-3' ([1, 33] and data not shown). Competition experiments were performed using a 100× molar excess of 30 bp oligonucleotides containing either the wild-type 5'-CTAGAGAGAGACTCAA GG GAATTGTGGCCAGCCC-3' or the mutated 5'-CTAGAGAGAGACTCAA CC GAATTGTGGCCAGCCC-3' mb-1 recognition site, as described previously.
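The difference between the two probes can be made explicit with a minimal sketch that compares the left half-site of each binding site with the reverse complement of its right half-site. The two sequences are taken from the text; the function names and the treatment of the two lowercase central bases as a spacer are assumptions made for illustration.

```python
# Compare the two half-sites of an EBF binding site.
# Assumption: the two central (lowercase) bases are a spacer and are ignored.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def half_site_mismatches(site, spacer_len=2):
    """Count positions where the left half-site differs from the
    reverse complement of the right half-site."""
    site = site.upper()
    half = (len(site) - spacer_len) // 2
    left, right = site[:half], site[-half:]
    return sum(a != b for a, b in zip(left, reverse_complement(right)))


PAL_SITE = "ATTCCCaaGGGAAT"  # palindromic PAL probe site (from the text)
MB1_SITE = "AGACTCaaGGGAAT"  # mb-1 promoter site (from the text)

print(half_site_mismatches(PAL_SITE))  # 0: half-sites are perfectly symmetric
print(half_site_mismatches(MB1_SITE))  # 3: left half-site deviates at 3 positions
```

In this reading, the PAL probe carries two perfect half-sites, whereas the natural mb-1 site keeps one perfect half-site and one that deviates from it.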
The P[col5cDNA]; col1 strain, designated in Fig. S4 as col1, and the UAS-col strains have been described previously. The P[col5cDNA] transgene rescues the embryonic lethality but not the wing defects of col1 mutants. The UAS-Mm ebf and UAS-Mm ebf2 constructs were made by cloning the entire ebf/ebf2 open reading frame in the pUAST vector. Three independent lines were used for ectopic expression assays. All other stocks were obtained from the Bloomington Stock Center and are described in FlyBase.
The coe gene family
The recent identification of coe sequences in cnidarians and poriferans [4, 5], together with the absence of evidence for coe genes outside metazoans, suggests that COE proteins appeared with this taxon. In line with this conclusion, no COE-related sequence could be identified in the genome of the choanoflagellate Monosiga brevicollis, while COE sequences have been reported in the sponge Amphimedon queenslandica and the sea anemone Nematostella vectensis [4, 5]. In order to obtain an exhaustive characterisation of COE proteins and their relationships in metazoans, we first updated the phylogenetic analysis available for this family, taking advantage of the wide range of genomes now available (Fig. 1 and see Additional file 2). Systematic searches for coe/ebf related sequences were carried out in the genome of a diploblast, two ecdysozoans (fly Drosophila melanogaster and nematode Caenorhabditis elegans), two lophotrochozoans (annelid Capitella capitata and mollusc Lottia gigantea), an echinoderm (sea urchin Strongylocentrotus purpuratus), the cephalochordate Branchiostoma floridae, the ascidian Ciona intestinalis and eight vertebrates, including the lamprey Petromyzon marinus, the platypus Ornithorhynchus anatinus and the opossum Monodelphis domestica in addition to the mouse, human, chick, Xenopus (X. tropicalis) and zebrafish (see Additional file 1 for accession numbers and nomenclature of each gene). A single coe gene was found in all metazoans studied, except in vertebrates, with evidence for three genes in Xenopus, and a fourth one in the four mammals studied, as previously reported in the mouse and human, but also in the zebrafish. In P. marinus, two distinct clusters, spanning the 5' part of the coding region and including the HLH region, were reconstructed from the genome, thus pointing to the presence of at least two coe genes in lampreys.
The phylogenetic analysis was conducted using NJ, ML and bayesian algorithms, excluding highly divergent sequences as well as heavily truncated ones but retaining representatives of ecdysozoans, lophotrochozoans and of the major chordate taxa (see Additional file 2). In the resulting trees (Fig. 1), protostome, as well as lophotrochozoan coe sequences were found clustered in monophyletic groups in NJ, ML and BI, albeit with low statistical supports except in NJ (BP= 91 and 90 respectively). Similarly, the monophyly of the vertebrate sequences retained in the reconstruction was retrieved whatever the algorithm used, with moderate to good statistical supports in BI and ML (respectively PP = 1 and BP = 79; but BP = 50 in NJ), thus supporting and extending the results obtained by Pang et al., 2004. Inside this group, all three reconstruction methods also confirmed the previously reported presence of four additional monophyletic groups. Three of them contain at least one zebrafish or one Xenopus sequence in addition to one mammalian sequence (either COE1, COE2, or COE3). These groups were named accordingly COE1, COE2, both strongly supported (PP = 1 in BI, BP = 100 in NJ and ML), and COE3, less well supported in BI (PP = 0.38, but BP = 82 in ML and 100 in NJ). The fourth group (PP = 1 in BI, BP = 98 in ML and NJ) only contains mammalian sequences, clustering with mouse EBF4. Together with the identification of COE1, COE2, COE3 and EBF4 partial sequences in the platypus O. anatinus (excluded from the reconstruction due to truncations in the available sequences: see Additional file 3), this clearly indicates that the emergence of the EBF4 class has predated the mammalian radiation. 
The branching order observed for this group, always found as a sister group of all other vertebrate sequences (albeit with poor statistical supports), may be taken as evidence for an ancient origin in the vertebrate lineage, with subsequent losses in actinopterygians, amphibians and archosaurs (the three chick genes appearing respectively related to the COE1, COE2 and COE3 classes, see Additional file 3). However, a reconstruction artefact possibly related to the relatively long branches observed in this group among mammals remains difficult to exclude. Finally, the relative branching orders of the lamprey sequence included in the analyses (termed PmCOE-A) and of one of the zebrafish sequences (termed DrCOE) (Fig. 1) were found to vary depending on the algorithm and could not be resolved. Altogether, the phylogenomic analysis supports the conclusion that all vertebrate coe genes included in the reconstruction are derived from a single ancestral coe gene, present in the vertebrate lineage prior to the splitting between gnathostomes and cyclostomes. It also confirms the presence of four COE classes in gnathostomes. The emergence of three of them (COE1-3) is likely to have been linked to the two rounds of whole genome duplication that have occurred in the vertebrate lineage prior to the gnathostome radiation, while the origin of the fourth one (EBF4), which only contains mammalian sequences, is less clear. Finally, the chronology of the corresponding gene duplications relative to the cyclostome-gnathostome divergence, as well as the relationships of the lamprey genes with the four gnathostome classes, remained unresolved. Even though the number of genes identified in the lamprey and mammals (2 versus 4) is suggestive of the occurrence of a first round of duplication prior to the cyclostome-gnathostome splitting and a second one after their divergence, the phylogeny does not allow firm conclusions on this point.
The metazoan ancestor COE DBD was built from multiple "unique" exons
In vitro functional dissection of EBF, when EBF was a pioneer protein, delineated the COE DBD, which turned out to be unrelated to other, previously characterised DBDs, except for the presence of a zinc coordination motif [1, 17]. Sequence conservation between EBF1 and Drosophila Col then showed that this DBD constituted the molecular trademark of a new family of transcription factors designated as COE proteins [2, 11]. Sequence alignments between representative members of different phyla highlight the high degree of evolutionary conservation of each of the three COE-specific domains, the DBD, IPT and atypical HLH domains, as well as scattered blocks of sequence similarity in the carboxy-terminal transactivator domain (TAD) (Fig. 2A and see Additional file 2). Fig. 2B shows a diagrammatic comparison of the exon-intron structure of coe genes between representatives of deuterostomes, including the chordate (Craniata) Mus musculus, the urochordate ascidian C. intestinalis and the echinoderm Strongylocentrotus purpuratus; representatives of protostomes, including the insect D. melanogaster and the nematode Caenorhabditis elegans; and the cnidarian N. vectensis. An immediate outcome of this comparison was the remarkable conservation in number and positions of introns, independent of the overall size of the coe transcription units, which ranges from around 10 kb in C. elegans to around 400 kb in mouse ebf1. This showed that the ancestor coe gene was built from a complex set of multiple exons, and that this highly fragmented structure was maintained throughout metazoan evolution. Functional domains in proteins are, on a large scale, associated with protein-coding exons in the genome. It therefore came as a surprise to find that the COE DBD was constructed from at least seven separate exons, since introns are generally thought not to have functions, although more recent reports tend to suggest that intron accumulation in conserved genes might be an adaptive process.
Since each of the 6 introns interrupting the DBD is found at the same position and in the same phase in deuterostomes, protostomes and cnidarians, we conclude that this splice structure is ancestral, with the variation of exon number observed, for example, in Drosophila melanogaster and C. elegans reflecting secondary lineage-specific loss of introns (Fig. 2B and not shown; [39, 40]). Similarly, introns interrupting the IPT and HLH domains have also been lost in D. melanogaster (Fig. 2B). Contrasting with the conservation of length and primary sequence of the DBD and the IPT+HLH domains, significant primary sequence variation between different phyla is observed at the junction between the DBD and IPT domains (see Additional file 2). This correlates with a variable position of the intron separating these two domains (intron i8 in ebf, Fig. 2B) within a "linker" region whose sequence and length are themselves variable (see Additional file 3). Further sequence variation in this region is conferred by the use of two different E8 splice donor sites, resulting in EBF isoforms that differ by the inclusion or not of 8 to 10 amino acids, the possible functional consequences of which remain to be addressed. Domain exchange and/or accretion between proteins resulting from exon shuffling is believed to be one of the driving forces behind protein evolution. During this process, symmetrical exons, i.e. exons flanked by same-phase introns, can be deleted, duplicated or inserted without disrupting the downstream protein reading frame. Since, except for exon E6, all the exons contributing to the COE DBD are asymmetrical (Fig. 2B), this domain could not have been constructed by accretion of subdomains present in other bilaterian proteins through exon shuffling. Consistent with this conclusion, systematic BLAST searches with individual DBD exons failed to retrieve proteins other than COE proteins from databanks.
A comprehensive theory to explain intron abundance and the high level of position conservation among species is still missing. How the COE-specific DBD structure was put together in the first place therefore remains a fascinating question.
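The notion of a symmetrical exon used in this section can be illustrated with a small sketch. It assumes the usual convention that the phase of an intron is the number of nucleotides of the interrupted codon lying upstream of that intron (0, 1 or 2); the exon lengths below are hypothetical and are not taken from the coe genes.

```python
# Intron phase bookkeeping for a single exon and for a run of exons.
# Assumption: phase = number of nucleotides of the split codon located
# upstream of the intron. All exon lengths here are hypothetical.

def downstream_phase(upstream_phase, exon_length):
    """Phase of the intron that follows an exon of the given length."""
    return (upstream_phase + exon_length) % 3


def is_symmetrical(upstream_phase, exon_length):
    """An exon is symmetrical when both flanking introns share one phase,
    which is equivalent to its length being a multiple of 3."""
    return downstream_phase(upstream_phase, exon_length) == upstream_phase


def frame_offset(exon_lengths):
    """Reading-frame offset reached after splicing the exons together."""
    return sum(exon_lengths) % 3


symmetrical = [120, 90, 75]    # middle exon: 90 nt, multiple of 3
asymmetrical = [120, 100, 75]  # middle exon: 100 nt, not a multiple of 3

# Removing the symmetrical middle exon leaves the downstream frame unchanged.
print(frame_offset(symmetrical), frame_offset([120, 75]))
# Removing the asymmetrical middle exon shifts the downstream frame.
print(frame_offset(asymmetrical), frame_offset([120, 75]))
```

The same arithmetic applies to duplication or insertion: only an exon whose length is a multiple of 3 can be added to or removed from a transcript without shifting the codons that follow.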
The modified HLH motif of EBF proteins is a vertebrate innovation
A noticeable difference between EBF and Drosophila Col is the specific duplication of a short α-helical region in EBF (H2d-H2a tandem repetition, Fig. 2A,C). Because of remote sequence similarity, this H2 tandem repetition was originally proposed to constitute a dimerisation motif similar to that present in b-HLH proteins found in fungi, plants and metazoans [1, 3, 43]. The absence of H2d in Drosophila Col led us, however, to propose the existence of an alternative Helix-Linker-Helix (H1LH2) domain [2, 11]. Sequence comparison across a wide range of metazoans shows that the H1LH2 motif is an ancestral character (see Additional file 2). The predicted primary sequence of COE proteins in cephalochordates and urochordates, which are considered the closest living relatives of the vertebrate ancestor [12, 44, 45], suggested that the H2d-H2a duplication was a vertebrate-specific feature. To confirm a conclusion mostly based on EST analysis, we retrieved the intronic sequences lying between the H1 and H2a coding exons in representatives of the major deuterostome groups outside vertebrates, C. intestinalis, B. floridae and S. purpuratus, and verified the absence of H2d-related coding sequence (see Additional file 3). In contrast, all the gnathostome sequences retrieved from the genomes analysed, including not only the four COE1-3 and EBF4 classes but also the unassigned zebrafish DrCOE sequence, exhibit the H2d addition, strongly suggesting that the H2a-H2d duplication predated the gnathostome radiation (see Additional file 2). In the lamprey P. marinus, we found no evidence for the presence of H2d in the Pmcoe-A locus, but a definitive conclusion could not be reached in this case due to sequence gaps between the H1 and H2a coding regions in the current genome version. In contrast, the presence of H2d could be unambiguously recognised, and at the expected position, in the deduced PmCOE-B amino acid sequence (see Additional file 3).
Preliminary EST analyses of a closely related lamprey species, Lampetra fluviatilis, confirmed the presence of H2d in transcripts of the orthologous Lfcoe-B gene but also highlighted the presence of alternatively spliced forms devoid of the duplicated H2 sequence (see Additional file 4). The presence of H2d in lamprey LfCOE-B may thus be subject to alternative splicing, while no indication of a similar process has been obtained thus far in gnathostomes. Taken together, these data indicate that the H2 duplication occurred early in the vertebrate lineage, prior to the split between gnathostomes and cyclostomes, in a single-copy ancestral gene from which all gnathostome and at least the lamprey coe-A genes are derived. They also suggest that this additional protein domain may have been fixed early in the gnathostome lineage.
The COE-HLH dimerisation motif revisited
EBF/Olf-1 was initially isolated as a nuclear factor recognising functionally important cis-regulatory DNA sequences in the promoter of mb-1, an early B-lymphocyte specific gene, and of olfactory marker protein genes [46, 47]. Further characterisation showed that EBF/Olf-1 recognises variations on the palindromic sequence ATTCCCNNGGGAAT and binds DNA as a homodimer [1, 3, 33]. It was thus proposed that homodimer formation was mediated by the H2d-H2a α-helical repetition [1, 3, 17] (see Fig. 2A). The absence of H2d in COE proteins outside vertebrates (Fig. 2C) raised, however, the question of whether all COE proteins could bind DNA as dimers and, if so, which motif was involved in dimer formation. We therefore proposed that the H1LH2a motif found in all COE proteins plays this role. To address this question experimentally, we compared the dimerisation properties of Drosophila Col and EBF and of modified versions of these two proteins (Fig. 3A and see Additional file 5), using gel-shift assays with the vertebrate mb-1 promoter DNA. We first found that Col forms complexes with mb-1 DNA that migrate at the same position as EBF/mb-1 complexes, without evidence of fast-migrating complexes that would correspond to monomers (Fig. 3B). We could therefore conclude that, like EBF, Col binds to DNA as a homodimer. Truncated forms of EBF and Col that lack the transactivation domain (designated below as EBF* and Col*, respectively, Fig. 3A) form complexes of higher mobility than the full-length proteins (Fig. 3B,D). We took advantage of this higher mobility to assay the ability of Col to form heterodimers with EBF, using a mixture of full-length Col or EBF and EBF*. The formation of three types of DNA/protein complexes (Fig. 3B,C) indicated that Col is able to form heterodimers with EBF. Interestingly, Col/EBF heterodimer formation was favoured over homodimer formation (Fig. 3B,C and data not shown).
Since binding to mb-1 DNA is an indirect assay for dimer formation, this observation, which could indicate either favoured heterodimerisation or higher DNA binding affinity of heterodimers, needs to be further investigated. Above all, these data indicated that H2d is not required for dimerisation of COE proteins. Conversely, a truncated EBF-4 protein containing H2d but lacking H2a (OE-4S) was reported to bind to DNA as a homodimer. Together, these data allow us to conclude that the presence of a single copy of H2 is sufficient for COE proteins to bind DNA as dimers. We then tested the specific requirement for both H1 and H2 (H2a or H2a and H2d), by precisely removing either helical domain in Col or EBF* (ColΔH1/ColΔH2a and EBF*ΔH1/EBF*ΔH2 proteins, respectively, Fig. 3A). Neither deleted protein form was able to bind to DNA when co-expressed with either EBF* or Col* (Fig. 3C,D and not shown). This led us to conclude that the presence of both H1 and at least one H2 is essential for COE dimer formation and that the ancestral H1-L-H2a motif mediates dimerisation of COE proteins.
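The logic of the mixing experiment, in which a full-length and a truncated protein give three DNA/protein complexes, can be sketched by enumerating the possible dimers. The random-pairing baseline below is an assumption introduced for illustration only; it is not a model used in this study, and, as noted above, band intensities in a gel-shift assay also depend on DNA binding affinity.

```python
# Enumerate the dimers expected when two dimerising subunits are mixed.
# Assumption: the random-pairing fractions are a simple null model, not a
# description of the measured band intensities.
from itertools import combinations_with_replacement


def dimer_species(subunits):
    """All unordered pairs of subunits, homodimers included."""
    return list(combinations_with_replacement(subunits, 2))


def random_pairing_fractions(mole_fractions):
    """Expected share of each dimer if subunits pair at random."""
    fractions = {}
    for a, b in combinations_with_replacement(list(mole_fractions), 2):
        product = mole_fractions[a] * mole_fractions[b]
        # A heterodimer can form in two orders (a-b and b-a).
        fractions[(a, b)] = product if a == b else 2 * product
    return fractions


species = dimer_species(["Col", "EBF*"])
baseline = random_pairing_fractions({"Col": 0.5, "EBF*": 0.5})
print(species)
print(baseline)
```

With equal amounts of the two subunits, random pairing would give a 1:2:1 distribution of the three species, which offers one possible reference point when judging whether heterodimers are over-represented.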
A two-step evolutionary scenario for inclusion of H2d in vertebrate EBF
All gnathostome coe/ebf cDNAs analysed to date include H2d. To investigate the possible mechanisms behind this inclusion, we compared in detail the genomic structure of the HLH region between gnathostomes and their closest relatives, the urochordates, for which genomic sequences are available. Each of the 3 α-helical repeats present in human EBF (H1, H2d and H2a) is encoded by a separate exon (exons E11, E12 and E13, separated by introns i11 and i12, respectively, Fig. 2B,C). An intron also found between H1 and H2a in C. intestinalis Ci-coe, designated i11-12, must have predated the separation between urochordates and vertebrates. The presence of this intron in other deuterostomes, the cephalochordate B. floridae and the echinoderm Strongylocentrotus purpuratus, but also in the cnidarian N. vectensis (Fig. 2B), confirmed its ancestral character (Fig. 4A). Introns i11, i12 and i11-12 are all phase 0 introns, which could indicate a simple scenario whereby a tandem duplication of the exon encoding ancestral H2a would be at the origin of vertebrate H2d. Such a straightforward scenario is not compatible, however, with the position and phase of the next intron on the 3' side (i13) which, both in C. intestinalis and mammals, is neither situated immediately downstream of the H2 coding sequence nor a phase 0 intron, but a phase 1 intron (Fig. 2B and 4A,B). Since only symmetrical exons can be inserted into introns of the same phase without disrupting the downstream reading frame, this observation rules out a simple exon E12 duplication and implies a secondary event. We therefore propose the following two-step model. First, a duplication of exon E13 (Fig. 4B) led to a situation where either the ancestral or the duplicated exon, but not both, could be incorporated in the mature mRNA, a classical case of mandatory alternative splicing [48, 49]. Second, selection of a new splice donor site, a few nucleotides downstream of the H2d coding region, restored a phase 0 intron (Fig. 4C), allowing both H2a and H2d to be inserted into the reading frame (Fig. 4D). Other possible scenarios were envisaged, using different intronic recombination events, but none appeared to be more parsimonious than the two-step scenario that we put forward here. Of the two vertebrate H2 helices, the C-terminal one is the more closely related to the single H2 of invertebrates (Fig. 2D and S2), indicating that the duplicated helix has started to diverge. The "cassette" H2d exon can theoretically be inserted into or removed from the transcript without affecting the rest of the protein. Removal of the H2d exon from the mature ebf mRNAs through exon-skipping (Fig. 4D, dashed line) would restore an invertebrate-like protein. However, we could not find evidence for H2d exon skipping in gnathostome COE proteins, either by surveying ESTs annotated in the databanks (AceView) or by PCR amplification of this coding region in mouse coe1, coe2 and coe4, using specifically designed primers with mRNA from several different tissues (data not shown). We therefore conclude that in gnathostomes, the prevalent form of COE proteins results from the compulsory inclusion of H2d. This inclusion represents an interesting case of counter-selection of exon-skipping, a mechanism widely used in vertebrates to amplify the register of protein products and their differential expression during development [49, 50].
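The phase argument behind the two-step model can be followed with a minimal sketch in which each exon is reduced to the phases of its flanking introns, and a transcript is considered in frame when each exon ends in the phase in which the next one begins. The phases of i11, i12, i11-12 (phase 0) and i13 (phase 1) come from the text; the `Exon` type, the function name and the phases marked as placeholders are assumptions made for illustration.

```python
# Junction-by-junction frame check for the two-step H2d inclusion model.
# Phases of i11, i12, i11-12 (0) and i13 (1) are from the text; values
# marked "placeholder" are hypothetical and do not affect the checks.
from typing import NamedTuple


class Exon(NamedTuple):
    name: str
    phase_in: int   # phase of the intron upstream of the exon
    phase_out: int  # phase of the intron downstream of the exon


def in_frame(exons):
    """True if every exon ends in the phase in which the next one starts."""
    return all(a.phase_out == b.phase_in for a, b in zip(exons, exons[1:]))


H1 = Exon("H1", 0, 0)            # phase_in is a placeholder
H2A = Exon("H2a", 0, 1)          # ancestral exon: phase 0 in, phase 1 out
H2_COPY = Exon("H2 copy", 0, 1)  # step 1: tandem copy of the ancestral exon
H2D = Exon("H2d", 0, 0)          # step 2: new donor site restores phase 0
NEXT = Exon("downstream", 1, 0)  # phase_out is a placeholder

print(in_frame([H1, H2A, NEXT]))           # ancestral gene: True
print(in_frame([H1, H2_COPY, H2A, NEXT]))  # step 1, both copies: False
print(in_frame([H1, H2_COPY, NEXT]))       # step 1, copy alone: True
print(in_frame([H1, H2D, H2A, NEXT]))      # step 2, H2d and H2a: True
```

In this representation, the first step yields two mutually exclusive exons, and only the change of the downstream phase of the duplicated exon from 1 to 0 lets both helices be read in the same transcript; skipping H2d after step 2 gives back the ancestral arrangement, which is also in frame.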
The COE family of transcription factors was first defined by the sequence similarity between rodent EBF/Olf-1 and Drosophila Col [1–3]. Cloning of a coe cDNA from the cnidarian N. vectensis and identification of coe sequences in another cnidarian, Hydra magnipapillata, and a poriferan, the sponge Amphimedon queenslandica [4, 5], strengthened the conclusion that coe genes are metazoan-specific genes. Our systematic BLAST search for coe orthologs in DNA sequence databanks confirmed that coe genes are metazoan genes present at a single copy per genome, except for vertebrates. It further showed a remarkable degree of conservation of the coe genomic structure throughout metazoan evolution, except for one exon duplication in the vertebrate lineage.
The scattered structure of coe genes
All introns found in the Nvcoe gene of the cnidarian N. vectensis are also found, at the same position, in deuterostomes and at least one of the protostomes examined, suggesting that this scattered organisation was already present in the metazoan ancestral coe gene. In the case of the DBD, which is both specific to COE proteins and conserved to the same degree over its entire length, a split structure into 7 exons was rather unexpected. Moreover, we could not find evidence for exon shuffling with other gene families, consistent with the conserved asymmetric intron phases, but leaving open the question of how this unique DNA-binding domain was assembled in the genome. The HCCC zinc finger structure proposed to be an essential feature of the EBF DNA-binding domain is itself encoded by two exons, already in the last common cnidarian/bilaterian ancestor (E5 and E6, Fig. 2B), suggesting a bipartite origin. Since exon E6 is symmetrical, it can possibly be subject to regulated exon-skipping, allowing for the production of different protein isoforms with putatively different functions. Whereas there is some preliminary evidence for it, as a subclass of human EBF1 cDNAs may differ from the main class by the loss of exon E6 (AceView), the i5 intron has been lost in some protostomes, such as Drosophila melanogaster (Fig. 2B). Systematic genome sequencing programs should soon give access to the coe gene structure in many additional phyla, including sister clades of bilaterians. This offers the exciting prospect of deeper insight into the evolutionary roots of the coe gene family and its scattered genomic organisation.
The ancestral COE HLH motif revisited
Sequence similarity of the ancestral COE H1LH2 motif with the HLH motif of basic-HLH proteins has led to the classification of COE proteins as a distant subgroup in this superfamily of proteins, despite their distinctive DBD and additional protein domains [5, 51]. In vitro DNA-binding assays show that the H1LH2 motif is required for binding of COE proteins to DNA as dimers. This conclusion differs from the initial report that the EBF dimerisation motif was H2d-H2a, a conclusion that was based on the analysis of two different deletions in EBF. Indeed, an internal deletion of EBF removing amino acids 296 to 367 (EBFΔ296–367), namely H1 and part of the IPT domain, was reported to lower but not prevent dimer formation. Since we found that removal of H1 alone in either EBF or Col abolished dimer formation, one possibility is that the presence of the IPT domain interferes with the ability of the H2d-H2a repeat to mediate homophilic interactions. In support of this possibility, the H2d-H2a repeat, when taken out of its normal context, is able to promote formation of dimers, as shown by using a truncated nuclear hormone receptor lacking its own dimerization domain. The high degree of sequence conservation of the COE IPT domain (see Additional file 3) suggests that this domain is subject to very stringent structural and functional constraints. Together, our results from DNA-binding assays and those reported previously further suggest that the positioning of the IPT and HLH domains in relation to one another is a critical aspect of COE dimer formation. Hagman et al., 1995 also reported that a modified EBF protein lacking amino acids 370 to 383 (EBFΔ370–383), i.e., part of H2d, leaving H2a intact (see Additional file 5), showed a drastically reduced level of binding to mb1 DNA, suggesting that H2d was essential for forming EBF homodimers. Yet, the deletion of amino acids 370 to 383 removes not only part of H2d but also part of the linker separating H1 and H2 (see Additional file 5).
Our data suggest that it is removal of this linker, rather than of H2d itself, which prevents EBF dimer formation. The conservation of sequence and genomic structure of this proline-rich linker throughout metazoan evolution (see Additional files 2 and 3) supports a key role in positioning H1 and H2 relative to each other and in contributing to the DNA-binding specificity of COE dimers. Efficient in vitro binding to DNA of Col dimers, Col/EBF heterodimers or dimers of EBF isoforms lacking either H2a (Fig. 3) or H2d indicates that the H2 duplication is not essential for EBF dimer formation. Nevertheless, inclusion of a duplicated helix 2 raises the interesting possibility that it could result in increased partnership flexibility and functional versatility of the vertebrate COE proteins. The observation that Col/EBF heterodimers form and/or bind to DNA more efficiently, at least in vitro, raises the speculative hypothesis that this could have been the initial force behind the selection of H2 exon inclusion.
Counter-selection of alternative splicing
Together, comparison of coe gene structures between echinoderms, cephalochordates, urochordates and a wide range of vertebrates, including cyclostomes, suggests that the duplication of H2 occurred in the vertebrate ancestor and resulted from an exon-duplication event. This is the only major change in the modular structure of COE proteins that appears to have been fixed throughout metazoan evolution. Exon duplication is one widely used mechanism for adding a coding region within an existing gene. Alternative splicing of duplicated exons has been postulated to favor protein diversification, since each exon can, in principle, evolve independently of the other [48, 49]. Recent genomic studies have suggested that 40–60% of human genes are alternatively spliced, and comparative analysis of close to 10,000 orthologous genes in human and mouse has shown that alternative splicing is frequently associated with recent exon creation and/or loss. However, other studies suggest that the contribution of gene duplication, followed by sequence divergence and alternative splicing, to the diversification of the protein repertoire could be substantially different. In the case of the vertebrate coe genes, alternative splicing was not selected by evolution following the H2 exon duplication, since both H2 repeats are incorporated in the EBF proteins. Taking into account the splice frame rules, we put forward here an original two-step model to account for the inclusion of H2d in vertebrate COE proteins (Fig. 4). The first step in our model is a classical tandem duplication of an "ancestral" H2a coding exon. However, this exon was probably not symmetrical (see Fig. 2B) and, due to the splice frame rule, only the ancestral or the duplicated exon could be incorporated in the coding transcript without disrupting the open reading frame, a classical case of mandatory alternative splicing [48, 49].
We believe that, in a second step, inclusion of the duplicated exon occurred via the activation of a phase 0 splice donor site, 3' to H2 in the duplicated exon. This allowed the incorporation of H2d while preserving the open reading frame (see Fig. 4D). To our knowledge, such a two-step selection of a cassette exon has not yet been invoked for other proteins.
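The splice frame argument behind this two-step model can be illustrated with a minimal sketch. Each exon is represented only by the phases of its flanking introns, and a transcript is taken to stay in frame when every spliced junction joins a donor and an acceptor of the same phase. The function names and all phase values other than the phase 0 donor are hypothetical choices made for illustration; they are not read from Fig. 2B or Fig. 4, and the sketch ignores any change in coding sequence caused by the new donor site.

```python
# Each exon is described by the phases of the introns that flank it:
# (acceptor_phase, donor_phase), with phases in {0, 1, 2}.
# All phase values below are illustrative assumptions, not data from the article.

def is_symmetrical(exon):
    """An exon is symmetrical when both flanking introns have the same phase."""
    acceptor_phase, donor_phase = exon
    return acceptor_phase == donor_phase


def keeps_frame(exons):
    """Splice frame rule: each junction must join a donor and an acceptor of equal phase."""
    return all(up[1] == down[0] for up, down in zip(exons, exons[1:]))


upstream = (0, 0)      # hypothetical exon preceding H2
h2_ancestral = (0, 2)  # hypothetical non-symmetrical ancestral H2 exon
downstream = (2, 0)    # hypothetical exon following H2

# Step 1: tandem duplication places an identical copy upstream of the ancestral exon.
h2_copy = h2_ancestral
both_after_step1 = keeps_frame([upstream, h2_copy, h2_ancestral, downstream])
either_after_step1 = (
    keeps_frame([upstream, h2_copy, downstream])
    and keeps_frame([upstream, h2_ancestral, downstream])
)

# Step 2: a phase 0 donor site becomes active in the duplicated exon.
h2d = (h2_copy[0], 0)
both_after_step2 = keeps_frame([upstream, h2d, h2_ancestral, downstream])
skip_after_step2 = keeps_frame([upstream, h2_ancestral, downstream])

print(both_after_step1, either_after_step1, both_after_step2, skip_after_step2)
```

Under these assumed phases, including both copies breaks the frame after step 1, whereas either copy alone is tolerated, which corresponds to mandatory alternative splicing. After step 2 the duplicated exon is symmetrical, so it can be included together with the ancestral exon or skipped without disrupting the reading frame.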
While our data underline the conservation of COE protein structure throughout evolution, the molecular mechanisms underlying the cell-context dependence of COE regulatory targets remain unknown. For example, mouse EBF/COE1 or EBF2/COE2 can substitute for Col activity in UAS-Gal4 transgenic assays, using as a paradigm Col function in patterning of the wing, indicating that Col and EBF are able to regulate a similar set of genes in a tissue-dependent manner (see Additional file 6). So far, little insight has been obtained from systematic searches for direct protein interactors of EBF or Col [55, 56]. This remains a pre-eminent question in view of the evolutionary diversification of the biological functions of COE proteins revealed by mutant analyses in mouse, C. elegans and Drosophila [57–60]. Within this context, more extensive analysis of the genomic structure, expression and function of COE proteins in other phyla could be of primary interest.
Our systematic BLAST search for coe (collier/olf-1/ebf) orthologs in DNA sequence databanks confirmed that coe genes are metazoan genes present as a single copy per genome, except in vertebrates. It further showed a remarkable degree of conservation of the coe genomic structure throughout metazoan evolution, except for one exon duplication in the vertebrate lineage, leading to a modified dimerisation domain of structure H1LH2dH2a in vertebrates and H1LH2a in all other metazoans. Taking into account the splice frame rules, we put forward here an original two-step model to account for H2d inclusion in vertebrate COE proteins. The vertebrate gene configuration is such that it remains possible to remove H2d by exon-skipping. However, the presence of both H2d and H2a in all gnathostome coe/ebf transcripts characterised to date indicates that, in this case, exon-skipping is strongly counter-selected. While in vitro experiments indicate that the H2 duplication is not essential for binding of COE proteins to DNA as dimers, it raises the interesting possibility that it could result in increased partnership flexibility and functional versatility of the vertebrate COE proteins.
Hagman J, Belanger C, Travis A, Turck CW, Grosschedl R: Cloning and functional characterization of early B-cell factor, a regulator of lymphocyte-specific gene expression. Genes Dev. 1993, 7: 760-73. 10.1101/gad.7.5.760.
Crozatier M, Valle D, Dubois L, Ibnsouda S, Vincent A: Collier, a novel regulator of Drosophila head development, is expressed in a single mitotic domain. Curr Biol. 1996, 6: 707-18. 10.1016/S0960-9822(09)00452-7.
Wang MM, Reed RR: Molecular cloning of the olfactory neuronal transcription factor Olf-1 by genetic selection in yeast. Nature. 1993, 364: 121-6. 10.1038/364121a0.
Pang K, Matus DQ, Martindale MQ: The ancestral role of COE genes may have been in chemoreception: evidence from the development of the sea anemone, Nematostella vectensis (Phylum Cnidaria; Class Anthozoa). Dev Genes Evol. 2004, 214: 134-8. 10.1007/s00427-004-0383-7.
Simionato E, Ledent V, Richards G, Thomas-Chollier M, Kerner P, Coornaert D, Degnan BM, Vervoort M: Origin and diversification of the basic helix-loop-helix gene family in metazoans: insights from comparative genomics. BMC Evol Biol. 2007, 7: 33. 10.1186/1471-2148-7-33.
Garel S, Marin F, Mattei MG, Vesque C, Vincent A, Charnay P: Family of Ebf/Olf-1-related genes potentially involved in neuronal differentiation and regional specification in the central nervous system. Dev Dyn. 1997, 210: 191-205. 10.1002/(SICI)1097-0177(199711)210:3<191::AID-AJA1>3.0.CO;2-B.
Wang SS, Betz AG, Reed RR: Cloning of a novel Olf-1/EBF-like gene, O/E-4, by degenerate oligo-based direct selection. Mol Cell Neurosci. 2002, 20: 404-14. 10.1006/mcne.2002.1138.
Wang SS, Tsai RY, Reed RR: The characterization of the Olf-1/EBF-like HLH transcription factor family: implications in olfactory gene regulation and neuronal development. J Neurosci. 1997, 17: 4149-58.
Dubois L, Bally-Cuif L, Crozatier M, Moreau J, Paquereau L, Vincent A: XCoe2, a transcription factor of the Col/Olf-1/EBF family involved in the specification of primary neurons in Xenopus. Curr Biol. 1998, 8: 199-209. 10.1016/S0960-9822(98)70084-3.
Prasad BC, Ye B, Zackhary R, Schrader K, Seydoux G, Reed RR: unc-3, a gene required for axonal guidance in Caenorhabditis elegans, encodes a member of the O/E family of transcription factors. Development. 1998, 125: 1561-8.
Dubois L, Vincent A: The COE–Collier/Olf1/EBF–transcription factors: structural conservation and diversity of developmental functions. Mech Dev. 2001, 108: 3-12. 10.1016/S0925-4773(01)00486-5.
Mazet F, Masood S, Luke GN, Holland ND, Shimeld SM: Expression of AmphiCoe, an amphioxus COE/EBF gene, in the developing central nervous system and epidermal sensory neurons. Genesis. 2004, 38: 58-65. 10.1002/gene.20006.
Wang SS, Lewcock JW, Feinstein P, Mombaerts P, Reed RR: Genetic disruptions of O/E2 and O/E3 genes reveal involvement in olfactory receptor neuron projection. Development. 2004, 131: 1377-88. 10.1242/dev.01009.
Kim K, Colosimo ME, Yeung H, Sengupta P: The UNC-3 Olf/EBF protein represses alternate neuronal programs to specify chemosensory neuron identity. Dev Biol. 2005, 286: 136-48. 10.1016/j.ydbio.2005.07.024.
Lin H, Grosschedl R: Failure of B-cell differentiation in mice lacking the transcription factor EBF. Nature. 1995, 376: 263-7. 10.1038/376263a0.
Crozatier M, Ubeda JM, Vincent A, Meister M: Cellular immune response to parasitization in Drosophila requires the EBF orthologue collier. PLoS Biol. 2004, 2: E196. 10.1371/journal.pbio.0020196.
Hagman J, Gutch MJ, Lin H, Grosschedl R: EBF contains a novel zinc coordination motif and multiple dimerization and transcriptional activation domains. Embo J. 1995, 14: 2907-16.
Bork P, Doerks T, Springer TA, Snel B: Domains in plexins: links to integrins and transcription factors. Trends Biochem Sci. 1999, 24: 261-3. 10.1016/S0968-0004(99)01416-4.
Liberg D, Sigvardsson M, Akerblad P: The EBF/Olf/Collier family of transcription factors: regulators of differentiation in cells originating from all three embryonal germ layers. Mol Cell Biol. 2002, 22: 8389-97. 10.1128/MCB.22.24.8389-8397.2002.
Dehal P, Boore JL: Two rounds of whole genome duplication in the ancestral vertebrate. PLoS Biol. 2005, 3: e314. 10.1371/journal.pbio.0030314.
Eukaryotic Genomics. [http://genome.jgi-psf.org/]
Sea Urchin Genome Project. [http://www.hgsc.bcm.tmc.edu/projects/seaurchin/]
Ensembl Genomes. [http://www.ensembl.org/index.html]
Pre!Ensembl lamprey. [http://pre.ensembl.org/Petromyzon_marinus/index.html]
Birney E, Clamp M, Durbin R: GeneWise and Genomewise. Genome Res. 2004, 14: 988-95. 10.1101/gr.1865504.
Thierry-Mieg D, Thierry-Mieg J: AceView: a comprehensive cDNA-supported gene and transcripts annotation. Genome Biol. 2006, 7 (Suppl 1): 1-14. 10.1186/gb-2006-7-s1-s12.
Edgar RC: MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 2004, 32: 1792-7. 10.1093/nar/gkh340.
Guindon S, Gascuel O: A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood. Syst Biol. 2003, 52: 696-704. 10.1080/10635150390235520.
Ronquist F, Huelsenbeck JP: MrBayes 3: Bayesian phylogenetic inference under mixed models. Bioinformatics. 2003, 19: 1572-4. 10.1093/bioinformatics/btg180.
Kishino H, Hasegawa M: Converting distance to time: application to human evolution. Methods Enzymol. 1990, 183: 550-70.
Ho SN, Hunt HD, Horton RM, Pullen JK, Pease LR: Site-directed mutagenesis by overlap extension using the polymerase chain reaction. Gene. 1989, 77: 51-9. 10.1016/0378-1119(89)90358-2.
Travis A, Hagman J, Hwang L, Grosschedl R: Purification of early-B-cell factor and characterization of its DNA-binding specificity. Mol Cell Biol. 1993, 13: 3392-400.
Crozatier M, Glise B, Vincent A: Connecting Hh, Dpp and EGF signalling in patterning of the Drosophila wing; the pivotal role of collier/knot in the AP organiser. Development. 2002, 129: 4261-9.
Doolittle RF: The multiplicity of domains in proteins. Annu Rev Biochem. 1995, 64: 287-314. 10.1146/annurev.bi.64.070195.001443.
Lynch M, Conery JS: The origins of genome complexity. Science. 2003, 302: 1401-4. 10.1126/science.1089370.
Carmel L, Rogozin IB, Wolf YI, Koonin EV: Evolutionarily conserved genes preferentially accumulate introns. Genome Res. 2007, 17: 1045-50. 10.1101/gr.5978207.
Carmel L, Wolf YI, Rogozin IB, Koonin EV: Three distinct modes of intron dynamics in the evolution of eukaryotes. Genome Res. 2007, 17: 1034-44. 10.1101/gr.6438607.
Rogozin IB, Wolf YI, Sorokin AV, Mirkin BG, Koonin EV: Remarkable interkingdom conservation of intron positions and massive, lineage-specific intron loss and gain in eukaryotic evolution. Curr Biol. 2003, 13: 1512-7. 10.1016/S0960-9822(03)00558-X.
Patthy L: Intron-dependent evolution: preferred types of exons and introns. FEBS Lett. 1987, 214: 1-7. 10.1016/0014-5793(87)80002-9.
Roy SW, Gilbert W: The evolution of spliceosomal introns: patterns, puzzles and progress. Nat Rev Genet. 2006, 7: 211-21.
Massari ME, Murre C: Helix-loop-helix proteins: regulators of transcription in eucaryotic organisms. Mol Cell Biol. 2000, 20: 429-40. 10.1128/MCB.20.2.429-440.2000.
Schubert M, Holland ND, Escriva H, Holland LZ, Laudet V: Retinoic acid influences anteroposterior positioning of epidermal sensory neurons and their gene expression in a developing chordate (amphioxus). Proc Natl Acad Sci USA. 2004, 101: 10320-5. 10.1073/pnas.0403216101.
Delsuc F, Brinkmann H, Chourrout D, Philippe H: Tunicates and not cephalochordates are the closest living relatives of vertebrates. Nature. 2006, 439: 965-8. 10.1038/nature04336.
Hagman J, Travis A, Grosschedl R: A novel lineage-specific nuclear factor regulates mb-1 gene transcription at the early stages of B cell differentiation. Embo J. 1991, 10: 3409-17.
Kudrycki K, Stein-Izsak C, Behn C, Grillo M, Akeson R, Margolis FL: Olf-1-binding site: characterization of an olfactory neuron-specific promoter motif. Mol Cell Biol. 1993, 13: 3002-14.
Kondrashov FA, Koonin EV: Origin of alternative splicing by tandem exon duplication. Hum Mol Genet. 2001, 10: 2661-9. 10.1093/hmg/10.23.2661.
Letunic I, Copley RR, Bork P: Common exon duplication in animals and its role in alternative splicing. Hum Mol Genet. 2002, 11: 1561-7. 10.1093/hmg/11.13.1561.
Graveley BR: Alternative splicing: increasing diversity in the proteomic world. Trends Genet. 2001, 17: 100-7. 10.1016/S0168-9525(00)02176-4.
Ledent V, Vervoort M: The basic helix-loop-helix protein family: comparative genomics and phylogenetic analysis. Genome Res. 2001, 11: 754-70. 10.1101/gr.177001.
Modrek B, Lee CJ: Alternative splicing in the human, mouse and rat genomes is associated with an increased frequency of exon creation and/or loss. Nat Genet. 2003, 34: 177-80. 10.1038/ng1159.
Talavera D, Vogel C, Orozco M, Teichmann SA, de la Cruz X: The (in)dependence of alternative splicing and gene duplication. PLoS Comput Biol. 2007, 3: e33. 10.1371/journal.pcbi.0030033.
Brand AH, Perrimon N: Targeted gene expression as a means of altering cell fates and generating dominant phenotypes. Development. 1993, 118: 401-15.
Formstecher E, Aresta S, Collura V, Hamburger A, Meil A, Trehin A, Reverdy C, Betin V, Maire S, Brun C, et al: Protein interaction mapping: a Drosophila case study. Genome Res. 2005, 15: 376-84. 10.1101/gr.2659105.
Tsai RY, Reed RR: Identification of DNA recognition sequences and protein interaction domains of the multiple-Zn-finger protein Roaz. Mol Cell Biol. 1998, 18: 6447-56.
Kieslinger M, Folberth S, Dobreva G, Dorn T, Croci L, Erben R, Consalez GG, Grosschedl R: EBF2 regulates osteoblast-dependent differentiation of osteoclasts. Dev Cell. 2005, 9: 757-67. 10.1016/j.devcel.2005.10.009.
Croci L, Chung SH, Masserdotti G, Gianola S, Bizzoca A, Gennarini G, Corradi A, Rossi F, Hawkes R, Consalez GG: A key role for the HLH transcription factor EBF2 (COE2, O/E-3) in Purkinje neuron migration and cerebellar cortical topography. Development. 2006, 133: 2719-29. 10.1242/dev.02437.
Baumgardt M, Miguel-Aliaga I, Karlsson D, Ekman H, Thor S: Specification of neuronal identities by feedforward combinatorial coding. PLoS Biol. 2007, 5: e37. 10.1371/journal.pbio.0050037.
Lagergren A, Mansson R, Zetterblad J, Smith E, Basta B, Bryder D, Akerblad P, Sigvardsson M: The Cxcl12, periostin, and Ccl9 genes are direct targets for early B-cell factor in OP-9 stroma cells. J Biol Chem. 2007, 282: 14454-62. 10.1074/jbc.M610263200.
We are grateful to Patrick Wincker and Corinne Da Silva (Genoscope and UMR 8030) for help with lamprey EST sequencing. We also thank Julian Smith and Jean Deutsch for critical reading of early versions of the manuscript and Serge Plaza and members of our laboratory for discussion. This research was supported by CNRS and Ministère de la Recherche (ACI BCMS) and CNRG. S. Mella was supported by a doctoral fellowship from Ministère de la Recherche.
VD and SeM carried out the experimental work; SeM, J–LP and SyM carried out the phylogenetic analyses; SeM, MC and AV contributed the conceptual framework; and SyM and AV wrote the manuscript. All authors read and approved the final manuscript.
Virginie Daburon and Sébastien Mella contributed equally to this work.
Electronic supplementary material
Additional file 5: Diagrammatic alignment of EBF and Col protein amino-acid sequences. The sequence alignment provided shows the position of the H1 and H2 deletions introduced in modified versions of EBF and Col used in DNA-binding assays (Fig. 3) and the EBF internal deletion described in Hagman et al., 1995. (PDF 625 KB)
About this article
Cite this article
Daburon, V., Mella, S., Plouhinec, J. et al. The metazoan history of the COE transcription factors. Selection of a variant HLH motif by mandatory inclusion of a duplicated exon in vertebrates. BMC Evol Biol 8, 131 (2008) doi:10.1186/1471-2148-8-131
- Splice Donor Site
- Vertebrate Lineage
- Exon Shuffling
- Metazoan Evolution
- Intron Phase