- Methodology article
- Open Access
- Published:

# Core set approach to reduce uncertainty of gene trees

*BMC Evolutionary Biology*
**volume 6**, Article number: 41 (2006)

## Abstract

### Background

A genealogy based on gene sequences within a species plays an essential role in the estimation of the character, structure, and evolutionary history of that species. Because intraspecific sequences are more closely related than interspecific ones, detailed information on the evolutionary process may be available by determining all the node sequences of trees and provide insight into functional constraints and adaptations. However, strong evolutionary correlations on a few lineages make this determination difficult as a whole, and the maximum parsimony (MP) method frequently allows a number of topologies with a same total branching length.

### Results

Kitazoe et al. developed multidimensional vector-space representation of phylogeny. It converts additivity of evolutionary distances to orthogonality among the vectors expressing branches, and provides a unified index to measure deviations from the orthogoality. In this paper, this index is used to detect and exclude sequences with large deviations from orthogonality, and then selects a maximum subset ("core set") of sequences for which MP generates a single solution. Once the core set tree is formed whose all the node sequences are given, the excluded sequences are found to have basically two phylogenetic positions on this tree, respectively. Fortunately, since multiple substitutions are rare in intra-species sequences, the variance of nucleotide transitions is confined to a small range. By applying the core set approach to 38 partial *env* sequences of HIV-1 in a single patient and also 198 mitochondrial COI and COII DNA sequences of *Anopheles dirus*, we demonstrate how consistently this approach constructs the tree.

### Conclusion

In the HIV dataset, we confirmed that the obtained core set tree is the unique maximum set for which MP proposes a single tree. In the mosquito data set, the fluctuation of nucleotide transitions caused by the sequences excluded from the core set was very small. We reproduced this core-set tree by simulation based on random process, and applied our approach to many sets of the obtained endpoint sequences. Consequently, the ninety percent of the endpoint sequences was identified as the core sets and the obtained node sequences were perfectly identical to the true ones.

## Background

Molecular phylogeny at the intraspecific level has proven to be very useful for a variety of studies, such as estimating the genealogical relationships among genes, determining the common ancestor of a group of organisms, evaluating the development of drug resistance in infectious diseases, detecting positive selection for immune escape, and predicting future trends of diseases [1–7]. The sequences within a species are similar each other compared with those from different species. Therefore, it is generally expected that the pairwise difference is a good approximation to the evolutionary distance, and the estimated phylogeny by existing methods (maximum parsimony (MP) [8, 9], statistical parsimony (SP) [13–17] and maximum likelihood (ML) [11, 12]) is reliable and has little uncertainty. However, there appear sometimes abnormal phenomena in which MP picks up a prohibitive number of equally parsimonious trees [13]. Such phenomena suggest the presence of strong evolutionary correlations among different lineages. Then, a nested analysis approach [18–24] makes a complicated network mapping in a nested analysis and makes it impossible to trace the evolutionary history. In this situation, when the approximation of the pairwise difference is well satisfied as a whole, we can expect that the above abnormal phenomena are caused by a small number of sequences in a given dataset.

Recently, Kitazoe et al. [25, 26] proposed multidimensional vector-space (MVS) representation in which the estimated pairwise distances can be reproduced by using Pythagorean theorem. When a distance matrix is compatible with a tree structure, sequences are expressed by the composites of branch vectors that are orthogonal among others. Therefore, MVS representation enables to measure the deviation from the tree structure which subsists in the initial pairwise differences. In this paper, this measured deviation serves as a unified index in the core set approach developed below.

We principally select a maximum subset ("core set") of sequences for which MP generates a single solution, by excluding sequences with large values of the deviation index, and we can uniquely determine all the node sequences of the core set tree. Here, we assume that this core set tree is robust and preserved by insertion of the excluded sequences. This assumption seems to be reasonable when the number of the excluded sequences is small. The core set approach makes the reason of the exclusion clear, and therefore makes it possible to estimate the nucleotide transitions within a small range of uncertainty. Indeed, we found through applications that the excluded sequences without recombination have two phylogenetic positions, respectively. A more detailed procedure of the core set approach is documented in the followings:

We first select a maximum number ("primary core (P-core) set") of sequences (the procedure (a) in Methods) whose distance sub-matrix gives no deviation from additivity of evolutionary distances. Existing methods (MP, SP, ML and NJ) produce the same single tree as the P-core set tree. Second, the excluded sequences are added on the P-core set (the procedure (b) in Methods). This insertion provides a maximum set ("secondary core (S-core) set") of sequences for which MP produces a single tree, and the S-core set comprises a global mainframe of the final tree structure. However, long branching attractions may sometimes prevent a further restore of the excluded sequences into the S-core set. In such a case, we divide the S-core set into several subgroups and the two procedures of exclusion and insertion are repeated in the subgroups on the additional condition of preserving the S-core set tree structure.

We apply the core set method to partial *env* sequences of HIV-1 (C2-V5), to demonstrate how our approach realizes reasonable tree building. We next employ the method for a phylogenetic analysis of mitochondrial sequences of *Anopheles dirus*, which characterize a simultaneous radiation pattern. Then, we perform simulation studies and confirm that our core-set is mostly a large proportion of the original dataset and that the inferred tree for the S-core set is consistent with the true tree of the whole sequences.

## Results

### 1) HIV phylogeny inference by constructing only a S-core set tree

We applied our method to partial *env* sequences of HIV-1 (C2-V5) that were obtained in a longitudinal study [27] of many patients. We analyzed a subgroup of 38 nucleotide sequences from the patient-1, which comprise an obvious monophyletic tree and are separated from the other subgroups in MP consensus tree. Hence, there unlikely exist substitutions which make phylogenetic inference much difficult and this data set provided a good example to demonstrate the validity of our method (these sequences and their accession numbers are listed in Additional file 1 1). We first selected a P-core set of sequences, which made the total deviation index W (Eq. (3) in Methods) equal to zero by performing the procedure (a) in Methods. The sequence number *1*–*21* in Fig. 1b stands for this P-core set member. Here, standard methods such as NJ, MP and SP provided the same tree, because there were no multiple substitutions in the P-core set. We next tried to incorporate the excluded 17 sequences sequentially into the P-core set by performing the procedure (b) in Methods. As a result, the sequence number *22*–*27* were further deposited on the P-core set tree, and a S-core set tree of the total 27 sequences was obtained (Fig. 1b). Here, the S-core set produced the set of maximum number of sequences for which MP provides a single tree. Indeed, we confirmed that there is no other subset with 27 sequences except for our S-core set which has a single MP tree by examining all possible combinations between 27 (and more) sequences and the remaining sequences. The number of sequences in the other subsets was always less than 27. We further performed the procedure (c) that decomposes the S-core set into subgroups. However, we found that there was no more sequence which could be added by the criterion of procedure (c). In this way, our final tree was constructed by only once performing the two procedures (a) and (b) in Methods. We here note by investigating all possible combinations of two groups that the P-core set of 21 sequences is a maximum number set that satisfies the additivity but the second P-core set is given by replacing the 16^{th} sequence by the 22^{th} sequence. With starting this second set, however, the original S-core set of 27 sequences did not change.

As an extension of our core-set approach, we could add further seven sequences on the S-core set (sequences *28*–*34*, designated # marks (Fig. 2b)), which could be given unique phylogenetic positions on the condition of preserving this tree by the criterion of procedure (b). Consequently, the T-core set was constructed (procedure (b) in Methods). Here, MP method provided multiple solutions of phylogeny (300 tree patterns with MP software package PAUP* [9]) (Fig. 1). A further inclusion of sequences *35*–*38* with larger deviations (denoted *) generated an explosion of topologies (14529 tree patterns) with PAUP*. Some of these patterns disrupted even the core set tree (Fig. 2c). SP tree (TCS version 1.13 software package) [28] inferred from the S-core set (*1*–*27*) separated two sequences (*19* and *27*) from the core set, and the inclusion of all sequences disrupted the S-core set tree (Fig. 2d). One way to address these topological uncertainties of MP and SP is to first fix the topology using ML, and then to determine the node sequences using MP [29]. However, ML + PAUP* also disrupted the core set (Additional file 2). The topology and node sequences were calculated with ML, software package PHYLIP (ML) [12] and PAUP*, respectively.

### 2) Mosquito phylogeny inference by constructing local core-set trees

The core set approach was applied to a data set of mitochondrial COI and COII DNA sequences of *Anopheles dirus* species A and D of Southeast Asia [30, 31]. NJ tree of which the outgroup was the *29th* haplotype was highly distorted and many of branching lengths on the tree represented smaller numbers of single substitutions (Fig. 3). Such abnormal phenomena were due to strongly evolutionary correlations among many different branches.

We first constructed a global core set (S-core set) consisting of 71 haplotypes by using the procedures (a) and (b) in Methods (the black lines of Fig. 4a). The tree structure of this core set represents a simultaneous radiation (star-like) pattern centered on the *29th* haplotype. We next decomposed the global core set into subgroups by using the procedure (c) in Methods. The subgroups obtained by this procedure are shown by the vertical bold lines of Fig. 4a. Here, the *29th* haplotype was included in all groups as their common ancestor. The procedure (a) was then applied to the individual subgroups, and the total 35 haplotypes were added in the seven subgroups (the blue lines of Fig. 4a). Fifty-eight of the remaining 92 haplotypes were inserted into each of the obtained local P-core sets by using the procedure (b) (the red lines of Fig. 4a). The second, third, and fourth columns of Table 1 give the haplotype numbers corresponding to the global core set, the local P-core sets, and the local S-core sets in each group, respectively. As a result, our final tree consists of 164 haplotypes (Fig. 4a). This tree has a finely resolved structure with definite number of substitutions on each branch, in contrast to NJ tree (Fig. 3). The asterisks denote the haplotypes of *A. dirus* D, whereas those of *A. dirus* A are unmarked. Haplotypes of *A. dirus* species A and D are intermingled across the tree. This can be interpreted as resulting from the introgression of mtDNA between the two *A. dirus* species. Details of the biological interpretations will be reported by O'Loughlin et al. [30].

Since we could definitely determine the node sequences in the final tree (Fig. 4a), we estimated the frequencies of nucleotide substitutions. We counted all the nucleotide substitutions from the *29th* haplotype (radiation centre) toward the endpoints. Of the 1537 sites, 176 sites showed nucleotide substitutions (Table 2). Of these, 55 sites had undergone multiple substitutions within the genealogy of the final tree. On the other hand, thirty-one sites had undergone non-synonymous changes, among which only five sites had experienced multiple substitutions. The multiplicity of the 176 variable sites averaged out to 1.8 because there were 312 substitutions in total. However the multiplicity was not distributed homogeneously among the variable sites. The largest number of multiple substitutions at one site which we observed was 15 (Table 2). Of these 312 substitutions, 269 were transitions and 43 were transversions (Table 3a). Non-synonymous changes comprised 10% of the total, and T→C and A→G transitions predominated. This feature became more obvious at small multiplicities with n < 4 (Table 3b). We finally estimated the number of nucleotide transitions by incorporating the 19 sequences excluded from the core set (Table 4a). Although these sequences have two phylogenetic positions, respectively, the collection of all possible patterns provided a small fluctuation in each transition matrix, as seen in this Table.

### 3) Efficacy of the core set approach by evolutionary simulation of random process

To examine the efficacy of the core set approach, we simulated a Markov process of sequence evolution, which resulted in the same phylogenetic tree as the core-set tree derived from the 164 mitochondrial COI and COII DNA sequences of *Anopheles dirus* (Fig. 4a). Here, the root sequences were randomly generated according to the nucleotide frequency distribution of the *29th* haplotype in Fig. 4a. Each nucleotide substitution was randomly generated according to the weight given in Table 3a. Since non-synonymous substitutions were rare in the above analysis, we only changed the third codon positions. Although the site heterogeneity of the evolutionary rate was neglected for simplicity, there is more chance of multiple substitutions because we kept the branch lengths. Application of the core set approach to 10 sets of the terminal sequences retained 90 % of the whole sequences as the global S-core set member and did not require any local core-set trees because of random process without strong evolutionary correlations (long-branch attractions). The global S-core set produced a unique MP tree which is consistent with the simulated tree. Furthermore, the estimated sequences at the internal nodes of the S-core set were perfectly identical to the true ones. Figure 4b illustrates the global S-core set tree in one event (the endpoint sequences of this event are listed in Additional file 3). The inclusion of the remaining 10 % sequences allowed MP to provide a number of topologies around 800. This implies that the only sequences with phylogenetic uncertainty were excluded by the core set approach.

## Discussion

The core set approach first determines a set of the sequences (P-core) that form the tree structure without multiple substitutions by using only the total deviation index. The approach next tries to sequentially insert the excluded sequences into the P-core set tree, according to the min-max criterion. If these sequences have the unique phylogenetic positions on the P-core set, they are incorporated into the P-core set as the S-core set. The min-max criterion is useful for determining order of the insertion. In fact, in the present demonstration of the HIV sequences (Fig. 1), the S-core set was formed without any missing by just stopping the insertion when MP gives multiple solutions. In this way, the core-set approach takes into explicit account of the tree-building process to construct a consistent and robust tree.

A systematic analysis of the mosquito mitochondrial DNA data set also showed that the maximum parsimony criterion was insufficient to produce a single tree structure. To demonstrate this lucidly, we consider only two groups, 3 and 5 in Fig. 4a, which consist of nine and 17 haplotypes including the *29th* haplotype, respectively. The 25 haplotypes of the two groups were analysed by using PAUP*. The analysis generated 15 different branch patterns, five of which were consistent with the branch pattern of the global core set. Only one of these five trees corresponded to our result (Fig. 5a). The other trees showed very different branch patterns from ours, as illustrated in Fig. 5b. Using our approach, we first selected the global core set of seven sequences (black lines), then the local P-core set of 10 sequences (blue lines), and then eight sequences (red lines) with the local S-core set. The fifteen topologies generated by PAUP* were the result of parallel changes between four sites and four branching pairs in the two groups (see the bold lines of Fig. 5a). The application of PAUP* to all 164 haplotypes of Fig. 4a generated more than 100,000 topologies. In this way, the core set approach makes neighbor-joining trees robust by avoiding evolutionary correlations with other groups.

By contrast, SP reduced the topological uncertainty of PAUP* to a large extent. On the other hand, the network representation made it difficult to follow the evolutionary process strictly. Complicated network mapping was demonstrated by applying TCS to the 26 haplotypes of groups 2 and 3 in Fig. 4a (Fig. 6b). The reticulated pattern can be explained by parallel changes (six pairs and one triplet of branches, shown with bold lines in Fig. 6a) mainly between the two groups. In Fig. 6b, removal of the branches marked with asterisks and changing the dotted line to a solid one produces our result (Fig. 6a). Note that PAUP* reproduced the same unique solution as ours. Applying the TCS to all the haplotypes of Fig. 4a generated a much more complicated reticulated genealogy. We applied ML + PAUP* to the 164 haplotypes in Fig. 4a. The result obtained was rather different from that of the core set approach. We demonstrate a typical example of these differences using the two sub-groups 7–1 and 7–2 in Fig. 4a (Additional file 4).

Of the 198 haplotypes, 34 had alternative phylogenetic positions, as discussed in Methods. Thirty-three haplotypes gave the one-site sharing pattern of Fig. 7a, and one haplotype gave the two-site sharing pattern of Fig. 7c. Figure 8 illustrates a typical example of uncertainty whether the *15th* haplotype should be inserted into the *7th* haplotype (node) in group 2 or into the *72nd* haplotype (node) in group 5. This uncertainty exerts a strong influence on the phylogeny, because the *15th* haplotype becomes the ancestor of the *250th*, *271st*, *127th, 221st*, *36th*, *97th*, and *197th* haplotypes, and the last three of these haplotypes confer a further ambiguity because the *97th* haplotype produces the one-site sharing pattern.

The present core set approach implicitly assumes the condition that the global core set includes a majority of sequences and constructs a main frame of tree structure. When we cannot assume this condition, we decompose the total sequences into some groups, by using the criterion of the 100 % branching resolution in MP consensus tree. Then, the present core set approach is applied to each group. The phylogeny among the groups is determined so that the branching lengths to connect them may be minimized. Details of this subject will be discussed elsewhere.

## Conclusion

The global core-set member was first selected from the complete set of sequences by the procedures (a) and (b) in Methods, and comprised the mainframe for the further tree-building construction. The local core-set trees were second constructed by preserving this mainframe. Only the sequences with alternative phylogenetic positions finally remained out of the tree. The core-set approach is assigned so that MP provides single trees for the global core-set and for each of local core-sets. Hence, MP allows multiple solutions for our final set of sequences given by combining the local core-sets. This approach provides a definite and unique tree structure without relying upon probabilistic models of sequence evolution, by excluding the sequences casusing phylogenetic uncertainty. Such a set of well-defined evolutionary pathways will provide increased power in making inferences of demographic history, and be useful in clarifying the precise mechanisms of molecular evolution. One of our next tasks is to detect definitely the sequences associated with recombination [32, 33], by systematically analysing data sets of many HIV-1 patients [27].

## Methods

### General idea of the core set approach

There is generally considerable uncertainty in the inference of a phylogenetic tree. However, in the study of molecular evolution within a population, it may be possible to make more precise inferences because the intraspecific variability of sequences is much less than the interspecific ones. Unfortunately, the existing procedures of phylogenetic inference do not fully incorporate this feature.

In our core set approach, we collect a large subset of sequences from a population, for which a reliable phylogenetic tree can be uniquely estimated. We take the infinite allele model, which neglects multiple substitutions at a nucleotide site, as a first approximation to the molecular evolution within a population. Therefore, we start by finding a subset for which the matrix of pairwise differences (numbers of different sites) is compatible with some unknown tree. This compatibility is measured by the index of deviation from the orthogonality, which we developed via MVS representation of sequences [26]. It is possible to calculate the deviation without constructing a tree. We sequentially exclude sequences that contribute greatly to the deviation until the deviation reaches the value of zero. This procedure provides a maximum set of sequences for a unique tree without multiple substitutions, which is easily estimated by any methods of phylogenetic inference. The second procedure sequentially inserts the excluded sequences into the obtained core-set tree until MP gives multiple trees, to obtain a global core set. The above-mentioned two procedures are repeated in subsets of the global set to obtain a further inclusion of the excluded sequences by removing long branching correlations.

### MVS representation of phylogeny and index of deviation from additivity

The orthogonal feature of the spatial branch vectors in MVS is essential for resolving the tree structure. Because MVS cannot be directly visualized, we use the spatial vectors of some sequences as probes ("search vectors") and investigate how other sequences are branched around the probes. A sequence *o* is selected as one of the probes and placed at the coordinate origin of MVS, from which the spatial positions, R_{o, i}, of other sequences *i* are measured. The inner product, S_{i, j}(*o*), of two vectors, R_{o, i} and R_{o, j}, gives the branch length from the origin *o* to the most recent common ancestor *c* of sequences *i* and *j*, if the branch vectors are orthogonal (Fig. 9a). This is proved by the relation S_{i, j}(*o*) = R_{o, i}·R_{o, j} = (R_{o, c} + R_{c, i})·(R_{o, c} + R_{c, j}) = R_{o, c}^{2} = D_{o, c}, sinceR_{o, c}⊥R_{c, j,} R_{o, c}⊥R_{c, i,}and R_{c, i}⊥R_{c, j}. Here, D_{i, j}is the pairwise difference between the two sequences *i* and *j* and is define das equal to R_{i, j}^{2}. The branch length can be rewritten as S_{i, j}(*o*) = R_{o, i}R_{o, j} cos(θ_{i, j}), with the angle θ_{i, j} between the vectors R_{o, i} and R_{o, j}. Using the cosine theorem (R_{i, j}^{2} = R_{o, i}^{2} + R_{o, j}^{2}-2R_{o, i} R_{o, j}cos(θ_{i, j})) about the triangle (*o*, *i*, *j*) gives the equation,

S_{i, j}(*o*) = (D_{o, i} + D_{o, j} - D_{i, j})/2. (1)

MVS representation of a tree is expressed using two other probes, *a* and *b*. We make a scatter diagram in terms of S_{i, a}(*o*) and S_{i, b}(*o*), which are taken to be the x and y values, respectively. Figure 9 illustrates the relationship between the standard tree expression (Fig. 9b) and the corresponding MVS representation (Fig. 9c). In Fig. 9c, *o = 1*, *a = 11*, and *b = 14* were taken as a triplet of probes. Here, all sequences (solid circles) lie on the three lines A, B, and C (hereafter called the "orthogonal lines"), and fulfil the following equations for orthogonality: S_{11,8}(*1*) *=* S_{11,9}(*1*) = S_{14,8}(*1*) *=* S_{14,9}(*1*) = D_{1,2}; S_{11,10}(*1*) *=* S_{14,10}(*1*) = D_{1,3}; and S_{11,13}(*1*) = S_{11,14}(*1*) = S_{14,12}(*1*) = D_{1,4}. Endpoints (8–10), 12 and 13 are arrayed along the orthogonal lines in the order of branching from the probes, because their branch vectors are orthogonal to the probe vectors R_{1,11} and R_{1,14}. Points (*8*–*10*) outside the common ancestor *4* of the probes *11* and *14* lie on the diagonal line A. Point *12*, closest to probe *11*, is located on line B, and point *13*, closest to probe *14*, is located on line C. Points *8* and *9* have been degenerated into the position of node *2*. The degeneracy can be resolved using other triplets of probes. In this way, the tree structure is analysed using a variety of triplets.

For a practical inference of the phylogeny, the pairwise distances must be estimated initially. Because the core set approach focuses on describing the evolutionary process within a species, the initial distances can be approximated well by the pairwise differences. Any multiple substitutions that have occurred on the evolutionary pathway between the two sequences cause an underestimate of the corresponding pairwise difference, which violates orthogonality. For example, when we consider parallel changes between the two branches (*3*)–(*10*) and (*7*)–(*11*), MVS representation expresses these changes as a deviation of endpoint *10* from the diagonal line A, as shown by the triangle in Fig. 9c.

The deviation of a sequence *i* is defined as the distance from the position of this sequence to the nearest orthogonal line in each triplet (*o*, *a*, *b*). Because S_{a, b}(*o*) gives the y value of the line B and the x value of line C in Fig. 9c, the deviation of *i* is expressed as follows:

V_{o, a, b, i} = | S_{b, i}(*o*) - S_{a, i}(*o*) | + E (S_{a, i}(*o*)), if S_{a, i}(*o*) < S_{a, b}(*o*) and S_{b, i}(*o*) < S_{a, b}(*o*)

= | S_{b, i}(*o*) - S_{a, b}(*o*) | + E (S_{a, i}(*o*)), if S_{a, i}(*o*) > S_{a, b}(*o*) and S_{a, i}(*o*) > S_{b, i}(*o*)

= | S_{a, i}(*o*) - S_{a, b}(*o*) | + E (S_{b, i}(*o*)), if S_{b, i}(*o*) > S_{a, b}(*o*) and S_{b, i}(*o*) > S_{a, i}(*o*). (2)

E(x) = 0 if x is an integer, and E(x) = 0.5 if x is a half-integer. The total deviation W of a system is given by summing all possible quartets of *o*, *a*, *b*, and *i*.

W = ∑_{i}W_{i} = ∑_{i} ∑_{o, a, b} V_{o, a, b, i}. (3)

### Core set approach

Our core set approach is achieved by an iteration of the following two procedures (a) and (b):

#### (a) Exclusion of sequences with deviations from the orthogonality

The first procedure identifies the sequence that gives the largest deviation (W_{i} value in Eq. (3)), and then excludes it. By repeating this procedure until any W_{i} values become zero, we obtain a P-core set of sequences that satisfies the orthogonality. The obtained core-set tree contains no correlated mutations and satisfies the compatibility. Therefore, it can be strictly reproduced by existing methods (MP, SP, ML and NJ). Each node sequence in this tree can be easily determined using three sequences which are of the nearest neighbour to this node.

#### (b) Insertion of excluded sequences based on a min-max criterion

The second procedure is to insert the excluded sequences into the P-core set tree, using a min-max criterion. By depositing each of these sequences into this tree, we can select a triplet (*o*, *a*, *b*) of probes that gives a maximum deviation, maxD, for this sequence, since the core set sequences satisfy the orthogonality perfectly. The sequence with the minimum value of maxD is the candidate. If PAUP* (HSEARCH with option of RANDOM) estimates a single MP tree, the expanded set becomes an updated core set and becomes the "current core set" in the next step. If not, the sequence is suspended in the remaining sequences and the sequence with the second smallest maxD becomes the candidate. Once the core set is updated, the maximum deviation from the updated core set becomes a new maxD values. This process continues as far as the core set is updated. As a result, we obtain the S-core set. We further expand the core set with the same process except that the tree of the current core set is regarded as real and fixed. If the phylogenetic location of the candidate sequence is uniquely determined by maximum parsimony, the sequence is added to the current core set resulting in the updated core set. Continuing this process until the core set cannot be updated, we obtain the T-core set. One insertion creates one node, the sequence of which is determined using the sequence to be inserted and the two node or endpoint sequences of the nearest neighbour. Here, the distances between this sequence and the other nodes and endpoints are modified so as to satisfy the additivity (W_{i} = 0). Consequently, this sequence is identified as a core set member. The procedure of T-core set tree-building is so useful that the S-core set tree may be almost constructed by this procedure and finally confirmed by using the PAUP*. Software is available using Additional file 5.

#### (c) Local core sets to remove long-branch attractions

The first performance of the above-mentioned two procedures provides a "global core set" containing the maximum number of sequences for which MP has a single solution. However, attractions among long branches may make it difficult to further incorporate the excluded sequences into the global core set. To solve this problem, we decompose the global core set tree into two subgroups (each subgroup has to include more than three sequences). All the excluded sequences are merged into the first subgroup of core-set member and the procedure (a) is applied to the merged group (first subgroup + excluded sequence) with keeping a condition of preserving the tree structure of the global core-set. Then, the sequences that satisfy the orthogonality within this group are drawn out to construct a local P-core set. Here, when added sequences form the outside branches (outgroup) in this subgroup tree, they are suspended. We next insert the remaining excluded sequences by using the procedure (b), and get a local S-core set in which MP gives a unique tree within this subgroup. The two procedures are applied also to the second subgroup of core set. Here, the sequences inserted, as a result, into the two subgroups are suspended. The two procedures are applied to all possible pairs of subgroups. The procedure of T-core set tree-building is finally applied to the whole system for the remaining excluded sequences.

We furthermore decompose the global core set tree into more than three subgroups, and repeat the above-mentioned procedure. As a result, we obtain a maximum set of sequences in which there is no more sequence to be incorporated at a certain number of subgroups. It is here noted that MP proposes multiple solutions for the sequences containing the local S-core sets due to long-branch attractions.

#### (d) Remaining sequences with two phylogenetic positions.

The core set approach detected through practical applications that the remaining sequences despite the procedures (a)-(c) have two phylogenetic positions. The following three patterns were found for these unplaced sequences. Figure 7 shows three branching patterns that an ancestral sequence *1* with a sequence of the same nucleotide (A, adenine) splits into three sequences of *2, 3* and *4*. Here, A *n→* C denotes a substitution from A to C at the *n th* site. The dotted lines denote alternative positions of sequence *4*. One of two sites is shared in Fig. 7a. Two of three sites are shared in Fig. 7b. Two of four sites are shared in Fig. 7c. In this situation, sequence *4* cannot be uniquely placed by the maximum parsimony criterion. However, since the alternative positions is explicitly indicated, we can estimate the nucleotide transitions in all possible branching patterns with the variance around their average value.

#### (e) CPU time of numerical calculations.

The majority of the CPU time is the calculation of the deviation index value, which includes the four DO-LOOPS in a FORTRAN program with double precision. This means that the CPU is proportional to N^{4} with the number N of sequences. It took about 12 seconds for N = 36 sequences and 8 hours for N = 198 sequences with a standard personal computer (for instance, Pentium 2GHz). Therefore, the proportional constant is about 50000^{-1} seconds. On the other hand, we used the PAUP* for the MP method, but this software has a limit (the number of sequences is less than 14) in the "exhaust" option in which all the possible topologies can be searched. Since it is well known that the number of topologies explodes with increase of N, our approach has an advantage in this respect.

## References

- 1.
Bush RM, Bender CA, Sbbararo K, Cox NJ, Fitch WM: Predicting the evolution of human influenza A. Science. 1999, 268: 1921-1925. 10.1126/science.286.5446.1921.

- 2.
Frost SDW, Günthard HF, Wong JK, Havlir D, Richman DD, Leigh Brown AJ: Evidence for positive selection driving the evolution of HIV-1

*env*under potent therapy. Virology. 2001, 284: 250-258. 10.1006/viro.2000.0887. - 3.
de S Leal E, Holmes EC, PM de A Zanotto : Distinct patterns of natural selection in the reverse transcriptase gene of HIV-1 in the presence and absence of antiretroviral therapy. Virology. 2004, 325: 181-191. 10.1016/j.virol.2004.04.004.

- 4.
Pagel M: Inferring the historical patterns of biological evolution. Nature. 1999, 401: 877-884. 10.1038/44766.

- 5.
Ross HA, Rodrigo A: Immune-mediated positive selection drives human immunodeficiency virus type 1 molecular variation and predicts disease duration. J Virol. 2002, 76: 11715-11719. 10.1128/JVI.76.22.11715-11720.2002.

- 6.
Williamson S: Adaptation in the

*env*gene of HIV-1 and evolutionary theories of disease progression. Mol Biol Evol. 2003, 20: 1318-1325. 10.1093/molbev/msg144. - 7.
Yang Z, Nielsen R, Goldman N, Pedersen AMK: Codon-substitution models for heterogeneous selection pressure at amino acid sites. Genetics. 2000, 155: 431-449.

- 8.
Fitch WM: Toward defining the course of evolution: minimum change for a specified tree topology. Systematic Zoology. 1971, 20: 406-416. 10.2307/2412116.

- 9.
Swofford DL: PAUP* Phylogenetic Analysis Using Parsimony (* and Other Methods). ver 4.0b10. 2002, Sinauer Associates

- 10.
Saitou N, Nei M: The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol Biol Evol. 1987, 4: 406-425.

- 11.
Adachi J, Hasegawa M: MOLPHY: Programs for molecular phylogenetics. ver 2.3. 1996, Institute of Statistical Mathematics, Tokyo

- 12.
Felsenstein J: PHYLIP. ver 3.572. 1996, Univ of Washington Seattle

- 13.
Templeton AR, Crandall KA, Sing CF: A cladistic analysis of phenotypic associations with haplotypes inferred from restriction endonuclease mapping and DNA sequence data. III. Cladogram estimation. Genetics. 1992, 132: 619-633.

- 14.
Antunes A, Templeton AR, Guyomard R, Alexandrino P: The role of nuclear genes in intraspecific evolutionary inference: genealogy of the transferrin gene in the brown trout. Mol Biol Evol. 2002, 19: 1272-87.

- 15.
Crandall KA: Intraspecific cladogram estimation: accuracy at higher levels of divergence. Sys Biol. 1994, 43: 222-235.

- 16.
Crandall KA, Templeton AR: Applications of intraspecific phylogenies. New uses for new phylogenies. Edited by: Harvey P, Brown AJL, Smith JM, Nee S. 1996, Oxford University Press. Oxford, U. K, 81-99.

- 17.
Templeton AR, Sing CF: A cladistic analysis of phenotypic associations with haplotypes inferred from restriction endonuclease mapping. IV. Nested analyses with cladogram uncertainty and recombination. Genetics. 1993, 134: 659-669.

- 18.
Bandelt HJ, Dress AW: Split decomposition: a new and useful approach to phylogenetic analysis of distance data. Mol Phylogenet Evol. 1992, 1: 242-52. 10.1016/1055-7903(92)90021-8.

- 19.
Excoffier L, Smouse PE: Using allele frequencies and geographic subdivision to reconstruct gene trees within a species: Molecular variance parsimony. Genetics. 1994, 136: 343-359.

- 20.
Fitch WM: Networks and viral evolution. J Mol Evol. 1997, 44 (Suppl 1): S65-75. 10.1007/PL00000059.

- 21.
Johnson JB, Jordan S: Phylogenetic divergence in leatherside chub (

*Gila copei*) inferred from mitochondrial cytochrome b sequences. Molecular Ecology. 2000, 9: 1029-1035. 10.1046/j.1365-294x.2000.00960.x. - 22.
Templeton AR, Routman E, Phillips CA: Separating population structure from population history: a cladistic analysis of geographical distribution of mitochondrial DNA haplotypes in the tiger salamander. Ambystoma tigrimum Genetics. 1995, 140: 767-782.

- 23.
Templeton AR: Nested clade analyses of phylogeographic data: testing hypotheses about gene flow and population history. Molecular Ecology. 1998, 7: 381-397. 10.1046/j.1365-294x.1998.00308.x.

- 24.
Turner TF, Trexler JC, Harris JL, Haynes JL: Nested cladistic analysis indicates population fragmentation shapes genetic diversity in a freshwater mussel. Genetics. 2000, 154: 777-785.

- 25.
Kitazoe Y, Kurihara Y, Narita Y, Okuhara Y, Tominaga A, Suzuki T: A New Theory of Phylogeny Inference Through Construction of Multidimensional Vector Space. Mol Biol Evol. 2001, 18: 812-828.

- 26.
Kitazoe Y, Kishino H, Okabayashi T, Watabe T, Nakajima N, Okuhara Y, Kurihara Y: Multidimensional vector space representation for convergent evolution and molecular phylogeny. Mol Biol Evol. 2005, 22: 704-715. 10.1093/molbev/msi051.

- 27.
Shankarappa R, Margolick JB, Gange SJ, Rodrigo AG, Upchurch D, Farzadegan H, Gupta P, Rinaldo CR, Learn GH, He X, Huang XL, Mullins JI: Consistent viral evolutionary changes associated with the progression of human immunodeficiency virus type 1 infection. J Virol. 1999, 73: 10489-10502.

- 28.
Clement M, Posada D, Crandall KA: PROGRAM NOTE. TCS: a computer program to estimate gene genealogies. Molecular Ecology. 2000, 9: 1657-1659. 10.1046/j.1365-294x.2000.01020.x.

- 29.
Sheridan I, Pybus OG, Holmes EC, Klenerman P: High resolution phylogenetic analysis of hepatitis C virus adaptation and its relationship to disease progression. J Virol. 2004, 78: 3447-3454. 10.1128/JVI.78.7.3447-3454.2004.

- 30.
O'Loughlin S, Walton C: to be submitted

- 31.
Walton C, Handley JM, Tun-Lin W, Collins FH, Harback RE, Baimai V, Butlin RK: Population structure and population history of

*Anopheles dirus*mosquitoes in Southeast Asia. Mol Biol Evol. 2000, 17: 962-974. - 32.
Templeton AR, Clark AG, Weiss KM, Nickerson DA, Boerwinkle E, Sing CF: Recombinational and mutational hotspots within the human lipoprotein lipase gene. Am J Hum Genet. 2000, 66: 69-83. 10.1086/302699.

- 33.
Templeton AR, Reichert RA, Weisstein AE, Yu XF, Markham RB: Selection in context: patterns of natural selection in the glycoprotein 120 region of human immunodeficiency virus 1 within infected individuals. Genetics. 2004, 167: 1547-61. 10.1534/genetics.103.023945.

## Acknowledgements

We thank Jeffrey L. Thorne for his constructive criticism and insights. This work was supported by the Institute for Bioinformatics Research and Development (BIRD) of the Japan Science and Technology Agency (JST) and the Japan Society for the Promotion of Science (JSPS).

## Author information

## Additional information

### Authors' contributions

TO performed the majority of phylogenetic analysis and drafted the manuscript. YK, HK, TW, NN and YO performed some analysis of the data. SO and CW provided mosquito data, and assisted in biological interpretation. YK and HK participated in project coordination and the writing of the manuscript. All authors read and approved the final manuscript.

## Electronic supplementary material

### Additional File 2: The core set tree compared to the combined ML and MP. The topology was determined by ML using the 38 sequences of HIV-1, whereas the node sequences were obtained using MP. This tree pattern given by the two-step procedure broke the core set (Fig. 2a and 2b. (PDF 40 KB)

### Additional File 3: 164 sequences used in Fig. 4b as one event of evolutionary simulation (NEX 85 KB)

### Additional File 4: The core set tree compared to the combined ML and MP. The topology was determined by ML using the 164 haplotypes of Fig. 4a, whereas the node sequences were obtained using MP. For simplicity, the branch pattern (a) of the two sub-groups 7–1 and 7–2 in Fig. 4a was compared with that given by the two-step procedure (b), in which the lineage of the *207th*, *242nd*, *284th* and *245th* haplotypes moved from group 7–2 to group 7–1. (PDF 54 KB)

## Authors’ original submitted files for images

Below are the links to the authors’ original submitted files for images.

## Rights and permissions

## About this article

### Cite this article

Okabayashi, T., Kitazoe, Y., Kishino, H. *et al.* Core set approach to reduce uncertainty of gene trees.
*BMC Evol Biol* **6, **41 (2006) doi:10.1186/1471-2148-6-41

#### Received

#### Accepted

#### Published

#### DOI

### Keywords

- Maximum Parsimony
- Statistical Parsimony
- Multiple Substitution
- Nucleotide Transition
- Node Sequence