 Research article
 Open Access
Birth and death of protein domains: A simple model of evolution explains power law behavior
 Georgy P Karev^{1},
 Yuri I Wolf^{1},
 Andrey Y Rzhetsky^{2},
 Faina S Berezovskaya^{3} and
 Eugene V Koonin^{1}Email author
https://doi.org/10.1186/14712148218
© Karev et al; licensee BioMed Central Ltd. 2002
 Received: 3 September 2002
 Accepted: 14 October 2002
 Published: 14 October 2002
Abstract
Background
Power distributions appear in numerous biological, physical and other contexts, which appear to be fundamentally different. In biology, power laws have been claimed to describe the distributions of the connections of enzymes and metabolites in metabolic networks, the number of interactions partners of a given protein, the number of members in paralogous families, and other quantities. In network analysis, power laws imply evolution of the network with preferential attachment, i.e. a greater likelihood of nodes being added to preexisting hubs. Exploration of different types of evolutionary models in an attempt to determine which of them lead to power law distributions has the potential of revealing nontrivial aspects of genome evolution.
Results
A simple model of evolution of the domain composition of proteomes was developed, with the following elementary processes: i) domain birth (duplication with divergence), ii) death (inactivation and/or deletion), and iii) innovation (emergence from noncoding or nonglobular sequences or acquisition via horizontal gene transfer). This formalism can be described as a b irth, d eath and i nnovation m odel (BDIM). The formulas for equilibrium frequencies of domain families of different size and the total number of families at equilibrium are derived for a general BDIM. All asymptotics of equilibrium frequencies of domain families possible for the given type of models are found and their appearance depending on model parameters is investigated. It is proved that the power law asymptotics appears if, and only if, the model is balanced, i.e. domain duplication and deletion rates are asymptotically equal up to the second order. It is further proved that any power asymptotic with the degree not equal to 1 can appear only if the hypothesis of independence of the duplication/deletion rates on the size of a domain family is rejected. Specific cases of BDIMs, namely simple, linear, polynomial and rational models, are considered in details and the distributions of the equilibrium frequencies of domain families of different size are determined for each case. We apply the BDIM formalism to the analysis of the domain family size distributions in prokaryotic and eukaryotic proteomes and show an excellent fit between these empirical data and a particular form of the model, the secondorder balanced linear BDIM. Calculation of the parameters of these models suggests surprisingly high innovation rates, comparable to the total domain birth (duplication) and elimination rates, particularly for prokaryotic genomes.
Conclusions
We show that a straightforward model of genome evolution, which does not explicitly include selection, is sufficient to explain the observed distributions of domain family sizes, in which power laws appear as asymptotic. However, for the model to be compatible with the data, there has to be a precise balance between domain birth, death and innovation rates, and this is likely to be maintained by selection. The developed approach is oriented at a mathematical description of evolution of domain composition of proteomes, but a simple reformulation could be applied to models of other evolving networks with preferential attachment.
Keywords
 Horizontal Gene Transfer
 Domain Family
 Generalize Pareto Distribution
 Innovation Rate
 Equilibrium Size
Background
Sequencing of numerous genomes from all walks of life, including multiple representatives of diverse lineages of bacteria, archaea and eukaryotes, creates unprecedented opportunities for comparativegenomic studies [1–3]. One of the mainstream approaches of genomics is comparative analysis of protein or domain composition of predicted proteomes [2, 4, 5]. These studies often concentrate on domains rather than entire proteins because many proteins have variable multidomain architectures, particularly in complex eukaryotes (throughout this work, we use the term domain to designate a distinct evolutionary unit of proteins, which can occur either in the standalone form or as part of multidomain architectures; often but not necessarily, such a unit corresponds to a structural domain). As soon as genome sequences of bacteria became available, it has been shown that a substantial fraction of the genome of each species, from approximately one third in bacteria with very small genomes, to a significant majority in species with larger genomes, consists of families of paralogs, genes that evolved via gene duplication at different stages of evolution [6–9]. Again, a comprehensive analysis of paralogous relationships between genes is probably best performed at the level of individual protein domains, first, because many proteins share only a subset of common domains, and second, because domains can be conveniently and with a reasonable accuracy detected using available collections of domainspecific sequence profiles [10–12]. Comparisons of domain repertoires revealed both substantial similarities between different species, particularly with respect to the relative abundance of housekeeping domains, and major differences [4, 5]. The most notable manifestation of such differences is lineagespecific expansion of protein/domain families, which probably points to unique adaptations [13, 14]. Furthermore, it has been demonstrated that more complex organisms, e.g. vertebrates, have a greater variety of domains and, in general, more complex domain architectures of proteins than simpler life forms [1, 2].
Lineagespecific expansions and gene loss events detected as the result of comparative analysis of the domain compositions of different proteomes have been examined mostly at a qualitative level, in terms of the underlying biological phenomena, such as adaptation associated with expansion or coordinated loss of functionally linked sets of genes [15]. A complementary approach involves quantitative comparative analysis of the frequency distributions of proteins or domains in different proteomes. Several studies pointed out that these distributions appeared to fit the power law: P(i) ≈ ci^{γ} where P(i) is the frequency of domain families including exactly i members, c is a normalization constant and γ is a parameter, which typically assumes values between 1 and 3 [16–19]. Obviously, in doublelogarithmic coordinates, the plot of P as a function of i is a straight line with a negative slope. Power laws appear in numerous biological, physical and other contexts, which seem to be fundamentally different, e.g. distribution of the number of links between documents in the Internet, the population of towns or the number of species that become extinct within a year. The famous Pareto law in economics describing the distribution of people by their income and the Zipf law in linguistics describing the frequency distribution of words in texts belong in the same category [20–29]. Recent studies suggested that power laws apply to the distributions of a remarkably wide range of genomeassociated quantities, including the number of transcripts per gene, the number of interactions per protein, the number of genes or pseudogenes in paralogous families and others [30].
Power law distributions are scalefree, i.e. the shape of the distribution remains the same regardless of scaling of the analyzed variable. In particular, scalefree behavior has been described for networks of diverse nature, e.g. the metabolic pathways of an organism or infectious contacts during an epidemic spread [20, 25–27]. The principal pattern of network evolution that ensures the emergence of power distributions (and, accordingly, scalefree properties) is preferential attachment, whereby the probability of a node acquiring a new connection increases with the number of connections this node already has.
However, a recent thorough study suggested that many biological quantities claimed to follow power laws, in fact, are better described by the socalled generalized Pareto function: P(i) = c(i+a)^{γ} where a is an additional parameter [31]. Obviously, although at i >>a, a generalized Pareto distribution becomes indistinguishable from a power law, at small i, it deviates significantly, the magnitude of the deviation depending on the value of a. Furthermore, unlike power law distributions, generalized Pareto distributions do not show scalefree properties.
The importance of the analysis of frequency distributions of domains or proteins lies in the fact that distinct forms of such distributions can be linked to specific models of evolution. Therefore, by exploring the distributions, inferences potentially can be made on the mode and parameters of genome evolution. For this purpose, the connections between domain frequency distributions and evolutionary models need to be explored theoretically within a maximally general class of models. In this work, we undertake such a mathematical analysis using simple models of evolution, which include duplication (birth), elimination (death) and de novo emergence (innovation) of domains as elementary processes (hereinafter BDIM, birth death innovation models). All asymptotics of equilibrium frequencies of domain families of different size possible for BDIM are identified and their dependence on the parameters of the model is investigated. In particular, analytical conditions on birth and death rates that produce power asymptotics are determined. We prove that the power law asymptotics appears if, and only if, the model is balanced, i.e. domain duplication and deletion rates are asymptotically equal up to the second order, and that any power asymptotic with the degree not equal to 1 can appear only if the assumption of independence of the duplication/deletion rates on the size of a domain family is rejected. We apply the developed formalism to the analysis of the frequency distributions of domains in individual prokaryotic and eukaryotic genomes and show a good fit of these data to a particular version of the model, the secondorder balanced linear BDIM.
Results and Discussion
Mathematical theory and model
Fundamental definitions and assumptions
A genome is treated as a "bag" of coding sequence for protein domains, which we simply call domains for brevity. Domains are treated as independently evolving units disregarding the dependence between domains that tend to belong to the same multidomain protein. Each domain is considered to be a member of a family (including singlemember families). We consider three types of elementary evolutionary events: i) domain birth, which generates a new member within a family; the principal mechanism of birth is duplication with divergence but additional mechanisms may be considered, including acquisition of a family member from a different species via horizontal gene transfer [32], ii) domain death, which results from domain inactivation and/or deletion, and c) domain innovation, which generates a new family with one member. Innovation may occur via horizontal gene transfer from another species, via domain evolution from a noncoding sequence or a sequence of a nonglobular protein, or via major change of a domain from a preexisting family after a duplication, which makes the relationship between the given domain and its family of origin undetectable (this latter process formally combines domain birth, death and innovation in a single event). The innovation rate (ν), is considered constant for a given genome. The rates of elementary events are considered to be independent of time (i. e. only homogeneous models are considered) and of the nature (structure, biological function etc.) of individual families.
The formulation of the model
The simple BDIM
Let us formulate the following independence assumption: i) all elementary events are independent of each other; ii) the rates of individual domain birth (λ) and death (δ) do not depend on i (number of domains in a family). Under this assumption, the instantaneous rate, at which a domain family leaves class i, is proportional to i and the following simple BDIM describes the evolution of such a system of domain family classes:
df_{1}(t)/dt = (λ + δ) f_{1}(t) + 2δf_{2}(t) + ν
df_{ i }(t)/dt = (i  1)λf_{i1}(t)  i(λ + δ)f_{ i }(t) + (i + 1) δ f_{i+1}(t) for 1<i<N, (2.1)
df_{ N }(t)/dt = (N  1)λ f_{N1}(t)  N δ f_{ N }(t).
Similar models have been considered previously in several different contexts [33 v. 1, ch. 17, 34]. We will see in 3.2 that the solution of model (2.1) evolves to equilibrium, with a unique distribution of domain family sizes, f_{ i }~(λ/δ)^{ i }/i; in particular, if λ = δ, then f_{ i }~1/i. Thus, under the simple BDIM, if the birth rate equals the death rate, the abundance of a domain class is inversely proportional to the size of the families in this class. When the observations do not fit this particular asymptotic (as observed in several studies on distributions of protein family sizes), a different, more general model needs to be developed.
The Master BDIM
A more general BDIM emerges when the independence assumption is abandoned. Instead of constructing specific hypotheses regarding the dependence between the elementary events, let us simply suppose that the domain birth and death rates for a family of class i do not necessarily show proportionality to i. For the general case, we designate these rates, respectively, λ_{ i } and δ_{ i }; in the specific case of the simple BDIM (2.1), λ_{ i } = λi and δ_{ i } = δi. Then we have the following master BDIM:
df_{1}(t)/dt = (λ_{1} + δ_{1})f_{1}(t) + δ_{2}f_{2}(t) + ν
df_{ i }(t)/dt = λ_{i1}f_{i1}(t)  (λ_{ i } + δ_{ i })f_{ i }(t) + δ_{i+1}f_{i+1}(t) for 1<i<N, (2.2)
df_{ N }(t)/dt = λ_{N1}f_{N1}(t)  δ_{ N }f_{ N } (t).
Let F(t)= f_{ i }(t) be the total number of domain families at instant t; it follows from (2.2) that
dF(t)/ dt = ν  δ_{1}f_{1}(t) (2.3)
The system (2.2) has an equilibrium solution f_{1},...f_{ N } defined by the equality df_{ i }(t)/dt = 0 for all i; this solution is described below under Proposition 1. Accordingly, there exists an equilibrium solution of equation (2.3), which we will designate F_{ eq } (the total number of domain families at equilibrium). At equilibrium, ν = δ_{1}f_{1}, i.e. the processes of innovation and death of single domains (more precisely, the death of domain families of class 1, i.e. singletons) are balanced.
We can rewrite the model (2.2) in terms of the frequency of a domain family of class i p_{ i }(t) = f_{ i }(t)/F(t). Let x(t) = y(t)/Y(t); then
dx/dt = [dy/dt /y  dY/dt /Y] x.
Applying this identity to p_{ i }(t) and rewriting equation (2.3) in the form
[dF(t)/dt]/F(t) = ν/F(t)  δ_{1}p_{1}(t) (2.3')
we obtain the following model for frequencies of the domain family (master BDIM for frequencies), which is equivalent to (2.2):
dp_{1}(t)/dt = (λ_{1} + δ_{1})p_{1}(t) + δ_{2}p_{2}(t) + ν/F(t)  (ν/F(t)  δ_{1}p_{1}(t))p_{1}(t), (2.4)
dp_{ i }(t)/dt = λ_{i1}p_{i1}(t)  (λ_{ i } + δ_{ i })p_{ i }(t) + δ_{i+1}p_{i+1}(t)  (ν/F(t)  δ_{1}p_{1}(t)) p_{ i }(t) for 1<i<N,
dp_{ N }(t)/dt = λ_{N1}p_{N1}(t)  δ_{ N }p_{ N } (t)  (ν/F(t)  δ_{1}p_{1}(t))] p_{ N }(t).
System (2.4) should be solved together with equation (2.3).
The Master BDIM and Markov processes
Let us note that system (2.4) for frequencies is nonlinear, so it is not a system of Kolmogorov equations for state probabilities of any homogeneous Markov process. Let us further suppose that a genome had ample time to arrive at an equilibrium with respect to the total number of domain families, such that F(t) = F_{ eq }. This does not imply dp_{ i }(t)/dt = 0 or df_{ i }(t)/dt = 0; in other words, the system might rearrange the frequencies of individual families, although the total number of families remains stable. If F(t) = F_{ eq }, the master system (2.4) turns into
d p_{1}(t)/dt = (λ_{1} + δ_{1}) p_{1}(t) + δ_{2}p_{2}(t) + ν/F_{ eq } (2.5)
d p_{ i }(t)/dt = λ_{i1}p_{i1}(t)  (λ_{ i } + δ_{ i }) p_{ i }(t) + δ_{i+1}p_{i+1}(t) for 1<i<N,
d p_{ N }(t)/dt = λ_{N1}p_{N1}(t)  δ_{ N }p_{ N } (t).
System (2.5) can be rewritten as a matrix equation
d p(t)/dt = p(t)Q,
where p(t) = {p_{1}(t),...p_{ N }(t)} and the matrix Q = (q_{ ij }) is defined by equalities
q_{11} = (λ_{1} + δ_{1}) + ν/F_{ eq }, q_{21} = δ_{2} + ν/F_{ eq }, q_{s 1} = ν/F_{ eq } for all s > 2;
q_{i1,i} = λ_{i1}, q_{i,i} = (λ_{ i } + δ_{ i }), q_{i+1,i} = δ_{i+1}, q_{k,i} = 0 for all k, ik > 1, i = 2,...N1,
q_{N1,N} = λ_{N1}, q_{N,N} = δ_{ N }.
It is easy to see that the sum of elements of each row (except for the first one) of the matrix Q is equal to ν/F_{ eq } > 0. Therefore the matrix Q cannot be a matrix of transition rates for any Markov process (the sum of elements of each row of a matrix of transition rates for Markov process with continuous time should be nonpositive [33 v. 1, ch.17, s. 8, 35 v. 2, ch. 3, s. 2]; in other words, there is no Markov process with continuous time and state space {1,2,...N} whose state probabilities satisfy system (2.5).
Thus, neither the initial BDIMs (2.1) or (2.2) nor the equilibrium model (2.5) can be described by any Markov process with continuous time.
Remark. If, in system (2.5), ν = 0, then this system turns into a system of state probabilities for a Markov birth and death process with continuous time.
Equilibrium in BDIMs
Equilibrium sizes and frequencies of the domain family system

(λ_{1} + δ_{1}) f_{1} + δ_{2}f_{2} + ν = 0,
λ_{i1}f_{i1}  (λ_{ i } + δ_{ i })f_{ i } + δ_{i+1}f_{i+1} = 0 for 1<i<N, (3.1)
λ_{N1}f_{N1}  δ_{ N }f_{ N } = 0.
It should be emphasized that the master model does not assign a priori the value of F_{ eq }; this value has to be computed depending on the model parameters.
The following statement is central for further analysis.
Proposition 1. The master BDIM (2.2) has a unique equilibrium state (f_{1},...f_{ N }), which is the sole solution of system (3.1):
f_{1} = ν/δ_{1}
f_{ i } = ν λ_{ j } / δ_{ j } for all i = 2,...N. (3.2)
The unique equilibrium state (3.2) is globally asymptotically stable.
In addition (formally assuming λ_{ j } = 1 for i = 1),
F_{ eq } = ν ( λ_{ j } / δ_{ j } (3.3)
This proposition ascertains that all evolutionary trajectories of the system (2.2) exponentially (with respect to time) approach the equilibrium state (3.2). The proof is given in the Mathematical Appendix.
Remark. Let us denote the ratio of the birth rate to the innovation rate
G(N) ≡ λ_{ i }f_{ i }/ν,
and the ratio of the death rate to the innovation rate
I(N) ≡ δ_{ i }f_{ i }/ν.
Then, according to Proposition 1, for any BDIM in equilibrium,
G(N)  I(N) = λ_{ j } / δ_{ j }  λ_{ j }/δ_{ j }  1 = 1.
The principal goal of the treatment that follows is the analysis of the asymptotic behavior of equilibrium frequencies and sizes of domain families (f_{1},...f_{ N }) at large N. We will differentiate two cases of asymptotic behavior according to the following
Definition. Let {q_{ i }}, {s_{ i }} be sequences of real numbers; let us denote q_{ i } s_{ i } if lim q_{ i }/s_{ i } = 1 and q_{ i } ~ s_{ i } if lim q_{ i }/s_{ i } = c = const and 0<c<∞. We will also use this notation for finite but sufficiently long sequences.
Equilibrium frequencies for the simple BDIM
Let us apply Proposition 1 to the simple BDIM (2.1) with λ_{ i } = λi, δ_{ i } = δi.
Definition. A simple BDIM is balanced if θ = λ/δ = 1, i.e. if the rates of individual domain birth and death are equal.
Let us recall that a random discrete variable ξ has the logarithmic distribution with parameter θ < 1 if
P(ξ = i) = θ^{ i }/i [ln(1θ)]^{1}, i = 1,2,...
A random variable ξ has the truncated logarithmic distribution with parameter θ if
P (ξ = i) = C_{ n } θ^{ i } / i, i = 1,2,...n, C_{ n } = 1/ θ^{ j }/j.
Then, we have
Proposition 2.Para>1) For any simple BDIM (2.1)
f_{ i } = (ν/δ)θ^{i1}/i = (ν/λ)θ^{ i }/i, (3.4)
F_{ eq } = f_{ i } = ν/δ θ^{j1}/j, (3.5)
and
p_{ i } = (1/F_{ eq })(ν/δ)θ^{i1}/i = (θ^{ i }/i) / θ^{ j }/j (3.6)
that is, the equilibrium frequencies have the truncated logarithmic distribution if θ < 1.
2) If a simple BDIM is balanced, then
F_{ eq } = ν/δ 1/j, (3.7)
and for all i = 1,2,...N
p_{ i } = ν/δF_{ eq }/i = ( 1/j)^{1} / i. (3.8)
The proof is given in the Mathematical Appendix.
Thus, a simple BDIM can have equilibrium frequencies only of the form p_{ i } = C θ^{ i }/i, where C = const and θ is the distribution parameter. In particular, the equilibrium frequencies for a balanced simple BDIM have the power distribution with the degree equal to 1.
Simple methods exist for preliminary graphical estimation of the single distribution parameter θ [36 ch. 7, s. 7]. We will prove in the following section that, if we observe a power asymptotic for empirically observed equilibrium frequencies, then (assuming that the system can be described by a BDIM), the rates λ_{ i } and δ_{ i } should be asymptotically equal at large i. If, additionally, the degree of the asymptotic is not equal to 1, then the system dynamics cannot be described by a simple BDIM. In this case, it is necessary to consider more general models, such as the Master BDIM (2.2).
Asymptotic behavior of equilibrium frequencies of a Master BDIM: Main Theorems
Let us consider the master BDIM (2.2); we showed in 3.1 that its equilibrium frequencies are the solution of the system
 (λ_{1} + δ_{1})p_{1} + δ_{2}p_{2} + ν/F_{ eq } = 0, (3.9)
λ_{i1}p_{i1}  (λ_{ i } + δ_{ i })p_{ i } + δ_{i+1}p_{i+1} = 0 for 1<i<N,
λ_{N1}p_{N1}  δ_{ N }p_{ N } = 0.
The following theorem gives all possible types of asymptotic behavior of the equilibrium frequencies and defines the connections between these asymptotics depending on model parameters. In particular, if there is no information on the exact form of dependence of the rates of birth and death of domains on the size of a domain family, the theorem can be used to qualitatively describe the dynamics of the asymptotic behavior of the equilibrium frequencies.
We will prove that the asymptotic behavior of a solution of system (3.9) is completely defined by the asymptotic relation between λ_{ i } and δ_{ i }. More precisely, let us define a function χ (i)= λ_{i1}/δ_{ i }; we consider only functions of power growth, i.e. χ (i) ~ i^{ s } at i→∞ for a real s. We will see that this is not a serious restriction because the most realistic situations correspond to the case of s = 0. So, let us suppose that, for large i, the following expansion is valid:
χ (i) ≡ λ_{i1}/δ_{ i } = i^{ s } θ (1+a/i + O(1/i^{2})) (3.10)
where s, a are real numbers and θ > 0. Evidently, if s ≠ 0, χ (i) tends either to 0 (s < 0) or to ∞ (s > 0) with the increase of i.
Definition. Let us refer to a BDIM (2.2), (3.10) as
i. nonbalanced, if s ≠ 0;
ii. firstorder balanced, if s = 0 and θ ≠ 1, i.e.
λ_{i1}/δ_{ i } = θ (1+a/i + O(1/i^{2})) at large i; (3.11)
iii. secondorder balanced, if s = 0, θ = 1 and a ≠ 0, i.e.
λ_{i1}/δ_{ i } = 1 + a/i + O(1/i^{2})) for large i; (3.12)
iv. highorder balanced, if s = 0, θ = 1 and a = 0, i.e.
λ_{i1}/δ_{ i } = 1 + O(1/i^{2})) for large i.
We will show that the first three coefficients, s, θ and a, of asymptotic expansion (3.10) for χ (i) = λ_{i1}/δ_{ I } exactly specify all possible asymptotic behaviors of BDIM equilibrium frequencies.
Theorem 1. The equilibrium frequencies p _{ i } of BDIM (2.2) have the following asymptotics
i. if the model is nonbalanced, then
p_{ i } ~ Γ (i)^{ s }θ^{ i }i^{ a }, where Γ (i) is the Γfunction;
ii. if the model is firstorder balanced, then
p_{ i } ~ θ^{ i }i^{ a };
iii. if the model is secondorder balanced, then
p_{ i } ~ i^{ a };
iv. if the model is highorder balanced, then
p_{ i } ~ 1
It follows from this theorem that, if a BDIM is nonbalanced, then its equilibrium frequencies p_{ i } (and equilibrium family sizes f_{ i }) increase or decrease extremely fast (hyperexponentially) with the increase of i. In contrast, if a BDIM has a nonzero order of balance, asymptotic behavior is observed.
Let us recall that a random discrete variable ξ has the Pascal (or negative binomial) distribution with parameters (r,q), r > 0, 0 <q < 1, if P(ξ = k) = Γ(r+k)/[Γ(r) Γ(1+k)] (1q)^{ r }q^{ k }[36]. We will say that sequence {p_{ i }} follows (or asymptotically has) a discrete probabilistic distribution {q_{ i }} if p_{ i } ~ q_{ i } for large enough i.
Corollary 1.For a firstorder balanced BDIM with θ < 1,
i. if a > 1, the equilibrium frequencies p_{ i } follow Pascal distribution with parameters (a+1,θ);
ii. if a = 1, the equilibrium frequencies follow truncated logarithmic distribution with parameter θ;
iii. if a = 0, the equilibrium frequencies follow geometric distribution with parameter θ.
The following implication of Theorem 1 is of principal interest.
Corollary 2. Equilibrium frequencies of a BDIM have a power asymptotic behavior if and only if the BDIM is secondorder balanced.
Corollary 3. For highorder balanced BDIM, if λ_{i1}/δ_{ i } = 1 for all i, the only possible distribution of equilibrium frequencies is uniform, p_{ i } = const for all i. Moreover, even if λ_{i1}/δ_{ i } = 1 + O(1/i^{2}), the equilibrium frequencies asymptotically tend to the uniform distribution.
Rational BDIM
Let us suppose that the birth and death rates are of the form
λ_{ i } = λ P(i) = λ (i + a_{ k })^α_{ k }, (4.1)
δ_{ i } = δ Q(i) = δ (i + b_{ k })^β_{ k }
for i > 0, where λ, δ are positive constants, α_{ k }, β_{ k } are real and a_{ k }, b_{ k } are nonnegative for all k = 1,...N.
We will refer to BDIM (2.2.), (4.1) as rational BDIM.
It is known that a wide class of mathematical functions can be well approximated by rational functions of the form (4.1) (see, e.g. [37]).
Specific cases of the rational BDIM are simple BDIM with P(i) = i, Q(i) = i, linear BDIM with P(i) = i + a_{1}, Q(i) = i + b_{1}, where a_{1}, b_{1} are constants, and polynomial BDIM, if P(i) and Q(i) are polynomials on i.
The following theorem describes all possible asymptotic behaviors of the equilibrium frequencies of a rational BDIM. Let us denote
θ =λ/δ,
Theorem 2. The equilibrium sizes of domain families of a rational BDIM have the following asymptotics
f_{ i } C ν/λ Γ(i)^{η} θ^{ i }i^{ρβ} (4.2)
where the constant C = (Γ(1 + b_{ k })^β_{ k }/ Γ(1 + a_{ k })^α_{ k }. (4.3)
The proof is given in the Mathematical Appendix.
Corollary 1. If η = 0, then the rational BDIM is firstorder balanced and the sequence of equilibrium numbers of domain families {f_{ i }} has a powerexponential asymptotics
f_{ i } C ν/λ θ^{ i }i^{ρβ}. (4.4)
In particular, if ρ  β > 1, the equilibrium frequencies p_{ i } follow the Pascal distribution with parameters (ρ  β + 1, θ);
if ρ  β = 1, then frequencies p_{ i } follow the truncated logarithmic distribution;
if ρ  β = 0, then frequencies p_{ i } follow the geometric distribution.
Corollary 2. The equilibrium sizes of domain families f_{ i } and equilibrium frequencies p_{ i } for a rational BMID have the power asymptotics if and only if η = 0 and λ = δ, i.e. the BDIM is secondorder balanced, in which case
f_{ i } C ν/λ i^{ρβ}. (4.5)
Formula (4.4) gives the asymptotics for the equilibrium sizes of domain families f_{ i } and, accordingly, for the total number of families F_{ eq }. The exact expressions for these quantities are given in the proofs of Theorem 2 and Lemma (see Mathematical Appendix).
Proposition 3.
i. The equilibrium sizes of domain families f _{ i } of a balanced (first or higher order) rational BDIM are
f_{ i } = C ν/δθ^{i1} [(Γ(i + a_{ k }))^α_{ k }]/ [(Γ(i + 1 + b_{ k }))^β_{ k }] for all i = 1,2,...
where
C = [(Γ(1 + b_{ k }))^β_{ k }]/ (Γ(1 + a_{ k })^α_{ k }].
ii. The total number of domain families at equilibrium is
F_{ eq } = C ν/δ( θ^{j1} (Γ(j + a_{ k }))^α_{ k }/ (Γ(j + 1 + b_{ k }))^β_{ k }).
For the rational, secondorder balanced BDIM, the ratio of the birth rate to the innovation rate is
G(N) = θ^{ i } [Γ(i + 1 + a_{ k })/Γ(1 + a_{ k })]^α_{ k } / [Γ(i + 1 + b_{ k }) / Γ(1 + b_{ k })]^β_{ k }.
The asymptotic formulas for equilibrium frequencies of rational BDIM could be considered as particular cases of the corresponding formulas of general theorem 1. Proposition 3 allows one to calculate the constants in the corresponding asymptotic formulas for the sizes of domain families for a rational BDIM. If only equilibrium frequencies are analyzed, the values of these constants become irrelevant because they contract. However, if the actual values of f_{ i } and F_{ eq } are of interest, the values of the constants are required.
Properties of the main types of rational BDIM
Simple BDIM
As shown above, a simple BDIM can have equilibrium frequencies only of the form p_{ i } = C θ^{ i }/i, C = const;in particular, if the distribution parameter θ < 1, we get the (truncated) logarithmic distribution. Logarithmic distributions are seen in many biological contexts, e.g., the distribution of species by the number of individuals in populations or, what is more relevant, the distribution of protein folds by the number of families per fold [38]. Thus, a simple BDIM could be potentially used for modeling the dynamics of biological systems with a logarithmic distribution of equilibrium densities. We examine this possibility in greater detail starting with the case λ = δ (secondorder balanced simple BDIM).
We can extract from Proposition 2 some additional information, which could be helpful for estimating the model parameters. It is known that
F_{ eq } (ν/δ) [lnN + C_{E}] (5.1)
This means that, in the equilibrium state of the system, the total number of domain families grows only slowly (~ln N) with the increase of the maximal number (N) of domains in a family (which is equal to the maximal possible number of domain family size classes).
Furthermore, according to equation (2.3), in the equilibrium state of a simple BDIM ν/δ = f_{1}, so we have
F_{ eq } / f_{1} lnN + C_{E} (5.2)
Formula (5.1) can be used for estimating the model parameters on the basis of empirical data.
In the more general case λ ≠ δ, we can also obtain an estimate of the rate of innovation ν. If λ < δ (θ < 1), then the series in the right part of (3.5) quickly converges,
F_{ eq } = (ν/δ) θ^{i1}/i = (ν/λ) θ^{ i }/i ν/λ (ln(1θ)),
and
ν/δ = F_{ eq } θ/(ln(1θ)). (5.3)
Taking into account that ν/δ = f_{1} (2.3), we have a relation
F_{ eq }/f_{1} ln(1θ)/θ, (5.4)
which allows the parameter θ to be estimated on the basis of empirical data.
If N can be estimated independently and is not very large, we can use more exact relations:
Further, if (1θ)N is small (i.e., θ is very close to 1), then the approximation
has an error less then [N(1θ)]^{2}/4 and, in this case,
F_{ eq }/f_{1} (C_{ E }  N(1  θ))/θ. (5.5)
If (1  θ)N is large, then the following inequalities provide simple bounds for F_{ eq }/f_{1} = θ^{i1}/i:
 (ln(1θ)/θθ^{ N }/[(N+1)(1θ)] < θ^{i1}/i < ln(1θ)/θθ^{ N }[1/(N+1)θ/(N+2)]. (5.6)
For the simple BDIM, the ratio of the rate of duplications to the innovation rate is
G(N) = λ_{ i }f_{ i }/ν = θ^{ i } = θ(1θ^{N1})/(1θ),
so G(N) → ∞ if θ > 1 and G(N) → 1/(1θ) if θ < 1 at N→∞.
If the simple BDIM is the 2^{nd} order balanced, θ = 1, then G(N) = N  1.
Thus, for the simple, secondorder balanced BDIM, the number of duplications per time unit is N1 times greater than the number of innovations.
The total number of domains in the equilibrium state for the simple BDIM is
M(N) = if_{ i } = ν/λθ(1θ^{ N })/(1θ).
If a simple BDIM is secondorder balanced, then G(N) = ν/λ N.
Linear BDIM
We saw that the assumption of independence of birth and death rates of individual domains on each other and on the size of domain families is incompatible with any power distribution of the equilibrium frequencies with the degree not equal to 1. The simplest case of a BDIM, which can have, depending on the parameters, three types of asymptotic behavior described by Theorem 1 (excluding the first one, hyperexponential, which corresponds to a nonbalanced BDIM; all linear BDIMs are balanced) and, in particular, any power asymptotics, is a model with linear birth and death rates of the form:
λ_{ i } = λ (i + a), δ_{ i } = δ (i + b), where a and b are constants. (5.7)
Corollary 1 of Theorem 2 implies that equilibrium frequencies p_{ i } of a linear BDIM have asymptotics
p_{ i } ~ θ^{ i }i^{ab1}, where θ = λ/δ. (5.8)
In particular, if λ ≠ δ and a = b, the linear BDIM is firstorder balanced and the equilibrium frequencies p_{ i } follow the logarithmic distribution (in this case, the linear BDIM is asymptotically equivalent to the simple BDIM). If λ = δ, the linear BDIM is secondorder balanced and the equilibrium frequencies p_{ i } follow the power distribution
p_{ i } ~ i^{ab1}. (5.9)
Thus, the dependence of the domain frequency on the family size is actually determined by the difference a  b. If a >b, the birth rate decreases faster than the death rate with the increase of family size, i. e. there seems to be a "competition" between domains in a family; in contrast, if a <b, the death rate drops faster, i.e. a "synergy" between domains appears to exist (Fig. 4).
More detailed information can be obtained using Proposition 4:
i) for a firstorder balanced linear BDIM, the equilibrium sizes f _{ i } of domain families are
f_{ i } = c ν/δθ^{i1}Γ(i + a)/(Γ(i + 1 + b)) for all i
where
c = Γ (1 + b)/Γ (1 + a)
and the total number of domain families at equilibrium is
F_{ eq } = c ν/δ[ θ^{j1}Γ(j + a) / (Γ(j + 1 + b)]. (5.10)
ii) for a secondorder balanced linear BDIM (θ = 1),
f_{ i } = c_{1}ν/δ Γ (i + a)/Γ (i + 1 + b)
and
According to (2.3), in the equilibrium state of a linear BDIM, f_{1} = ν/δ_{1} = ν/(δ(1 + b)) and so, for a secondorder balanced linear BDIM, we have the formula
Suppose that equilibrium frequencies obtained from empirical data follow the power distribution p_{ i } ~ i^{γ}; in this case, γ is the slope of the empirical curve (lnf_{ i } versus lni) and can be estimated from the data. Assuming that the system is well described by a linear BDIM, it follows from (5.9) that a  b = 1  γ and λ = δ. Thus,
f_{ i } = c ν/δ Γ (i + a)/Γ (i + a + γ), where c = Γ (γ + a)/Γ (1 + a), (5.12)
and
where a is the single free parameter.
For the linear secondorder balanced BDIM, the ratio of the birth rate to the innovation rate is
if 1 + a  b ≠ 0. As
if 1 + a  b < 0 and G(N)→∞ if 1 + a  b > 0 at N→∞.
The case 1 + a  b = 0 (slope of the asymptote in double logarithmic coordinates equal to a  b  1 = 2) is a critical one.
In this case,
G(N) = Γ(1 + b) / Γ(b) Γ (i + b) / (Γ (i + 1 + b) =
Accordingly, G(N)→∞ at N→∞.
The total number of domains in the equilibrium state for a secondorder balanced linear BDIM is
If the slope of the asymptote γ = 1, the linear secondordered BDIM shows the same asymptotic behavior as a simple BDIM (2.1), but behaves differently at small i. If γ ≠ 1, the system cannot be described by a simple BDIM even asymptotically, but can be described by a linear BDIM. As indicated above, in this case, the average perdomain birth and death rates depend on the size of the domain family and the difference ab characterizes this dependence.
Quadratic BDIM
The linear BDIM takes into account the dependence of average birth and death rates of individual domains on the size of domain family, but does not imply a specific form of interaction between domains. Let us consider the simplest, pairwise interaction, which leads to λ_{ i } ~ i^{2} and/or δ_{ i } ~ i^{2}, i.e. one or both rates are polynomials on i of the second degree. If these degrees are different (i.e., λ_{ i } ~ i and δ_{ i } ~ i^{2}), then the corresponding BDIM is nonbalanced and equilibrium frequencies have hyperexponential asymptotics. Thus, let
λ_{ i } = λ (i^{2} + r_{1}i + r_{2}), δ_{ i } = δ (i^{2} + q_{1}i + q_{2}), (5.13)
where r_{ k }, q_{ k }, k = 1,2 are constants (such that λ_{ i }, δ_{ i } are positive for all i) or
λ_{ i } = λ (i + a_{1})(i + a_{2}),
δ_{ i } = δ (i + b_{1})(i + b_{2})
Then, r_{1} = a_{1} + a_{2}, q_{1} = b_{1} + b_{2}, and
χ (i) = λ_{i1}/δ_{ i } = θ (1 + (r_{1}q_{1}2)/i + O(1/i^{2})),
where θ = λ/δ.
According to theorem 3 and Proposition 3, the quadratic BDIM with rates (5.13) has equilibrium sizes of domain families
f_{ i } = c_{2} ν/δ θ^{i1} Γ (i + a_{1}) Γ (i + a_{2}) / (Γ (i + 1 + b_{1}) (Γ (i + 1 + b_{2})) c_{2}ν/δ θ^{i1}i^{ρ2} (5.14)
where ρ = r_{1}  q_{1} and the constant c_{2} = [(Γ (1+b_{1}) Γ (1+b_{2})] / [Γ (1+a_{1}) Γ (1+a_{2})], and the total number of domain families at equilibrium
F_{ eq } = c_{2}ν/δ ( θ^{j1} Γ(j+a_{1}) Γ(j+a_{2}) / (Γ(j+1+b_{1}) (Γ(j+1+b_{2})). (5.15)
Note that the asymptotic behavior of frequencies p_{ i } does not depend on free coefficients r_{2}, q_{2} in (5.13), but only on θ and r_{1}q_{1} (as follows from (5.14)), although the values of f_{ i } are proportional to the constant c_{2}, which could depend on the free coefficients r_{2}, q_{2}. Let us consider the case r_{2} = q_{2} = 0 in more detail.
If only square terms are present in the expressions for the birth and death rates, λ_{ i } = λi^{2}, δ_{ i } = δi^{2}, then a_{ k } = b_{ k } = 0, k = 1,2 and so c_{2} = 1, f_{ i } = ν/δ θ^{i1}/i^{2} and F_{ eq } = ν/δ θ^{j1}/j^{2}. So at N→∞
F_{ eq } ν/δ θ^{j1} / j^{2} = ν/λ Polylog(2,θ) (5.16)
where Polylog is a special function, Polylog(k,x) = x^{ j }/j^{ k }.
According to (3.2), f_{1} = ν/δ_{1}; for this particular case of quadratic BDIM, f_{1} = ν/δ and
F_{ eq }/f_{1} Polylog(2,θ). (5.17)
Formula (5.17) allows estimating parameter θ from empirical data if N is large enough.
More precisely, F_{ eq } = ν/λ θ^{ j }/j^{2} = ν/λ (Polylog(2,θ)θ^{1+N} LerchPhi(θ,2,1+N)), where LerchPhi is a special function (these and other special functions used below can be computed using program packages Mathematika or Maple).
If, additionally, θ = 1 (the BDIM is secondorder balanced), then
f_{ i } = (ν/δ)/i^{2} = f_{1}/i^{2} (5.18)
and, at large N
F_{ eq } ν/δ π^{2}/6 1.645 ν/δ = 1.645f_{1}. (5.19)
From formulas (5.8), (5.15), we can extract some additional information, which could be helpful for estimating the model parameters at relatively small N. Let us recall definitions of some special functions.
The digamma function φ(z) is logarithmic derivative of Γ(z), φ(z) = Γ'(z)/Γ(z).
The function PolyGamma(n,z) is n^{ th } derivative of φ(z), PolyGamma(n,z) = d^{ n }φ(z)/dz^{ n }, such that φ(z)= PolyGamma(0,z).
It is known that
Thus we have
F_{ eq } = ν/δ 1/j^{2} = ν/δ [π^{2}/6PolyGamma(1,1+N)] (5.20)
F_{ eq }/f_{1} = π^{2}/6PolyGamma(1,1+N),
which can be used for estimating unknown parameters of the model.
The values of PolyGamma(1,x) are tabulated and can be computed using standard program packages; for a rough preliminary estimate, PolyGamma(1,x) = 1/x+1/2x^{2}+O(1/x^{4}).
If linear terms are also present in the quadratic BDIM, λ_{ i } = λ (i^{2}+a_{1}i), δ_{ i } = δ (i^{2}+b_{1}i), then
f_{ i } = c_{2}ν/δ θ^{i1}/i Γ (i+a_{1})/Γ (i+1+b_{1})
where c_{2} = Γ (1+b_{1})/Γ (1+a_{1}); F_{ eq } = Σf_{ i } can be computed using special functions. In particular, if the BDIM is secondorder balanced, θ = 1, then
f_{ i } = c_{2}ν/δ Γ (i+a_{1}) / (i Γ (i+1+b_{1})).
For this variant of the model, f_{1} = ν/δ_{1} = ν/(δ(1+b_{1})), and so
Polynomial BDIMs
The quadratic models take into account the dependence of birth and death rates of individual domains on the simplest, pairwise interactions. If interactions of higher orders are postulated, λ_{ i } ~ P_{ n }(i) and/or δ_{ i } ~ Q_{ m }(i), where P_{ n }(i), Q_{ m }(i) are polynomials on i of the nth and mth degrees. Again, if the degrees n and m are different, then the BDIM is nonbalanced and equilibrium frequencies have hyperexponential asymptotics. Thus, let n = m,
λ_{ i } = λR (i) = λ r_{ k }i^{mk}, δ_{ i } = δQ(i) = δ q_{ k }i^{mk} (5.21)
where r_{ k }, q_{ k } are constants and r_{0} = q_{0} = 1. We suppose, of course, that R(i), Q(i) are positive for all integer i. Note that, in this case, χ (i) ≡ λ_{i1}/δ_{ i } = θ (1+(r_{1}  q_{1}  m)/i+O(1/i^{2})), where θ = λ/δ. We will suppose that θ ≤ 1.
According to Theorem 3, the polynomial BDIM with rates (5.21) has equilibrium sizes of domain families with powerexponential asymptotics
f_{ i } ~ θ^{ i }i^{ρm} (5.22)
where ρ = r_{1}  q_{1}.
In particular, if ρ  m > 1, the equilibrium frequencies p_{ i } follow the Pascal distribution with parameters (ρ  m + 1, θ);
if ρ  m = 1, the equilibrium frequencies p_{ i } follow the (truncated) logarithmic distribution;
if ρ  m = 0, the equilibrium frequencies p_{ i } follow the geometric distribution;
if λ = δ, the polynomial BDIM is secondorder balanced and the equilibrium frequencies p_{ i } follow the power distribution
p_{ i } ~ i^{ρm}. (5.23)
Note that the degree of the power distribution (5.23) depends only on m, the common degree of the polynomials (5.21), and on ρ, the difference between the coefficients r_{1} and q_{1}, and does not depend on other coefficients. In particular, if r_{1} = q_{1}, then p_{ i } ~ i^{m}. This relation could be interpreted as follows: if the first two coefficients of polynomial rates λ_{ i } and δ_{ i } are equal, then the degree of the power distribution (5.19) is equal to the "order of interactions" of domains.
Formula (5.22) can be refined. Let R(i) = (i+a_{ k }), Q(i) = (i+b_{ k }).
Then (see Proposition 3) the equilibrium numbers of domain families f_{ i } of the polynomial BDIM (5.18) are
f_{ i } = C ν/δθ^{i1} [Γ(i+a_{ k })/Γ(i+1+b_{ k })]
where C = [Γ(1+b_{ k })/Γ(1+a_{ k })], and the equilibrium total number of domain families
F_{ eq } = C ν/δ θ^{j1} [Γ(j+a_{ k })/Γ(j+1+b_{ k })].
For the polynomial model f_{1} = ν/δ_{1} = ν/(δ q_{ k }), so
F_{ eq }/f_{1} = C θ^{j1} (Γ(j+a_{ k })/Γ(j+1+b_{ k }))/ q_{ k }.
This formula can be used for estimating the model parameters.
For the polynomial secondorder balanced BDIM, the ratio of the death rate to the innovation rate is
G(N) = λ_{ i }f_{ i }/ν = ( Γ(1+b_{ k })/Γ(1+a_{ k })) Γ(i+1+a_{ k })/Γ(i+1+b_{ k }) =
Approximation of the observed domain family size distributions in prokaryotic and eukaryotic genomes with different BDIMs
Having developed the mathematical theory of BDIMs, we sought to determine which of these models, if any, adequately described the empirical data on domain family size distribution. To identify the domain sets of domains encoded in each of the genomes, the CDD library of positionspecific scoring matrices (PSSMs), which includes the domains from the Pfam and SMART databases, was used in RPSBLAST searches [12] against the protein sequences from a set of completely sequenced eukaryotic and prokaryotic genomes http://0www.ncbi.nlm.nih.gov.brum.beds.ac.uk/entrez/query.fcgi?db=Genome. The CDD domain library is partially redundant, so when the results obtained from individual PSSMs showed significant overlap (more than 50% of hits overlapped for more than 50% of their length), the corresponding domains were examined casebycase for redundancy. PSSMs representing structurally similar domains and producing overlapping lists of hits were joined into "synonymy clusters".
Domain families in sequenced genomes and parameters of the bestfit secondorder balanced linear BDIM
No. of ORFs in genome  No. of detected domain families  No. of detected domains  No. of ORFs with RPSBLAST hits  Maximum size of a family  f_{1} (f'_{1})  a  b  k  ν/δ = ν/λ  G = λ_{ i }f_{ i }/ν  

Sce^{b}  6340  1080  4575  3331  130  420 (436)  1.55  3.27  2.72  1861.8  [3.28..3.53] 
Dme  13605  1405  11734  7262  335  426 (435)  1.62  2.79  2.17  1648.2  [8.44..15.50] 
Cel  20524  1418  17054  11090  662  423 (421)  1.13  2.03  1.89  1273.0  [16.18..∞ ] 
Ath  25854  1405  21238  15006  1535  270 (277)  3.80  4.98  2.18  1657.7  [17.09..26.89] 
Hsa  39883  1681  27844  16755  1151  298 (288)  5.16  6.43  2.27  2136.2  [17.14..22.88] 
Tma  1846  772  1683  1268  97  501 (499)  0.14  2.22  3.08  1606.4  [1.04..1.06] 
Mth  1869  693  1480  1150  43  438 (436)  0.12  2.00  2.88  1305.3  [1.18..1.27] 
Sso  2977  695  1950  1614  81  386 (385)  0.36  2.04  2.68  1167.8  [1.83..2.00] 
Bsu  4100  1002  3413  2502  124  507 (510)  0.48  2.01  2.53  1534.6  [2.46..2.79] 
Eco  4289  1078  3624  2765  140  523 (519)  0.84  2.54  2.70  1837.0  [2.45..2.61] 
To enable statistical analysis using the χ^{2}method for the entire range of the data, including the sparsely distributed classes corresponding to large families, the data needed to be combined. For each genome, the observed domain family frequencies were grouped into bins, each containing at least 10 families; typically, bins corresponding to families with small number of members included a single size class (e.g. all singlemember families, twomember families etc), whereas bins corresponding to large families may span hundreds of size classes, most of them empty. Theoretical distribution values for a bin combining observed data from mth to nth class were computed as , where f'_{ i } is the predicted number of families in the ith class and depends on the model parameters. Since the model displays only a weak dependence on the maximum number of domains in a family (N), instead of including N as a model parameter, the sum (where i_{ max } is the number of domains in the most abundant of the detected families), was normalized to equal the total number of families detected in the given genome (a requirement for the χ^{2} analysis). χ^{2} values were computed to measure the quality of fit between the observed and the theoretical distributions. The distribution parameters (θ for the simple BDIM, a and b for the secondorder balanced linear BDIM) were adjusted to minimize the χ^{2} value.
Fitting the observed domain family size distribution with the secondorder balanced linear BDIM resulted in positive values of the parameters a and b, with a <b, for all analyzed genomes (Table 1). Accordingly, domain family size distributions in all cases asymptotically tend to the power law with the power k < 1 and, for all species with the exception of C. elegans, k < 2 (Table 1 and Fig. 8). As discussed above, this seems to indicate the existence of "synergy" between domains in a family whereby the likelihood of survival is greater for a domain that belongs to a large family than for a domain from a small family (Fig. 5). For all species, we find that the innovation rate is approximately three orders of magnitude greater than the per domain birth (death) rate. Accordingly, the total per genome birth (duplication) rate is comparable to but, typically, several times greater than the innovation rate (Table 1). The ratio of the per genome birth rate to the innovation rate increases with the number of genes in a genome or the number of detected domains, with nearly identical rates seen for small prokaryotic genomes and values as high as 20 for the largest plant and animal genomes (Table 1).
The data used to fit the BDIM typically included 50–60% of the proteins encoded in a given genome (Table 1); the remaining proteins were not represented by sufficiently similar domains in the current CDD collection. It cannot be ruled out that the fit would be significantly affected as a result of including all proteins encoded in the genome, in case the proteins currently not recognized in CDD searches have a family size distribution substantially different from that of the recognized ones. However, secondorder balanced linear BDIM can accommodate considerable perturbations of the distribution through adjustment of the parameters, so we believe that this model is likely to approximate well also the size distribution of domain families for complete sets of proteins encoded in a genome. An alternative approach that at least partially circumvents the sampling problem involves analysis of all families of paralogs detectable using clustering by sequence similarity, with employing a predefined library of domains; this analysis is beyond the scope of the present work but may be a subject of further investigation.
General discussion and conclusions
Here, we presented a complete mathematical description of the size distribution of protein domain families encoded in genomes for simple but not unrealistic models of evolution, which include three types of events: domain duplication (birth), domain elimination (death), and domain innovation. In biological terms, innovation could involve gene acquisition via horizontal gene transfer, emergence of a new domain from a noncoding sequence or a nonglobular protein sequence, or major modification of a domain obliterating its connection with a preexisting family. Innovation via horizontal gene transfer appears to be particularly common in prokaryotes [32, 39], which might account for the apparent higher relative innovation rate in prokaryotic genomes observed in the present analysis (Table 1).
We showed that birthdeathinnovation models (BDIMs) with different levels of complexity lead to readily distinguishable predictions regarding the distribution of domain family sizes in genomes. In particular, we defined the exact analytic conditions that lead, exactly or asymptotically, to power law distributions, which have recently received ample attention, as they were uncovered in various biological and social contexts [20, 25]. In contrast to previous analyses [16, 17, 30] but in agreement with the results of a recent reexamination [31], we showed that the power law only asymptotically approximates the domain family size distributions.
Three groups of observations made in this work seem to have the greatest potential of enhancing our understanding of genome evolution and, perhaps, evolution of other complex systems. First, we proved that, within the BDIM framework, there is a unique equilibrium state of the system, which is approached exponentially, with respect to time, from any initial state. In this equilibrium state, the number of domain families in each size class remains constant and follows a unique distribution depending on the type and parameters of the BDIM. In particular, power asymptotics emerges when and only when a BDIM is secondorder balanced, i.e. the rates of domain birth and death are asymptotically equal. Since we showed that the observed size distributions of domain families in all analyzed genomes indeed tend to power law asymptotics, the results are compatible with the notion that the genomes are close to a steady state with respect to the domain diversity (F_{ eq }, the number of distinct domain families at equilibrium, under the using the BDIM convention) and distribution (f_{ i }). Taking a broader biological perspective, this result might indicate that evolving lineages go through lengthy periods of relative stasis when the level of genomic complexity remains more or less the same. Under this view, the stasis epochs are punctuated by relatively short periods of dramatic changes when the complexity either greatly increases (the emergence of eukaryotes is the most obvious case in point) or decreases (e.g. evolution of parasites). These bursts of evolution might be described as transitions between different BDIMs in the parameter space, with some of the trajectories potentially involving nonbalanced BDIMs. The analogy between this emerging picture of genome evolution and the punctuated equilibrium concept of species evolution, which has been developed through analysis of the paleontological record [40], is obvious.
Second, we showed that the simplest model that adequately describes the observed domain family size distributions is the secondorder balanced linear BDIM; in contrast, simple BDIMs do not show a good fit to the observations. This has potentially important implications for the mode of domain family evolution. Simple BDIMs are based on the notion that the likelihood of duplication (birth) or elimination (death) of a domain is uniform across the genome and does not depend on the size or other characteristics of domain families (the independence assumption). Clearly, under the independence assumption, a duplication (birth) as well as elimination (death) of a domain is more likely to occur in a large family than in a small one, but only because the overall probability of such an event is proportional to the number of family members, whereas the birth (death) rate per domain remains the same. The key observation of this work, that the actual domain frequency distributions are well described by a linear but not by a simple BDIM, suggests that the independence assumption is an oversimplification. Instead, the linear BDIM includes parameters that describe the dependence of the per domain birth (death) rate on the family size. The asymptotics of the theoretical distribution that is the best fit for the actual data is a power law, with the power equal to ab1, where a and b are the parameters of a linear BDIM. We observed that, for all analyzed genomes, ab1 < 1 (a <b), which corresponds to "synergy" between domains in a family. Both the domain birth rate and the death rate drop with the increase of the size of a domain family, but the death rate decreases faster (Fig. 5). In general terms, this suggests that small families are more dynamic during evolution than large families. In particular, under the BDIM formalism, innovation contributes only to singlemember families (class 1), which have the greatest evolutionary mobility, and either quickly proliferate and are stabilized or perish. An implication of these observations is that, in general, large families are older than small ones. Exceptions to this generalization, i.e. the existence of small, ancient families, probably point to selection for a specific family size; for example, it seems likely that selection acts against proliferation of certain essential proteins, e.g. ribosomal proteins, which typically form singlemember families [41]. Another pertinent observation is that the linear BDIM seems to adequately accommodate even the largest of the identified domain families. Lineagespecific expansion of paralogous families appears to be one of the principal modes of organismic adaptation during evolution [13, 14, 42]. Thus, quantitatively, adaptive family expansion appears to fit within the BDIM framework, although these models do not explicitly incorporate the notion of selection. Of course, for BDIMs, it is irrelevant which families expand, and this choice is determined by selection.
Third, the BDIM equilibrium condition with respect to the total number of domain families, ν = δ_{1}f_{1} (ν is the innovation rate, δ_{1} is the domain death rate for class 1 families, and f_{1} is the number of domain families in class 1) allows us to estimate the ratio between domain innovation rate and the domain death and birth rates. Indeed, according to the above and the definition of a secondorder linear BDIM, which is the best fit for the actual data, λ = δ = ν/f_{1}(1+b). Since the number of domain families in class 1 (families with only one member) is in the hundreds for each genome, this translates into an innovation rate that is much greater than the duplication or elimination rate per domain (Table 1). Such high innovation rates might appear counterintuitive, but let us note that the duplication rate over all domain families is a number that tends to be nearly identical to ν for small prokaryotic genomes and severalfold greater than ν for large eukaryotic genomes (Table 1). Thus, under the secondorder balanced linear BDIM, the likelihood of appearance of a new domain in a genome is close to or several times less than the likelihood of a duplication or elimination of an existing domain. Nevertheless, the finding that the innovation rate is comparable to the overall duplication/elimination rate seems surprising. If the linear BDIM is indeed a realistic evolutionary model, this emphasizes the critical role of innovation in maintaining the balance (steady state) in genome evolution. In prokaryotes, innovation via horizontal gene transfer appears to be particularly extensive [32, 39], which might underlie the greater relative innovation rate in these organisms (Table 1).
As already indicated, BDIMs do not explicitly incorporate selection. However, the present analysis shows that only models with precisely balanced domain birth, death and innovation rates can account for the observed distribution of domain family size in each of the analyzed genomes. It seems likely that the balance between these rates is itself a product of selection. There is little doubt that BDIMs will be eventually replaced by more sophisticated formalisms that will more realistically capture the mechanisms of genome evolution. Nevertheless, even the crude modeling described here seems to reveal several potentially interesting and nontrivial aspects of the evolutionary process.
Mathematical Appendix. Proofs of some statements
Proof of Proposition 1
When system (3.1) is solved consecutively from the last equation to the second one, it becomes obvious that the solution is unique up to a constant multiplier.
Next, if f_{ i } = f_{i1}λ_{i1}/δ_{ i }, f_{i+1} = f_{i1}λ_{i1}λ_{ i }/(δ_{ i }δ_{i+1}), then the substitution shows that (f_{i1},f_{ i },f_{i+1}) satisfy the ith equation of system (3.2). Substituting f_{2} = f_{1}λ_{1}/δ_{2} in the first equation, we get f_{1} = ν/δ_{1} and, consequently, for all i = 2,...N. By definition, F_{ eq } = f_{ i }, so we have (3.3).
Since system (2.2) is linear, the equilibrium state (f_{1},...f_{ N }) is asymptotically stable if the real parts of all characteristic values of the matrix
are negative.
The following theorem (see [43]) gives the desired criterion: the real part of all the characteristic values of a real matrix C = c_{ ij }, i,j = 1,..n with nonnegative nondiagonal elements are negative if and only if (1)^{ k }D_{ k } > 0 for all k = 1,..n, where D_{ k } is the main minor of the matrix C of the kth order.
To apply this theorem, let us consider the n × n matrix, n ≤ N
It is easy to see that
det B_{ n } = (λ_{ n } + δ_{ n })det B_{n1}  λ_{n1}δ_{ n } det B_{n2}, (A1)
det A_{ n } = δ_{ n }det B_{n1}  λ_{n1}δ_{ n } det B_{n2}.
Using these equalities, we can prove that for any n
det A_{ n } = (1)^{ n }δ_{ n }δ_{n1...} δ_{2}δ_{1}.
Indeed,
det A_{ n } = δ_{ n } det B_{n1}λ_{n1}δ_{ n } det B_{n2}=
δ_{ n }((λ_{n1}+δ_{n1}) det B_{n2} + λ_{n2}δ_{n1} det B_{n3})  λ_{n1}δ_{ n } det B_{n2}=
δ_{ n }δ_{n1} (det B_{n2} + λ_{n2} det B_{n3})= (subsequently using (A1))=
(1)^{n2}δ_{ n }δ_{n1...} δ_{3}(det B_{2} + λ_{2} det B_{1}) = (1)^{ n }δ_{ n }δ_{n1...} δ_{2}δ_{1}.
Further, it is easy to see that for any n
det B_{ n } = det A_{ n }  λ_{ n } det B_{n1}.
Taking into account that B_{1} = (λ_{1} + δ_{1}) < 0 and that the sign of det A_{ n } coincides with (1)^{ n }, it is easy to prove that
det J_{ n } > det A_{ n } if det A_{ n } > 0 and det J_{ n } < det A_{ n } if det A_{ n } < 0.
Thus, the sign of det B_{ n } coincides with the sign of det A_{ n } and so (1)^{ n }B_{ n } > 0 for all n = 1,..N. According to the aforementioned theorem, the real parts of all the characteristic values of a real matrix A_{ N } are negative and so the single equilibrium is asymptotically stable, QED.
Proof of Proposition 2
For simple BDIM (2.1) f_{ i } = ν λ_{ k } / δ_{ k } = (ν/δ)θ^{i1}/i = (ν/λ)θ^{ i }/i, so
F_{ eq } = f_{ i } = ν/λ θ^{ i }/i, and
p_{ i } = f_{ i }/F_{ eq } = (θ^{ i }/i)/ θ^{ j }/j.
If a simple BDIM is balanced, then θ = 1 and so
F_{ eq } = ν/λ θ^{ j }/j.
p_{ i } = ν/λ F_{ eq }/i = 1/i ( 1/j)^{1}.
Proof of Theorem 1
The condition (3.10) can be rewritten as λ_{i1}/δ_{ i } = i^{ s }θ(1+a/i+O(1/i^{2})) = i^{ s }θ (1+a/i)(1+O(1/i^{2})). Thus, we can choose S in such a way that (1 + O(1/s^{2})) converge, 0 < (1 + O(1/s^{2})) < ∞. It follows that
According to Proposition 1, p_{ i } = f_{ i }/F_{ eq } ~ λ_{ k } / δ_{ k }. So
p_{ i } ~ (λ_{s1}/δ_{ s }) ~ Γ(i)^{ s } θ^{ i } (1+a/s) = Γ(i)^{ s }θ^{ i }(i+a+1)/Γ(i+1).
Applying the main asymptotic property of Γfunction, i.e. Γ (i+c)/Γ(i)~i^{ c } at large i for any c, we have
Γ (i+a+1)/ Γ (i+ 1) ~ i^{ a }, and so p_{ i } ~ Γ (i)^{ s } θ^{ i }i^{ a }.
Proofs of Corollaries 1–3
If a discrete random variable ξ has the Pascal distribution, then
P(ξ = i) 1 / Γ (r) (1q)^{ r }i^{r1}q^{ i } ~ q^{ i }i^{r1} for large i,
and it becomes evident that, for a > 1, equilibrium frequencies p_{ j } of the firstorder balanced BDIM follow the Pascal distribution with parameters (a+1,θ).
If a = 1, then p_{ i } ~ θ^{ i }/i and so p_{ i } follows the truncated logarithmic distribution with parameter θ. If a = 0, then p_{ j } ~ θ^{ i } and p_{ i } follows the geometric distribution.
Further, p_{ i } ~ i^{ a }, that is the sequence p_{ i } follows the power distribution with the power a, if and only if θ = 1, that is, if the BDIM is secondorder balanced.
Finally, if λ_{i1}/δ_{ i } = 1 + O(1/i^{2}), that is, if θ = 1 and a = 0, then p_{ i } ~ const; in particular, if λ_{i1} = δ_{ i } for all i, then, according to Proposition 1, f_{ i } = ν for all i and p_{ i } = 1/N.
Proof of Theorem 2
According to Proposition 1, system (3.1) has the unique solution:
f_{1} = νδ_{1}, f_{ i } = ν λ_{ s } / δ_{ s } for all i = 2,...N. So
f_{ i } = ν/λθ^{ i } P(s) / Q(s), i > 1.
Applying the Lemma (see below), we get
f_{ i } C ν/λ θ^{ i } Γ (i)^{η}i^{ρβ}, as i→∞,
where the constant C = [(Γ(1+b_{ k }))^β_{ k }] / [Γ(1+a_{ k })^α_{ k }].
Lemma. Let P(j) = (j+a_{ k })^α_{ k }, Q(j) = (j+b_{ k })^β_{ k }, where a_{ k }, b_{ k } are positive. Let us denote
Then with fixed j
N(j) = P(s) / Q(s) C Γ(j)^{η}j^(ρβ)
as j→∞, where
C = [(Γ(1+b_{ k }))^β_{ k }] / [Γ(1+a_{ k })^α_{ k }].
Proof.
N(j) = { [Γ(j+a_{ k }) / Γ(1+a_{ k })]^α_{ k }}/{ [Γ(j+1+b_{ k })/Γ(1+b_{ k })]^β_{ k }}=
C [(Γ(j+a_{ k }))^α_{ k }]/ [(Γ(j+1+b_{ k }))^β_{ k }]
where
C = [(Γ(1+b_{ k }))^β_{ k }]/ [Γ(1+a_{ k })^α_{ k }].
Let us use the known asymptotic relation
Γ (t+a)/Γ (t) t^{ a } with t→∞.
Thus we have
(Γ(j))^{η} [(Γ(j+a_{ k }) / Γ(j))^α_{ k }] / [(Γ(j+1+b_{ k }) / Γ(j))^β_{ k }]
(Γ(j))^{η}j^[ a_{ k } α_{ k }] / j^[ (b_{ k }+1)β_{ k }]=
(Γ(j))^{η}j^(ρβ),
and Lemma is proved.
Proof of Proposition 3
It follows from the proof of the Lemma that
f_{ i } = C ν/λ θ^{ i } [(Γ(j+a_{ k }))^α_{ k }] / [(Γ(j+1+b_{ k }))^β_{ k }] for i > 1,
where C = [(Γ(1+b_{ k }))^β_{ k }] / [(Γ(1+a_{ k }))^α_{ k }].
Let us show that f_{1} can be expressed by the same formula if i = 1. Indeed,
C ν/δ [(Γ(1+a_{ k }))^α_{ k }] / [(Γ(1+1+b_{ k }))^β_{ k }=
Thus,
F_{ eq } = C ν/δ ( θ^{j1} (Γ(j+a_{ k }))^α_{ k }/ (Γ(j+1+b_{ k }))^β_{ k }).
QED.
Contributions of individual authors
GPK developed most of the mathematical formalism and wrote the draft of the mathematical part of the manuscript; YIW performed the identification of domain in sequenced genomes and the statistical analysis of the resulting distributions and wrote the draft of the corresponding part of the manuscript; FSB proved some of the theorems; AYR largely incepted the work and contributed to the formulation of the models; EVK contributed to the inception of the work and the formulation of the models, gave the biological interpretation of the results, wrote the background and discussion sections and extensively edited the entire manuscript.
Declarations
Acknowledgements
We thank Alexei Kondrashov, Alexei Ogurtzov, and Vladimir Ponomarev for critical reading of the manuscript and the Koonin group members for helpful discussions.
Authors’ Affiliations
References
 Koonin EV, Aravind L, Kondrashov AS: The impact of comparative genomics on our understanding of evolution. Cell. 2000, 101: 573576.View ArticlePubMedGoogle Scholar
 Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC, Baldwin J, Devon K, Dewar K, Doyle M, FitzHugh W, Funke R, Gage D, Harris K, Heaford A, Howland J, Kann L, Lehoczky J, LeVine R, McEwan P, McKernan K, Meldrim J, Mesirov JP, Miranda C, Morris W, Naylor J, Raymond C, Rosetti M, Santos R, Sheridan A, Sougnez C, StangeThomann N, Stojanovic N, Subramanian A, Wyman D, Rogers J, Sulston J, Ainscough R, Beck S, Bentley D, Burton J, Clee C, Carter N, Coulson A, Deadman R, Deloukas P, Dunham A, Dunham I, Durbin R, French L, Grafham D, Gregory S, Hubbard T, Humphray S, Hunt A, Jones M, Lloyd C, McMurray A, Matthews L, Mercer S, Milne S, Mullikin JC, Mungall A, Plumb R, Ross M, Shownkeen R, Sims S, Waterston RH, Wilson RK, Hillier LW, McPherson JD, Marra MA, Mardis ER, Fulton LA, Chinwalla AT, Pepin KH, Gish WR, Chissoe SL, Wendl MC, Delehaunty KD, Miner TL, Delehaunty A, Kramer JB, Cook LL, Fulton RS, Johnson DL, Minx PJ, Clifton SW, Hawkins T, Branscomb E, Predki P, Richardson P, Wenning S, Slezak T, Doggett N, Cheng JF, Olsen A, Lucas S, Elkin C, Uberbacher E, Frazier M: Initial sequencing and analysis of the human genome. Nature. 2001, 409: 860921. 10.1038/35057062.View ArticlePubMedGoogle Scholar
 Dacks JB, Doolittle WF: Reconstructing/deconstructing the earliest eukaryotes: how comparative genomics can help. Cell. 2001, 107: 419425.View ArticlePubMedGoogle Scholar
 Chervitz SA, Aravind L, Sherlock G, Ball CA, Koonin EV, Dwight SS, Harris MA, Dolinski K, Mohr S, Smith T, Weng S, Cherry JM, Botstein D: Comparison of the complete protein sets of worm and yeast: orthology and divergence. Science. 1998, 282: 20222028. 10.1126/science.282.5396.2022.PubMed CentralView ArticlePubMedGoogle Scholar
 Rubin GM, Yandell MD, Wortman JR, Gabor Miklos GL, Nelson CR, Hariharan IK, Fortini ME, Li PW, Apweiler R, Fleischmann W, Cherry JM, Henikoff S, Skupski MP, Misra S, Ashburner M, Birney E, Boguski MS, Brody T, Brokstein P, Celniker SE, Chervitz SA, Coates D, Cravchik A, Gabrielian A, Galle RF, Gelbart WM, George RA, Goldstein LS, Gong F, Guan P, Harris NL, Hay BA, Hoskins RA, Li J, Li Z, Hynes RO, Jones SJ, Kuehl PM, Lemaitre B, Littleton JT, Morrison DK, Mungall C, O'Farrell PH, Pickeral OK, Shue C, Vosshall LB, Zhang J, Zhao Q, Zheng XH, Zhong F, Zhong W, Gibbs R, Venter JC, Adams MD, Lewis S: Comparative genomics of the eukaryotes. Science. 2000, 287: 22042215. 10.1126/science.287.5461.2204.PubMed CentralView ArticlePubMedGoogle Scholar
 Koonin EV, Tatusov RL, Rudd KE: Sequence similarity analysis of Escherichia coli proteins: functional and evolutionary implications. Proc Natl Acad Sci U S A. 1995, 92: 1192111925.PubMed CentralView ArticlePubMedGoogle Scholar
 Brenner SE, Hubbard T, Murzin A, Chothia C: Gene duplications in H. influenzae. Nature. 1995, 378: 14010.1038/378140a0.View ArticlePubMedGoogle Scholar
 Labedan B, Riley M: Widespread protein sequence similarities: origins of Escherichia coli genes. J Bacteriol. 1995, 177: 15851588.PubMed CentralPubMedGoogle Scholar
 Tatusov RL, Mushegian AR, Bork P, Brown NP, Hayes WS, Borodovsky M, Rudd KE, Koonin EV: Metabolism and evolution of Haemophilus influenzae deduced from a wholegenome comparison with Escherichia coli. Curr Biol. 1996, 6: 279291.View ArticlePubMedGoogle Scholar
 Ponting CP, Schultz J, Milpetz F, Bork P: SMART: identification and annotation of domains from signalling and extracellular protein sequences. Nucleic Acids Res. 1999, 27: 229232. 10.1093/nar/27.1.229.PubMed CentralView ArticlePubMedGoogle Scholar
 Bateman A, Birney E, Cerruti L, Durbin R, Etwiller L, Eddy SR, GriffithsJones S, Howe KL, Marshall M, Sonnhammer EL: The Pfam protein families database. Nucleic Acids Res. 2002, 30: 276280. 10.1093/nar/30.1.276.PubMed CentralView ArticlePubMedGoogle Scholar
 MarchlerBauer A, Panchenko AR, Shoemaker BA, Thiessen PA, Geer LY, Bryant SH: CDD: a database of conserved domain alignments with links to domain threedimensional structure. Nucleic Acids Res. 2002, 30: 281283. 10.1093/nar/30.1.281.PubMed CentralView ArticlePubMedGoogle Scholar
 Jordan IK, Makarova KS, Spouge JL, Wolf YI, Koonin EV: Lineagespecific gene expansions in bacterial and archaeal genomes. Genome Res. 2001, 11: 555565. 10.1101/gr.GR1660R.PubMed CentralView ArticlePubMedGoogle Scholar
 Lespinet O, Wolf YI, Koonin EV, Aravind L: The role of lineagespecific gene family expansion in the evolution of eukaryotes. Genome Res. 2002, 12: 10481059. 10.1101/gr.174302.PubMed CentralView ArticlePubMedGoogle Scholar
 Aravind L, Watanabe H, Lipman DJ, Koonin EV: Lineagespecific loss and divergence of functionally linked genes in eukaryotes. Proc Natl Acad Sci U S A. 2000, 97: 1131911324. 10.1073/pnas.200346997.PubMed CentralView ArticlePubMedGoogle Scholar
 Huynen MA, van Nimwegen E: The frequency distribution of gene family sizes in complete genomes. Mol Biol Evol. 1998, 15: 583589.View ArticlePubMedGoogle Scholar
 Qian J, Luscombe NM, Gerstein M: Protein family and fold occurrence in genomes: powerlaw behaviour and evolutionary model. J Mol Biol. 2001, 313: 673681. 10.1006/jmbi.2001.5079.View ArticlePubMedGoogle Scholar
 Rzhetsky A, Gomez SM: Birth of scalefree molecular networks and the number of distinct DNA and protein domains per genome. Bioinformatics. 2001, 17: 988996. 10.1093/bioinformatics/17.10.988.View ArticlePubMedGoogle Scholar
 Wuchty S: Scalefree behavior in protein domain networks. Mol Biol Evol. 2001, 18: 16941702.View ArticlePubMedGoogle Scholar
 Barabasi AL: Linked: The New Science of Networks. 2002, New York: Perseus PrGoogle Scholar
 Barabasi AL, Albert R: Emergence of scaling in random networks. Science. 1999, 286: 509512. 10.1126/science.286.5439.509.View ArticlePubMedGoogle Scholar
 Albert R, Barabasi AL: Statistical mechanics of complex networks. Reviews of Modern Physics. 2002, 74: 4797. 10.1103/RevModPhys.74.47.View ArticleGoogle Scholar
 Albert R, Jeong H, Barabasi AL: Error and attack tolerance of complex networks. Nature. 2000, 406: 378382. 10.1038/35019019.View ArticlePubMedGoogle Scholar
 Amaral LA, Scala A, Barthelemy M, Stanley HE: Classes of smallworld networks. Proc Natl Acad Sci U S A. 2000, 97: 1114911152. 10.1073/pnas.200327197.PubMed CentralView ArticlePubMedGoogle Scholar
 Gisiger T: Scale invariance in biology: coincidence or footprint of a universal mechanism?. Biol Rev Camb Philos Soc. 2001, 76: 161209. 10.1017/S1464793101005607.View ArticlePubMedGoogle Scholar
 Jeong H, Mason SP, Barabasi AL, Oltvai ZN: Lethality and centrality in protein networks. Nature. 2001, 411: 4142. 10.1038/35075138.View ArticlePubMedGoogle Scholar
 Jeong H, Tombor B, Albert R, Oltvai ZN, Barabasi AL: The largescale organization of metabolic networks. Nature. 2000, 407: 651654. 10.1038/35036627.View ArticlePubMedGoogle Scholar
 Dorogovtsev SN, Mendes JF: Scaling properties of scalefree evolving networks: Continuous approach. Phys Rev E Stat Phys Plasmas Fluids Relat Interdiscip Topics. 2001, 63: 05612510.1103/PhysRevE.63.056125.Google Scholar
 Krapivsky PL, Redner S: Organization of growing random networks. Phys Rev E Stat Phys Plasmas Fluids Relat Interdiscip Topics. 2001, 63: 06612310.1103/PhysRevE.63.066123.Google Scholar
 Luscombe N, Qian J, Zhang Z, Johnson T, Gerstein M: The dominance of the population by a selected few: powerlaw behaviour applies to a wide variety of genomic properties. Genome Biol. 2002, 3: research0040.00410040.0047. 10.1186/gb200238research0040.View ArticleGoogle Scholar
 Kuznetsov VA: Statistics of the numbers of transcripts and protein sequences encoded in the genome. In: 'Computational and Statistical Approaches to Genomics. Edited by: Zhang W, Shmulevich I. 2002, Boston: Kluwer, 125171.Google Scholar
 Koonin EV, Makarova KS, Aravind L: Horizontal gene transfer in prokaryotes: quantification and classification. Annu Rev Microbiol. 2001, 55: 709742. 10.1146/annurev.micro.55.1.709.View ArticlePubMedGoogle Scholar
 Feller W: An introduction to probability theory and its application. New York: Wiley, 1967–1968Google Scholar
 Ijuri Y, Simon HA: Skew distributions and the sizes of business firms. 1977, Amsterdam, New York, Oxford: NorthHolland Publishing CompanyGoogle Scholar
 Gihman II, Skorohod AV: The theory of stochastic processes. 1975, NewYork, Heidelberg, Berlin: SpringerVerlagView ArticleGoogle Scholar
 Johnson NL, Kotz S, Kemp AW: Univariate discrete distributions. 1992, New York: WileyGoogle Scholar
 Henrici P: Applied and computational complex analysis. 1986, New York: WileyGoogle Scholar
 Wolf YI, Grishin NV, Koonin EV: Estimating the number of protein folds and families from complete genome data. J Mol Biol. 2000, 299: 897905. 10.1006/jmbi.2000.3786.View ArticlePubMedGoogle Scholar
 Lawrence JG, Ochman H: Amelioration of bacterial genomes: rates of change and exchange. J Mol Evol. 1997, 44: 383397.View ArticlePubMedGoogle Scholar
 Gould SJ: The Structure of Evolutionary Theory. 2002, Cambridge, MA: Harvard Univ. PressGoogle Scholar
 Tatusov RL, Galperin MY, Natale DA, Koonin EV: The COG database: a tool for genomescale analysis of protein functions and evolution. Nucleic Acids Res. 2000, 28: 3336. 10.1093/nar/28.1.33.PubMed CentralView ArticlePubMedGoogle Scholar
 Remm M, Storm CE, Sonnhammer EL: Automatic clustering of orthologs and inparalogs from pairwise species comparisons. J Mol Biol. 2001, 314: 10411052. 10.1006/jmbi.2000.5197.View ArticlePubMedGoogle Scholar
 Gantmacher FR: The theory of matrices. 1989, New York: Chelsea Publishing CompanyGoogle Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article: verbatim copying and redistribution of this article are permitted in all media for any purpose, provided this notice is preserved along with the article's original URL.