
Abstract

Much is known regarding the structure and logic of genetic regulatory networks. Less understood is the contextual organization of promoter signals used during transcription initiation, the most pivotal stage during gene expression. Here we show that promoter networks organize spontaneously at a dimension between the 1-dimension of the DNA and 3-dimension of the cell. Network methods were used to visualize the global structure of E. coli sigma (σ) recognition footprints using published promoter sequences (RegulonDB). Footprints were rendered as networks with weighted edges representing bp-sharing between promoters (nodes). Serial thresholding revealed phase transitions at positions predicted by percolation theory, and nuclei denoting short steps through promoter space with geometrically constrained linkages. The network nuclei are fractals, a power-law organization not yet described for promoters. Genome-wide promoter abundance also scaled as a power-law. We propose a general model for the development of a fractal nucleus in a transcriptional grammar.

Keywords

power-law scaling promoter footprint systems biology transcription

Introduction

In prokaryotes, one of several sigma (σ) factors binds to a promoter upstream of a gene and helps position RNA polymerase during transcription initiation. Though consensus and canonical promoter motifs are frequently referenced in textbooks and the literature, genome-scale surveys have forced a reconsideration of the specific role played by these idealized sequences.^1–3 Actual promoters can vary in sequence considerably while still binding the same σ, though efficiencies vary several-fold.⁴ Collectively these promoter sequences form a footprint in promoter space, defining a regulon of genes responsive to a particular environmental cue or cellular need. Each σ represents a hub, or highly connected node, in the overall gene regulatory network. Our concern in this study is with the structure of promoter variation, specifically the topology of a hub footprint.

Our use of networks to visualize promoter diversity departs from their traditional use in gene regulation research. Putting aside protein interaction networks (PINs), transcriptional interdependencies are visualized using two main approaches: (1) Most common is the gene regulatory network (GRN), often generated using gene expression data, which conveys information on the realized interdependencies among genes.^5–8 Nodes represent genes, and certain of the protein products act as regulators of one or more of the genes in the network. Regulatory relationships are denoted by directed edges between nodes, and global studies of the transcriptome are now commonplace. (2) Studies that explicitly consider promoter diversity focus more on the nature and pattern of variation in the cis-element signals used to initiate transcription, but here global or large-scale network approaches are not typical. For example, one promoter diversity study² examined the details of σ⁷⁰ promoter variation in E. coli, but did not render relationships as a network. Another study⁹ developed a regulatory network for acid resistance genes in E. coli, but theirs was a conceptual model. Another³ produced a hierarchical clustering model representing sequence similarities among 441 E. coli promoters, yet hierarchical trees carry the unnecessary constraint that cycles must be avoided in the rendering of network relationships.

Here we explore the structure of promoter networks from E. coli using affiliation-based subgraph extractions, or serial thresholding. Promoter predictions were obtained from RegulonDB and include three regulons mediating a type of stress response¹⁰ (σ²⁴, σ²⁸, and σ⁵⁴) along with the larger housekeeping σ⁷⁰. Networks were generated with edge weights representing the number of bases shared between pairs of promoter sequences (nodes). Rather than exploring the network in its totality as a weighted graph, we broke the network into a series of subgraphs based on edge weights and examined subgraph features separately. In particular, attributes of the LCC (largest connected component) of each network were tracked across a range of critical edge values.

We consider the following specific questions: (1) What is the apparent role, if any, of the consensus promoter motif? What is the frequency of predicted promoters in the genome? (2) What is the topological structure of variation across promoter sequences in a regulon of genes, and does this structure vary across regulons? How does the organization of predicted promoter networks compare to that of networks built from random sequence promoters? (3) Do the results suggest a mechanism for promoter evolution?

Experimental Procedures

Promoter sequences

Promoter sequences were obtained from RegulonDB. The RegulonDB database¹¹ (http://RegulonDB.ccg.unam.mx/) is the primary reference database for the transcriptional regulatory network of Escherichia coli K-12 (substr. MG1655, GenBank ref. seq. NC_000913.2, GI: 49175990). Predictions are anchored by experimental evidence on the location of transcription start sites determined by RegulonDB using a modified 5'RACE procedure.

Predicted promoter data files (accessed 5.26.09) contained the base sequence of both boxes (–35 and –10 boxes) and the size in bp of the intervening spacer region, along with promoter positions in the genome. We studied three regulons in detail: σ²⁴ (799 genes), σ²⁸ (122 genes), σ⁵⁴ (151 genes). The large housekeeping regulon σ⁷⁰ (4010 genes) was added later in the study. Base sequence information included: σ²⁴ and σ⁵⁴, 11 bp (6 bp of –35 box, 5 bp of –10 box); σ²⁸, 15 bp (7 and 8 bp, respectively); and σ⁷⁰, 17 bp (9 and 8 bp, respectively). Alignments used were as provided by RegulonDB.

Power-law scaling of promoter abundances

We used a Perl script to survey the E. coli K-12 genome and assess the abundance of the predicted promoter motifs along with their inferred consensus sequence for each regulon. These distributions were evaluated for their fit to a Pareto distribution¹² using Matlab. For this purpose we evaluated F_c for each graph, the complementary cumulative distribution function (ccdf), which is a monotonically non-increasing function describing the probability that a random variable takes a value greater than x:

F_{c} (x) = P (X > x) = 1 - cdf (x) \sim {(\frac{x}{x_{m}})}^{- α}

where cdf is the standard cumulative distribution function, and x_m is the minimum value taken by x. In the evaluation of promoter frequencies in the genome, we used both a measure of goodness of fit (R²) and an estimate of the scaling coefficient (γ). After taking logs of both sides, α was obtained as the slope:

\log (y) \sim - α (\log x - \log x_{m})

and the scaling exponent as γ = α + 1 such that

P (x) = x^{- γ}
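The fitting steps above can be sketched in a few lines of Python. This is an illustrative, standard-library version and not the Matlab routine used in the study; the function names `ccdf_points`, `linear_fit`, and `fit_pareto_exponent` are hypothetical, and ordinary least squares on the log-log ccdf is assumed as the fitting method.

```python
import math

def ccdf_points(values):
    """Return (x, P(X > x)) for each distinct value x, in ascending order."""
    n = len(values)
    ordered = sorted(values)
    points = []
    for i, x in enumerate(ordered):
        if i + 1 < n and ordered[i + 1] == x:
            continue  # wait for the last occurrence of a repeated value
        points.append((x, (n - i - 1) / n))
    return points

def linear_fit(xs, ys):
    """Ordinary least squares; returns (slope, intercept, r_squared)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx, (sxy * sxy) / (sxx * syy)

def fit_pareto_exponent(values):
    """Estimate alpha from the log-log ccdf slope, then gamma = alpha + 1."""
    x_m = min(values)
    pts = [(x, p) for x, p in ccdf_points(values) if p > 0]  # log(0) is undefined
    log_x = [math.log(x / x_m) for x, _ in pts]
    log_p = [math.log(p) for _, p in pts]
    slope, _, r_squared = linear_fit(log_x, log_p)
    alpha = -slope
    return alpha, alpha + 1.0, r_squared
```

Given a list of per-motif genome counts, `fit_pareto_exponent(counts)` would return the estimated α, the corresponding γ, and R². The largest value is dropped from the regression because its empirical ccdf is zero.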

Predicted promoter networks

Sequence and spacer information were used to calculate A_ij, the number of bp shared between promoter sequences i and j. A gap penalty (–1 per bp) was applied for mismatches in spacer sizes in the RegulonDB alignments. These weighted edge values formed a matrix, A, which was used to construct a network, or graph G. Networks were visualized using Pajek¹³ and the Kamada-Kawai¹⁴ projection. Networks were analyzed using scripts written in Python that utilized NumPy, SciPy, and NetworkX,¹⁵ an open source Python package for the analysis of complex networks (http://networkx.lanl.gov/).
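As a rough illustration of how A_ij might be computed, the sketch below counts position-wise matches across the two aligned boxes and subtracts one per bp of spacer-size difference. The `Promoter` record, the function names, the reading of the gap penalty as the absolute spacer difference, and the floor at zero are all assumptions made for this example; the source does not spell out these details.

```python
from collections import namedtuple

# Hypothetical record: -35 box, -10 box, and spacer length in bp.
Promoter = namedtuple("Promoter", ["box35", "box10", "spacer"])

def shared_bp(p, q):
    """Bases shared position-by-position, minus 1 per bp of spacer mismatch."""
    seq_p = p.box35 + p.box10
    seq_q = q.box35 + q.box10
    matches = sum(a == b for a, b in zip(seq_p, seq_q))
    penalty = abs(p.spacer - q.spacer)   # assumed form of the gap penalty
    return max(0, matches - penalty)     # assumed floor at zero

def sharing_matrix(promoters):
    """Symmetric matrix of edge weights A[i][j]; the diagonal is left at 0."""
    n = len(promoters)
    A = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            A[i][j] = A[j][i] = shared_bp(promoters[i], promoters[j])
    return A
```

With two 11-bp promoters that differ at one base and have equal spacers, `shared_bp` gives 10; a 2-bp spacer difference on otherwise identical promoters gives 9.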

Random promoter networks

Random promoter networks were generated for Monte Carlo tests by forming a set of n promoters, each through B random draws from a uniform base distribution (A, C, G, T). We considered the three RegulonDB systems, σ²⁴, σ²⁸, and σ⁵⁴, with promoter numbers n and footprint sizes B as noted above. The size of the spacer separating the –10 and –35 boxes was randomly drawn from the distribution of sizes in the relevant data set. Random promoter networks were then produced in the same fashion as with the predicted promoter networks.
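A minimal sketch of the random promoter generator described above, using only Python's standard `random` module. The function name and the tuple layout (–35 box, –10 box, spacer) are illustrative choices, and the spacer sizes passed in stand for the observed sizes of the relevant data set.

```python
import random

def random_promoters(n, box35_len, box10_len, spacer_sizes, rng):
    """Draw n promoters with uniform random bases and resampled spacer sizes."""
    promoters = []
    for _ in range(n):
        bases = "".join(rng.choice("ACGT") for _ in range(box35_len + box10_len))
        spacer = rng.choice(spacer_sizes)   # sample from the observed sizes
        promoters.append((bases[:box35_len], bases[box35_len:], spacer))
    return promoters
```

For the σ⁵⁴ set this would be called with n = 151 and an 11-bp footprint (6 + 5), for example `random_promoters(151, 6, 5, observed_spacers, random.Random(1))`, where `observed_spacers` is a placeholder for the spacer sizes read from the data file.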

Network extractions using thresholding

Subgraphs were extracted using serial thresholding, or affiliation-based extraction,¹⁶ performed as follows. For m-slices, we sequentially removed all edges from graph G below a sliding critical integer threshold m (1 < m < B), where B was the maximum number of bp in the promoter sequence. For x-sections, we used discrete intervals based on the same sliding scale of integer threshold values, removing edges above and below that value of x. At each step, we then extracted the largest (maximal) connected component (LCC), the largest set of nodes that remain interconnected after selective edge removal from G. For each LCC, the number of nodes (graph size) and number of edges were evaluated. LCCs that retained at least half of the nodes in graph G were, by definition, giant components.
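The two extraction modes can be sketched as follows. The study used NetworkX; this version uses only the standard library so that it is self-contained, with a breadth-first search standing in for a library call such as NetworkX's `connected_components`. The function names are hypothetical, and the edge weights are assumed to be given as a symmetric matrix.

```python
from collections import deque

def largest_component(n_nodes, edges):
    """Return the node set of the largest connected component (BFS)."""
    neighbors = {i: set() for i in range(n_nodes)}
    for i, j in edges:
        neighbors[i].add(j)
        neighbors[j].add(i)
    seen, best = set(), set()
    for start in range(n_nodes):
        if start in seen:
            continue
        component, queue = {start}, deque([start])
        while queue:
            u = queue.popleft()
            for v in neighbors[u] - component:
                component.add(v)
                queue.append(v)
        seen |= component
        if len(component) > len(best):
            best = component
    return best

def lcc_counts(weights, keep):
    """Keep edges whose weight satisfies keep(w); return (nodes, edges) of the LCC."""
    n = len(weights)
    edges = [(i, j) for i in range(n) for j in range(i + 1, n)
             if keep(weights[i][j])]
    lcc = largest_component(n, edges)
    n_edges = sum(1 for i, j in edges if i in lcc and j in lcc)
    return len(lcc), n_edges

def m_slice(weights, m):
    """Remove all edges below the threshold m."""
    return lcc_counts(weights, lambda w: w >= m)

def x_section(weights, x):
    """Remove all edges above and below the value x."""
    return lcc_counts(weights, lambda w: w == x)
```

Calling `x_section(A, x)` for each integer x gives the node and edge profile of the LCC across the threshold range; an LCC holding at least half of the nodes would count as a giant component.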

Monte Carlo tests

We used Monte Carlo randomizations to compare the node and edge counts of the LCCs obtained from the predicted promoter networks with their random counterparts through a series of x-sections. Bonferroni corrections to α were used for the multiple tests within a regulon (tests were performed only on LCCs of size n ≥ 5). Each replicate involved the production of a random promoter network from which a series of x-sections were extracted, and the node and edge counts appraised and stored. The replicate stored counts were used to form 95% confidence intervals wherein the observed data value was treated as if drawn from a distribution with at least r = 1,000 replicates (σ²⁴ and σ⁵⁴, r = 1,320; σ²⁸, r = 1,800).
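One way to turn stored replicate counts into a Bonferroni-adjusted interval is a simple percentile method, sketched below. The percentile construction, the two-sided split of α, and the function names are assumptions made for illustration; the source does not state how its intervals were computed.

```python
import math

def percentile_interval(replicates, alpha=0.05, n_tests=1):
    """Two-sided percentile interval at the Bonferroni-adjusted level alpha / n_tests."""
    adjusted = alpha / n_tests
    ordered = sorted(replicates)
    last = len(ordered) - 1
    lo_idx = int(math.floor((adjusted / 2) * last))      # conservative lower index
    hi_idx = int(math.ceil((1 - adjusted / 2) * last))   # conservative upper index
    return ordered[lo_idx], ordered[hi_idx]

def outside_interval(observed, replicates, alpha=0.05, n_tests=1):
    """True when the observed count falls outside the random-network interval."""
    low, high = percentile_interval(replicates, alpha, n_tests)
    return observed < low or observed > high
```

Here `replicates` would hold, for one x-section, the LCC node (or edge) counts from the random promoter networks, and `n_tests` the number of x-sections tested within the regulon.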

Estimating the fractal dimension

Song et al¹⁷ showed how to measure the fractal dimension in a network by implementing the standard box covering method as a network coloring problem. In brief, for a given box length (l_B) or shortest path length between nodes, each node is colored in a fashion such that neighbors of like color are no further away than that current box length. Then the network is renormalized by collapsing adjacent nodes into a single node if they share the same color. Considering a range of box lengths, where each determines the renormalized node count, or graph size n, a plot of l_B versus n on a log-log scale will be linear for networks with a fractal topology. A Python script was written (using NetworkX) to implement this renormalization method. The fractal dimension d_B is obtained from linear regression of the log-log transformation of the general scaling relation:

\frac{N_{B} (l_{B})}{N} \sim l_{B}^{- d_{B}}
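The box-counting step can be illustrated with a small greedy-coloring sketch. It is written in the spirit of the coloring approach described above, not as a reproduction of the authors' script: two nodes may share a color (box) only if their shortest-path distance is less than l_B, nodes are colored in sorted order, and the graph is assumed to be connected and given as an adjacency dictionary. All function names are hypothetical.

```python
import math
from collections import deque

def bfs_distances(adj, source):
    """Shortest-path (hop) distances from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def box_count(adj, l_b):
    """Number of boxes N_B found by greedy coloring at box length l_b."""
    nodes = sorted(adj)
    dist = {u: bfs_distances(adj, u) for u in nodes}
    color = {}
    for u in nodes:
        # Colors used by nodes at distance >= l_b are not allowed for u.
        forbidden = {color[v] for v in color
                     if dist[u].get(v, math.inf) >= l_b}
        c = 0
        while c in forbidden:
            c += 1
        color[u] = c
    return len(set(color.values()))

def fractal_dimension(adj, box_lengths):
    """d_B as minus the least-squares slope of log(N_B / N) on log(l_B)."""
    n = len(adj)
    xs = [math.log(l) for l in box_lengths]
    ys = [math.log(box_count(adj, l) / n) for l in box_lengths]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return -slope
```

As a sanity check, a simple chain of nodes should give a dimension close to 1, since the number of boxes falls in inverse proportion to the box length. Greedy coloring only approximates the minimum number of boxes, so the result can depend on the node ordering.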

Results and Discussion

Power-law scaling of promoter abundance

Consensus sequence promoter motifs were not present in the predicted promoter sets from RegulonDB, and were rare or absent in the E. coli K-12 genome, as noted elsewhere.² Of the regulons we examined, only the inferred consensus for σ²⁸ occurred in the genome (three copies).

A subsequent survey of the full predicted promoter sets against the E. coli K-12 genome revealed that promoter abundances approximated a power-law. Log-log plots of the complementary cumulative distribution functions (ccdf) for promoter motif counts are shown (Fig. 1). We included the large σ⁷⁰ regulon and, generally, sets with more promoters gave a better fit to a power function. Power-law scaling has been described before for gene frequencies within and across genomes and often attributed to gene and genome duplication events.¹⁸

Figure 1

Promoter frequencies in genomes: Log-log plots of complementary cumulative distribution functions for occurrences of promoter motifs in the full genome: σ²⁸ (n = 122 genes, α = 0.300, R² = 0.615), σ⁵⁴ (n = 151, α = 1.925, R² = 0.819), σ²⁴ (n = 799, α = 2.567, R² = 0.907), and σ⁷⁰ (n = 4010, α = 1.704, R² = 0.935).

These findings support the growing view that consensus and canonical promoter motifs generally play an indirect role in genome evolution. That they rarely participate directly in transcription has been attributed to the fact that they bind σ too firmly, preventing promoter clearance and elongation, and that there is functionality in a weak promoter that can be modulated with compensatory regulators.^{1,2,4,19–21} And in many cases promoters appear to be chimeric combinations of canonical and non-canonical binding sites,^{1,22} supporting the view that ‘perfect promoters are not biologically relevant’.¹ We accept this sentiment insofar as it conveys the fact that consensus promoters actually perform little of the transcriptional work in the cell. We nuance this perspective by suggesting that the ideal consensus promoter represents the optimal DNA-protein binding chemistry and therefore serves as an organizing principle for the evolution of the transcriptional grammar and of the resultant topologies seen in the promoter networks described in this study.

Phase transitions in promoter networks

Serial extractions revealed phase transitions in the promoter networks (Fig. 2) at positions predicted by percolation theory (Fig. 3). The unreduced promoter networks were highly dense (> 0.999), occluded by numerous weak edges representing the sharing of few bases. Thresholding provided targeted windows of lowered edge density through which we examined attributes of the LCCs.

Figure 2

Largest connected components following extractions of x-sections by thresholding of three E. coli regulons. Each promoter network was broken into subgraphs based on edge weights using a series of integer threshold values (X_i, shown along the top of the figure). An x-section retained only edges of weight x = X_i bp-sharing between promoter sequences (nodes) (every other step is shown in the figure).

Figure 3

x-sectional profiles of number of nodes and edges for predicted promoters (lines) along with 95% confidence intervals (CIs, shaded regions). CIs were based on Monte Carlo simulations of random promoter networks built from sets of promoters of random base sequence, each with footprint and spacer attributes drawn from a predicted promoter set. Whereas predicted promoters were in close juxtaposition in promoter space (sharing ~7–8 bp out of 11 or 15), random promoter networks had significantly more diffuse footprints (2–4 bp shared) consistent with binomial expectations. Vertical dashed lines mark the phase transitions predicted by percolation theory.

A phase transition is an abrupt change in the state of a system associated with incremental change in a system parameter, such as the shift with temperature between liquid and gas phases described by van der Waals.²³ In networks, as edges are added (removed) randomly to a graph, there is a sudden increase (decrease) in global connectivity with emergence (fracture) of a giant component, a connected component containing at least half of the nodes.²⁴ In a random graph of n nodes, this occurs predictably around the percolation threshold p_c = 1/n.

In Figure 3, we indicate the positions of the phase transitions expected from percolation theory in our plots of node and edge numbers. In each case, an expected phase transition is marked as a vertical dashed line positioned at the edge density p_c = 1/n. The resulting alignment of these positions with the observed phase transitions in node counts is taken as evidence of concurrence with theory. With σ⁵⁴ as an example, n = 151 promoters yields a percolation threshold p_c of ~0.0066. Of the 11,325 possible edges, this translates into ~75 edges. Though our discrete categories are coarse, this is roughly where we observed the formation of the largest connected component in our x-sectional profile: between 3–4 bp shared, the number of edges changed from 4 to 340, and largest connected component size jumped from 5 to 128 nodes.
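The σ⁵⁴ arithmetic above can be reproduced directly. The helper below is a small illustrative function (its name is hypothetical) that applies p_c = 1/n to the number of possible undirected edges, n(n – 1)/2.

```python
def percolation_estimate(n):
    """Return (p_c, possible_edges, expected_edges_at_threshold) for n nodes."""
    possible_edges = n * (n - 1) // 2   # undirected graph, no self-loops
    p_c = 1.0 / n                       # percolation threshold used in the text
    return p_c, possible_edges, p_c * possible_edges

if __name__ == "__main__":
    # Promoter counts for the three regulons studied in detail.
    for name, n in [("sigma24", 799), ("sigma28", 122), ("sigma54", 151)]:
        p_c, possible, expected = percolation_estimate(n)
        print(f"{name}: p_c = {p_c:.4f}, possible = {possible}, edges ~ {expected:.1f}")
```

Because p_c times n(n – 1)/2 simplifies to (n – 1)/2, the expected edge count at the threshold is simply half the number of other nodes, which is 75 for n = 151.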

Topology of promoter networks

Whereas the LCCs from lower thresholds were fairly homogeneous and dense, containing numerous edges representing low-value bp-sharing, the LCCs emerging from the upper phase transition displayed considerable structural complexity. These network nuclei represent a significantly constrained limiting similarity among promoters as they contain information on high levels of bp-sharing among many of the promoters in the regulon. Monte Carlo tests showed that LCCs built from RegulonDB promoter sets contained significantly higher-valued edge weights than those of random promoter networks (Fig. 3).
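The binomial expectation for random promoters, noted in the Figure 3 caption, can be made concrete with a short calculation. The sketch assumes that each aligned position matches independently with probability 1/4 (uniform bases) and ignores the spacer gap penalty; the function names are illustrative.

```python
from math import comb

def match_pmf(k, footprint, p=0.25):
    """Probability that two random promoters share exactly k of `footprint` bases."""
    return comb(footprint, k) * p ** k * (1 - p) ** (footprint - k)

def expected_matches(footprint, p=0.25):
    """Mean number of shared bases between two random promoters."""
    return footprint * p

def tail_probability(k_min, footprint, p=0.25):
    """Probability of sharing at least k_min bases by chance."""
    return sum(match_pmf(k, footprint, p) for k in range(k_min, footprint + 1))
```

For an 11-bp footprint the expected sharing is 2.75 bp, in line with the 2–4 bp seen in the random networks, while sharing 7 or more bases by chance has a probability below 1% under these assumptions.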

The network nuclei have a fractal topology, as implied by their self-similar appearance (Fig. 4). LCCs captured from the upper phase transition were evaluated using the method of Song et al¹⁷ who showed how to measure the fractal dimension of a network by implementing the standard box covering method as a network coloring problem. In the regulons we examined, the average fractal dimension was d_B = 1.731 (Fig. 5). This has the biological interpretation that a unit increase in the log of the box length (modular extent of promoter sequence similarity) is met with a 1.731-fold decline in the log of the graph size (number of nodes). It is noteworthy that the weakest fractal structure was displayed by σ²⁸ which was the regulon whose consensus sequence appeared in the genome.

Figure 4

Fractal nuclei of the four regulons captured at upper phase transitions. A) σ²⁴, d_B = 1.492; B) σ⁷⁰, 1.911; C) σ²⁸, 1.929; D) σ⁵⁴, 1.590. Promoter abundance in the E. coli K-12 genome is shown as node size variation. The consensus sequence (orange node) for σ²⁸ occurred in the genome; the others did not and are included for heuristic purposes. Networks were rendered using Pajek.¹³

Figure 5

Fractal analysis of the upper phase transition nucleus for the four E. coli σ regulons: Log-log plots of box length (l_B) versus graph size (normalized number of nodes) for LCC. Fractal dimensions (mean d_B = 1.731) and coefficients of determination (mean R² = 0.957): σ²⁸ (d_B = 1.929, R² = 0.949), σ⁵⁴ (d_B = 1.590, R² = 0.959), σ²⁴ (d_B = 1.492, R² = 0.978), and σ⁷⁰ (d_B = 1.911, R² = 0.943).

Regulons with a highly fractal nucleus did not utilize their consensus promoter in the E. coli genome.

DLA model of promoter evolution

These findings, including the mean fractal dimension of d_B = 1.731, suggested a specific repulsive mechanism for development of a fractal nucleus in a promoter network. A dimension of d = 1.7 is typical of fractals arising by diffusion-limited aggregation (DLA).²⁵ In the general 2-d model, particles diffuse randomly as a Brownian motion, occasionally sticking to a growing cluster. Growth is through preferential attachment, but not to the oldest particles as in a scale-free model of network growth.²⁶ Instead, particles attach preferentially to the growing arms of the cluster since the arms increasingly obstruct access to the central region. It appears as though the center repulses any new additions.
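The general 2-d DLA process described above can be sketched as a toy lattice simulation. This is a generic illustration of diffusion-limited aggregation, not a model of promoter evolution and not code from the study; the square launch boundary, the rule of rejecting steps that leave the box, and the function name are simplifying assumptions, and the particle count should be kept small relative to the box.

```python
import random

STEPS = ((1, 0), (-1, 0), (0, 1), (0, -1))

def grow_dla(n_particles, radius, rng):
    """Grow a lattice DLA cluster from a seed at the origin."""
    cluster = {(0, 0)}
    while len(cluster) < n_particles:
        # Launch a walker from a random site on the edge of the square box.
        t = rng.randint(-radius, radius)
        x, y = rng.choice([(t, radius), (t, -radius), (radius, t), (-radius, t)])
        if (x, y) in cluster:
            continue
        while True:
            # Stick as soon as the walker touches the cluster.
            if any((x + dx, y + dy) in cluster for dx, dy in STEPS):
                cluster.add((x, y))
                break
            dx, dy = rng.choice(STEPS)
            # Steps that would leave the box are rejected (walker stays put).
            if abs(x + dx) <= radius and abs(y + dy) <= radius:
                x, y = x + dx, y + dy
    return cluster
```

Because walkers arrive from outside, they tend to meet the outer arms before reaching the center, which is the screening effect the text describes as an apparent repulsion from the center. A cluster of a few dozen particles, such as `grow_dla(40, 12, random.Random(0))`, is far too small to estimate a fractal dimension.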

A promoter network growing by DLA would be regulated by both repulsive and attractive forces, mediated on the micro-scale through DNA-protein binding chemistry, and on the macro-scale by population-level fitnesses, all organized around the consensus promoter. The consensus would form an attractor in transcriptional promoter networks because it represents the optimal binding chemistry for σ, and departures from the consensus would weaken and eventually eliminate this binding capacity.⁴ Yet it appears that the consensus and canonical motifs rarely participate directly in transcription, perhaps because they bind σ too firmly.^2,4,19–21 The resulting lowered population-level fitness would repulse additions from the network center.

These dynamics are analogous to the interatomic attractive and repulsive forces that include the van der Waals interactions.²³ Our interpretation comports with the recent generalization that repulsion is a critical prerequisite to fractal development in most complex networks.^27,28

Concluding Remarks

Our results suggest a link between the development of scaling relations in genome structure and function. This correspondence is in part anticipated by the Zipf-Mandelbrot law,^29,30 though genome work to date has emphasized frequency (structural) scaling without integrating topological (functional) scaling.

Disclosures

This manuscript has been read and approved by all authors. This paper is unique and not under consideration by any other publication and has not been published elsewhere. The authors and peer reviewers report no conflicts of interest. The authors confirm that they have permission to reproduce any copyrighted material.

Footnotes

Acknowledgements

Thanks to J. Collado-Vides and RegulonDB for use of their promoter resources; NetworkX for its computational tools; C. Song and colleagues for their fractal method; J. Nadolski and L.A. Smith for comments; T. Mikesell for Perl advice; and Benedictine University for computer resources.

References

1. Hook-Barnard I.G., Hinton D.M. Transcription initiation by mix and match elements: Flexibility for polymerase binding to bacterial promoters. Gene Regul Syst Bio. 2007; 1: 275–93.

2. Huerta, Collado-Vides. Sigma70 promoters in Escherichia coli: Specific transcription in dense regions of overlapping promoter-like signals. J Mol Biol. 2003; 333(2): 261–78.

3. Ozoline, Deev, Arkhipova. Non-canonical sequence elements in the promoter structure. Cluster analysis of promoters recognized by Escherichia coli RNA polymerase. Nucleic Acids Res. 1997; 25(23): 4703–9.

4. Hawley, McClure. Compilation and analysis of Escherichia coli promoter DNA sequences. Nucleic Acids Res. 1983; 11(8): 2237–55.

5. Barabasi A.L., Oltvai. Network biology: Understanding the cell's functional organization. Nat Rev Genet. 2004; 5(2): 101–13.

6. Albert. Scale-free networks in cell biology. J Cell Sci. 2005; 118(21): 4947–57.

7. Dartigalongue, Missiakas, Raina. Characterization of the Escherichia coli σE regulon. J Biol Chem. 2001; 276(24): 20866–75.

8. H.W. An extended transcriptional regulatory network of Escherichia coli and analysis of its hierarchical structure and network motifs. Nucleic Acids Res. 2004; 32(22): 6643–9.

9. Masuda, Church. Regulatory network of acid resistance genes in Escherichia coli. Mol Microbiol. 2003; 48(3): 699–712.

10. Janga, Collado-Vides. Structure and evolution of gene regulatory networks in microbial genomes. Res Microbiol. 2007; 158(10): 787–94.

11. Gama-Castro, Jiménez-Jacinto, Peralta-Gil. RegulonDB (version 6.0): Gene regulation model of Escherichia coli K-12 beyond transcription, active (experimental) annotated promoters and Textpresso navigation. Nucleic Acids Res. 2008; 36(Database issue): D120–4.

12. Downey. Computational modeling and complexity science (Green Tea Press, Needham, MA, 2008).

13. Batagelj, Mrvar. Pajek – Program for Large Network Analysis. Connections. 1998; 21(2): 47–57.

14. Kamada, Kawai. An algorithm for drawing general undirected graphs. Inf Process Lett. 1989; 31(1): 7–15.

15. Hagberg, Schult, Swart. Exploring network structure, dynamics, and function using NetworkX, presented at Proceedings of the 7th Python in Science Conference (SciPy2008), Pasadena, CA USA, 2008 (unpublished).

16. de Nooy, Mrvar, Batagelj. Exploratory Social Network Analysis with Pajek (Cambridge University Press, Cambridge, 2005).

17. Song, Gallos, Havlin, Makse. How to calculate the fractal dimension of a complex network: The box covering algorithm. J Stat Mech. 2007: P03006.

18. Luscombe. The dominance of the population by a selected few: Power-law behavior applies to a wide variety of genomic properties. Genome Biol. 2002; 3(8): research0040.

19. Ellinger, Behnke, Bujard, Gralla. Stalling of Escherichia coli RNA polymerase in the +6 to +12 region in vivo is associated with tight binding to consensus promoter elements. J Mol Biol. 1994; 239(4): 455–65.

20. Grana, Gardella, Susskind. The effects of mutations in the ant promoter of phage P22 depend on context. Genetics. 1988; 120(2): 319–27.

21. Miroslavova, Busby. Investigations of the modular structure of bacterial promoters. Biochem Soc Symp. 2006; 1–10.

22. Ozoline O.N., Deev A.A., Arkhipova M.V. Non-canonical sequence elements in the promoter structure. Cluster analysis of promoters recognized by Escherichia coli RNA polymerase. Nucleic Acids Res. 1997; 25(23): 4703–9.

23. van der Waals. Over de Continuiteit van den Gas- en Vloeistoftoestand (on the continuity of the gas and liquid state), Ph.D. thesis ed. (Leiden, The Netherlands, 1873).

24. Erdös, Rényi. On the evolution of random graphs. Publ Math Inst Hungar Acad Sci. 1960; 5: 17–61.

25. Witten T.A., Sander L.M. Diffusion-limited aggregation, a kinetic critical phenomenon. Phys Rev Lett. 1981; 47: 1400–3.

26. Barabasi A.L., Albert. Emergence of scaling in random networks. Science. 1999; 286(5439): 509–12.

27. Song, Havlin, Makse H.A. Self-similarity of complex networks. Nature. 2005; 433(7024): 392–5.

28. Song. Origins of fractality in the growth of complex networks. Nature Physics. 2006; 2: 275–81.

29. Zipf. Human Behavior and the Principle of Least-Effort (Addison-Wesley, Cambridge, MA, 1949).

30. Mandelbrot. The Fractal Geometry of Nature (WH Freeman and Comp., New York, 1983).

Fractal topology of gene promoter networks at phase transitions
