Initial Large-scale Exploration of Protein-protein Interactions in Human Brain

Jake Y. Chen, Andrey Y. Sivachenko, Russell Bell, Cornelia Kurschner, Irene Ota, and Sudhir Sahasrabudhe
Myriad Proteomics, Inc., 2150 W. Dauntless Ave, Salt Lake City, UT 84116, USA.
{jchen, asivache, rbell, ckurschn, ota, sudhir}@myriad-proteomics.com

Abstract

Study of protein interaction networks is crucial to post-genomic systems biology. Aided by high-throughput screening technologies, biologists are rapidly accumulating protein-protein interaction data. Using a random yeast two-hybrid (R2H) process, we have performed large-scale yeast two-hybrid searches with approximately fifty thousand random human brain cDNA bait fragments against a human brain cDNA prey fragment library. From these searches, we have identified 13,656 unique protein-protein interaction pairs involving 4,473 distinct known human loci. In this paper, we present an initial characterization of the protein interaction network in human brain tissue. We classify and characterize all identified interactions based on Gene Ontology (GO) annotation of the interacting loci. We also describe the “scale-free” topological structure of the network.

1. Introduction

Development of high-throughput screening technologies and the completion of a number of genome sequencing projects have led to the emergence of post-genomic systems biology. In this new era, biologists are increasingly interested in analyzing system-wide biological data, including whole-genome sequence data, cross-genome homolog information, global gene expression profiles, transcriptional regulatory networks, and protein interaction networks. The system-scale analysis of biological data can provide traditional biologists with new perspectives for understanding the functions of interconnected genes and proteins in their complex cellular and molecular context.
Proceedings of the Computational Systems Bioinformatics (CSB’03) 0-7695-2000-6/03 $17.00 © 2003 IEEE

We characterize systems biology studies as following a paradigm that consists of iterative cycles of two distinct but related stages. The first stage is a data-driven “bottom-up” knowledge discovery process. At this stage, computational scientists sift through large volumes of primary biological data to reveal high-order “patterns” or “models” that can characterize the underlying data. The second stage is a hypothesis-driven “top-down” knowledge discovery process. At this stage, computational biologists and biologists team up to infer and validate previously unknown details of biological processes or molecular functions, guided by the global perspectives of high-order “models”.

At Myriad Proteomics, we are undertaking a system-scale human protein-protein interaction discovery project, which involves high-throughput yeast two-hybrid screening, tandem-affinity purification mass spectrometry analysis, robotics, software engineering, data-driven bioinformatics, and target-driven drug discovery activities [1]. The scope of our efforts is unprecedented, since all other protein interaction projects reported so far have concentrated either on small non-mammalian organisms [9, 12, 14] or on small regional networks [7]. We believe that collecting and analyzing such data will help us understand protein function and molecular signaling networks, leading to drug target discoveries [3].

In this short paper, we describe high-level characteristics of the human brain protein interaction data and their subnetworks, which have been collected and compiled from a high-throughput random yeast two-hybrid (R2H) process. Specifically, we describe general statistical properties of our data set, present visualizations of the structure of the data, and examine the topology of the interaction network.

2. Methods

In this study, we have based our analysis on data generated from a “random yeast two-hybrid” (R2H) system. For a comprehensive review of related experimental and computational data generation techniques, refer to [15, 18]. The R2H system differs from the standard directed yeast two-hybrid system [5] in two ways. First, we prepare bait cDNA libraries using randomly fragmented cDNAs. Our aim is to generate cDNA insert sizes averaging 800 to 900 base pairs, which produce small hybrid proteins that facilitate protein-protein interactions. Second, we prepare a mixed yeast bait clone library at the beginning of the process, but then clonally isolate the individual baits and mate one yeast bait clone at a time against a whole yeast prey clone library (which we call an “R2H search”). Therefore, we can perform many R2H searches of anonymous bait clones in parallel, and leave the task of revealing the identity of interacting pairs from positive clones to the subsequent sequencing of both the bait and prey clone cDNA inserts. In all, the R2H system, in conjunction with our extensive use of automation and array formats (96x and 384x formats), enables us to accommodate a throughput of approximately 6,000 R2H searches per week.

We have designed and implemented a Laboratory Information Management System (LIMS) to manage the R2H data collection process, and a database platform, based on the data modeling design principles described in [6], to mine and explore our interaction data set.
Our major bioinformatics data preparation steps include: collecting primary experimental data through the LIMS; performing base-calling; cleaning sequences and clipping vector regions using CrossMatch; assembling sequencing reads from both the 5’-end and the 3’-end of the vector inserts using CAP3; identifying the assembled sequence insert by performing BLAST against the NCBI REFSEQ database; and annotating sequences using imported databases such as LocusLink and Gene Ontology [2, 11].

We have collected and analyzed interactome data involving more than 50,000 R2H searches against a prey cDNA library from expressed mRNAs in homogenized human brain. From these R2H searches, we have created the protein-protein interaction data set for this study, which contains 13,656 uniquely identified binary protein-protein interaction pairs involving 4,473 distinct protein loci. We have performed data analysis and visualization using a combination of software tools, including Oracle9i, S-Plus Analytic Server 2, and Spotfire DecisionSite Browser 7.1.

3. Results

We collected the statistics of a subset of our interaction data that comprises approximately 50,000 R2H searches. We determined three major characteristics of the data.

First, we calculated the search positive rate, which we define as the number of positive clones either “observed” or “observed-and-picked” per search. We found that ~33% of searches generated between 1 and 100 positive clones, ~2% generated more than 100 positive clones, and the remaining ~65% generated no positive clones (null searches). Thus, there were 17,371 searches (~90% of them contained unique baits) that gave rise to positive clones. On average, we observed a search positive rate of approximately three when counting all null searches, or approximately eight to nine when not counting any null searches. Figure 1 shows a plot of “observed-and-picked” (or “picked” for short) search positive counts for all of the non-null searches.

Figure 1. A scatter plot of “picked positive count” for all the ~17,000 non-null R2H searches performed on human brain libraries. The numbers along the x axis are unique numbers, SEARCHID, identifying each R2H search. The pick_pos_count numbers along the y axis are the counts of positive clones picked in a search. Note that the horizontal line at pick_pos_count = 48 reflects a decision, made at one point, to pick at most 48 positive clones from each positive search.

Second, we calculated the interaction discovery rate, which we define as the number of unique interaction pairs per search. We accumulated a total of 13,660 unique interaction pairs from 12,808 searches where both bait and prey loci have been identified. Among these 13,660 unique interaction pairs, 12,466 (~90%) were observed at least once within only one R2H search and 8,501 (~62%) were observed only once within only one R2H search. The high percentage (~90%) of novel searches suggested potentially high search efficiency of our process; the repeated detection (100% − 62% = 38%) of the same interaction pairs, on the other hand, enabled us to infer system errors of our R2H search process. While monitoring the accumulation of unique interaction pairs throughout the 17,371 non-null searches, we noticed a relatively stable “interaction discovery rate” of approximately 1.1 over the entire search period. This suggested that we might not have saturated all the interactions in the constructed human brain libraries.

Third, we calculated the cDNA fragment size for bait and prey constructs. The cDNA fragment insert sizes of our bait and prey clones for all the positive clones followed a normal distribution with a mean of ~900 bp and a standard deviation of ~250 bp.
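The three summary statistics described in this section can be computed directly from per-search records. The following is a minimal sketch under stated assumptions, not the LIMS or database code used in the study: the `Search` record layout, the function names, and the toy data are all hypothetical and serve only to make the definitions concrete.

```python
from statistics import mean, stdev
from typing import FrozenSet, List, NamedTuple, Tuple

Pair = Tuple[str, str]  # (bait locus, prey locus)


class Search(NamedTuple):
    """One R2H search; a hypothetical record layout used only in this sketch."""
    search_id: int
    positives: int                     # positive clones counted for this search
    pairs: FrozenSet[Pair]             # identified bait-prey pairs (may be empty)
    insert_sizes_bp: Tuple[int, ...]   # insert sizes of its positive clones


def search_positive_rate(searches: List[Search], include_null: bool = True) -> float:
    """Mean number of positive clones per search, with or without null searches."""
    counts = [s.positives for s in searches if include_null or s.positives > 0]
    return sum(counts) / len(counts) if counts else 0.0


def interaction_discovery_rate(searches: List[Search]) -> float:
    """Unique interaction pairs divided by the number of searches with identified pairs."""
    productive = [s for s in searches if s.pairs]
    unique_pairs = set().union(*(s.pairs for s in productive))
    return len(unique_pairs) / len(productive) if productive else 0.0


def insert_size_summary(searches: List[Search]) -> Tuple[float, float]:
    """Mean and sample standard deviation of insert sizes, in base pairs."""
    sizes = [bp for s in searches for bp in s.insert_sizes_bp]
    return mean(sizes), stdev(sizes)


# Toy data, invented for illustration only.
searches = [
    Search(1, 0, frozenset(), ()),
    Search(2, 4, frozenset({("A", "B"), ("A", "C")}), (800, 900)),
    Search(3, 8, frozenset({("A", "B")}), (1000,)),
    Search(4, 0, frozenset(), ()),
]

rate_all = search_positive_rate(searches)
rate_non_null = search_positive_rate(searches, include_null=False)
discovery = interaction_discovery_rate(searches)
size_mean, size_sd = insert_size_summary(searches)
print(rate_all, rate_non_null, discovery, size_mean, size_sd)
```

With the totals reported above, 13,660 unique pairs over 12,808 searches gives about 1.07 pairs per search, consistent with the stated rate of approximately 1.1. The last helper summarizes insert sizes the way the text reports them, as a mean and a standard deviation in base pairs.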
These fragment sizes suggested that the human protein fragments participating in the interaction events were large enough to accommodate the majority of documented protein domains: a ~900 bp insert encodes roughly 300 amino acid residues, and 90% of documented domains should be no longer than 300 residues [16].

We applied information visualization techniques, which enabled us to explore high-level data patterns and to follow up with queries of the underlying data.

Figure 2. An interaction heat map showing interactions (dots) between a set of 1,200 bait proteins and a set of 1,600 prey proteins. The numbers along the x (Bait) and y (Prey) dimensions are arbitrarily assigned protein identity labels (not REFSEQ accession numbers). The size and color intensity of a dot at (x, y) in the plot represent, in logarithmic proportion, the observed interaction frequency between bait protein x and prey protein y. An arrow in the plot marks arrestin, ARRB1.

In Figure 2 we plotted a two-dimensional “heat map” (zoomed in to show a data subset consisting of 3,392 bait-prey protein interaction pairs) created within the Spotfire DecisionSite Browser. The heat map within the Spotfire software displayed protein-protein interaction pairs and, upon user selection of a set of data points on the plot, could display interaction details on a side panel (not shown). With this heat map, we could quickly detect “promiscuous interacting proteins” (proteins that tend to interact unselectively with many other protein partners) visually as either vertical lines or horizontal lines on the plot, and examine detailed protein descriptions by clicking on the dot representing a particular interaction pair. A vertical line often suggests that the bait involved is a “self activator”, a bait protein that can activate Y2H transcription without requiring a specific interacting prey protein.
A horizontal line, on the other hand, suggests that the prey involved is a “false positive” or “sticky prey”, a prey protein that can engage in a wide spectrum of bait-prey interactions that activate Y2H transcription unselectively. For example, arrestin ARRB1 (pointed to by an arrow in the plot) showed a horizontal line pattern, suggesting that it might be a “false positive prey” in our R2H system. By browsing through ARRB1 interacting proteins, however, we also found that the majority of the characterized observed interaction partners of ARRB1 were transmembrane receptors, in accordance with the well-established role of ARRB1 in dampening activated G-protein coupled receptor signals in cells [8]. Therefore, by providing biologists with a global interaction data set and a tool to “drill down” to the data details, we reap the benefits of our investment in “systems biology”.

We showed three high-level classifications of protein interaction pairs using Gene Ontology (GO) categories in Figures 3a-c [2]. Figure 3a showed a categorized “heat map” view of 9,055 unique protein-protein interaction pairs (13,660 identified interaction pairs, minus 4,605 pairs in which neither the bait nor the prey has a GO molecular function annotation available) aggregated into 17x16 bins corresponding to pairs of high-level GO molecular function terms (including 16 molecular function categories and an additional “unknown” category for uncharacterized proteins*). To accomplish this, we annotated all the proteins by tracing their individual original GO annotation terms back to the “ancestor” terms at a fixed level in the GO hierarchy.

* Note that Figures 3a-c show individual counts of pairs falling into each bin, and due to the multiple annotations available for many loci, the total sum over all the bins exceeds the number of interacting pairs.

Figure 3a. Observed interactions binned into pairs of GO annotation terms in the Molecular Function category at level 2 in the GO hierarchy. The size and color intensity of the squares are proportional to the logarithm of the number of interactions falling into the given term pair (the lowest and the largest numbers being 1 and 3,946 respectively). The axis labels are: (AO) antioxidant, (AR) apoptosis regulator, (CA) cell adhesion molecule, (CH) chaperone, (CR) cytoskeletal regulator, (DP) defense/immunity protein, (EZ) enzyme, (ER) enzyme regulator, (LB) ligand binding or carrier, (MO) motor, (PT) protein tagging, (ST) signal transducer, (SM) structural molecule, (TscrR) transcription regulator, (TnslR) translation regulator, (TR) transporter, and (U) molecular function unknown.

Figure 3b. Observed interactions binned into pairs of GO annotation terms in the Cellular Component category at level 5 in the GO hierarchy. The axis labels are: (BM) basement membrane, (CalcC) calcineurin complex, (CCort) cell cortex, (CH) chromosome, (CO) collagen, (CY) cytoplasm, (EndoM) endomembrane system, (ExtM) extrinsic membrane protein, (InnM) inner membrane, (IF) insoluble fraction, (IntM) integral membrane protein, (MF) membrane fraction, (MitM) mitochondrial membrane, (Nu) nucleus, (OutM) outer membrane, (PerM) peroxisomal membrane, (PlsM) plasma membrane, (PT) proton-transporting ATP synthase complex, (Ribo) ribonucleoprotein complex, (SF) soluble fraction, and (U) cellular localization unknown.

We made two observations from this visualization. First, we observed diverse but non-uniform and non-diagonal patterns in the categorized protein-protein interactions. Earlier studies (see, e.g. [10]) assumed that a diagonal pattern (i.e., proteins within the same category interacting with each other) was expected for most protein interactions found in the literature. However, we believe that this pattern should be difficult to observe due to a large percentage of unknown proteins and many proteins with multiple or incomplete GO category assignments. Besides, certain cross-category interactions are biologically plausible. For example, we did observe several significant and biologically interesting cross-category interaction patterns, such as interactions between enzymes and ligand binding molecules and interactions between signal transducers and structural molecules. Our interaction data were also heavily concentrated around two functional categories of proteins: 5,887 distinct interactions involved at least one ligand binding protein and 2,261 distinct interactions involved at least one enzyme, with each category interacting with proteins from all 16+1 GO categories.

Second, we detected an opportunity to assess functions of previously uncharacterized proteins via their interactions. The figure showed that a total of 2,996 (67%) of all 4,474 observed distinct protein loci fell into the “unknown molecular function” category. However, 2,115 (71%) of these 2,996 proteins also interacted with at least one GO annotated protein. Overall, these “uncharacterized”-“characterized” protein interactions represented 49% (6,665/13,660) of all unique interactions on the plot. This observation provides both an opportunity for inferring functions of uncharacterized proteins through their interaction context and a challenge for assessing the biological significance of the interaction pairs.
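The binning behind Figure 3a, tracing each annotation back to an ancestor at a fixed level and counting pairs per category pair, can be sketched as follows. This is an illustrative sketch only: the `PARENT` table is a hypothetical toy hierarchy with a single parent per term (the real GO is a directed acyclic graph in which a term may have several parents), the level numbering assumes the root is level 1, and all locus names and function names are invented.

```python
from collections import Counter
from itertools import product

# Hypothetical toy hierarchy: child term -> parent term (None marks the root).
PARENT = {
    "molecular_function": None,
    "enzyme": "molecular_function",
    "kinase": "enzyme",
    "protein kinase": "kinase",
    "ligand binding or carrier": "molecular_function",
    "nucleic acid binding": "ligand binding or carrier",
}


def ancestor_at_level(term, level, parent=PARENT):
    """Return the ancestor of `term` at depth `level` (root = level 1), or None."""
    path = []
    while term is not None:
        path.append(term)
        term = parent[term]
    path.reverse()  # root first
    return path[level - 1] if len(path) >= level else None


def categories(locus, annotations, level):
    """High-level categories of a locus; {'U'} when no usable annotation exists."""
    cats = {ancestor_at_level(t, level) for t in annotations.get(locus, ())}
    cats.discard(None)
    return cats or {"U"}


def bin_pairs(pairs, annotations, level=2):
    """Count each pair once per (bait category, prey category) combination."""
    bins = Counter()
    for bait, prey in pairs:
        for cell in product(categories(bait, annotations, level),
                            categories(prey, annotations, level)):
            bins[cell] += 1
    return bins


# Toy annotations and pairs, invented for illustration only.
annotations = {
    "L1": ["protein kinase"],
    "L2": ["nucleic acid binding", "kinase"],  # two categories after roll-up
}
pairs = [("L1", "L2"), ("L1", "L3")]           # L3 has no annotation
bins = bin_pairs(pairs, annotations)
print(dict(bins))
```

In this toy case two pairs produce three bin counts, because locus L2 rolls up to two categories. This is the same effect described in the footnote to Figures 3a-c, where the sum over bins exceeds the number of interacting pairs.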
Figures 3b and 3c show the same protein interaction pairs as Figure 3a, categorized using high-level GO cellular component terms (Figure 3b) and biological process terms (Figure 3c).

Figure 3c. Observed interactions binned into pairs of GO annotation terms in the Biological Process category at level 4 in the GO hierarchy. The axis labels are: (AC) actin cytoskeleton reorganization, (CA) cell adhesion, (CC) cell cycle, (CD) cell death, (CM) cell motility, (CO) cell organization and biogenesis, (CP) cell proliferation, (CR) cell recognition, (CSh) cell shape and cell size control, (CSig) cell-cell signaling, (Circ) circulation, (Emb) embryogenesis and morphogenesis, (Exc) excretion, (GE) genetic exchange, (Hom) homeostasis, (LM) learning and memory, (MF) membrane fusion, (Met) metabolism, (Onco) oncogenesis, (Patho) pathogenesis, (Pr) pregnancy, (Rep) reproduction, (Resp) response to external stimulus, (SD) sex determination, (ST) signal transduction, (SMT) small molecule transport, (SR) stress response, (TB) telomere binding, (Tr) transport, and (U) biological process unknown.

Lastly, we investigated the topology of the protein-protein interaction subnetwork derived from our data. In Figure 4a (also in the inset), we plotted the distribution P(k) of network “node degrees” k, where, for a protein (node) in the network, k represents the number of immediate interaction partners. We showed that the node degree distribution exhibited a power-law dependence, i.e., $P(k) \propto k^{-\gamma}$ (with $\gamma \approx 1.7$). This distribution implies the existence of a large number of nodes with small node degrees, and a relatively slow decrease in the number of nodes with higher degrees. In contrast, as shown in Figure 4b, the P(k) distribution of a randomly constructed interaction network (built by randomly choosing interacting pairs from a pool of available proteins) followed a Poisson distribution.
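The node degree distribution and the log-log linear fit mentioned for the inset of Figure 4a can be sketched as below. This is a minimal illustration, not the S-Plus analysis used in the study: the function names are hypothetical, self-interactions are ignored by assumption, and an ordinary least-squares line through the (log k, log P(k)) points is only one simple way to estimate the exponent.

```python
import math
from collections import Counter


def node_degrees(pairs):
    """Degree k of each protein: its number of distinct interaction partners."""
    partners = {}
    for a, b in pairs:
        if a == b:
            continue  # assumption of this sketch: self-interactions are ignored
        partners.setdefault(a, set()).add(b)
        partners.setdefault(b, set()).add(a)
    return {node: len(p) for node, p in partners.items()}


def degree_distribution(degrees):
    """Relative frequency P(k) of each observed degree k."""
    counts = Counter(degrees.values())
    total = sum(counts.values())
    return {k: n / total for k, n in sorted(counts.items())}


def fit_gamma(p_of_k):
    """Least-squares slope of log10 P(k) against log10 k; returns gamma = -slope."""
    if len(p_of_k) < 2:
        raise ValueError("need at least two distinct degrees to fit a line")
    xs = [math.log10(k) for k in p_of_k]
    ys = [math.log10(p) for p in p_of_k.values()]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return -sxy / sxx


# Toy network, invented for illustration only.
toy_pairs = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C")]
toy_degrees = node_degrees(toy_pairs)
toy_p = degree_distribution(toy_degrees)

# Synthetic check: points lying exactly on k**-1.7 should recover gamma = 1.7.
synthetic = {k: k ** -1.7 for k in (1, 10, 100)}
gamma = fit_gamma(synthetic)
print(toy_p, round(gamma, 3))
```

The synthetic check only confirms that the fit recovers a known exponent; it says nothing about the observed network, whose exponent is the value reported in the text.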
The power-law P(k) dependence is the key signature of “scale-free” networks, a type of network to which the World Wide Web and many social networks belong, and which has been described only recently for certain metabolic networks and protein domain networks in biology [4, 13, 17]. A scale-free network topology implies the following properties: the network is neither completely modular (i.e., it cannot be separated into a set of independent subcomponents) nor completely random; it is highly robust against errors caused by disruption of a randomly chosen node, and yet it is highly vulnerable to perturbations of the small number of highly connected protein nodes known as “network hubs”.

Figure 4. Node degree distribution P(k) of (A) observed protein-protein interaction subnetwork (inset: the same distribution in log-log scale with best linear fit); (B) random network (see text). Panel titles: (A) Observed node degree distribution; (B) Node degree distribution of random network. Vertical axes: relative frequency, P(k); inset vertical axis: log P(k).

The “scale-free” network property of our data gave us two insights. First, since the power-law distribution P(k) has a long tail, with some nodes having a large k and potentially providing “network hub” functions, we could no longer treat the problem of “promiscuous interacting proteins” simply by setting a fixed threshold k0 and discarding all the proteins with k > k0. In fact, these few “network hubs” might provide important clues as to how different molecular signals are broadcast and dampened within the cell. Second, we also believe that the power-law distribution of our network implies a low system error rate (defined as the rate of pairing proteins randomly).
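Returning to the first insight, one alternative to discarding proteins above a fixed cutoff is to list highly connected baits and preys separately for manual review, which mirrors the vertical and horizontal lines seen in the heat map. The sketch below is illustrative only: the function names, the `min_partners` review threshold, and the toy pairs are hypothetical and are not values or procedures taken from the study.

```python
from collections import defaultdict


def partner_counts(pairs):
    """Count distinct preys per bait and distinct baits per prey."""
    preys_of = defaultdict(set)
    baits_of = defaultdict(set)
    for bait, prey in pairs:
        preys_of[bait].add(prey)
        baits_of[prey].add(bait)
    bait_counts = {b: len(ps) for b, ps in preys_of.items()}
    prey_counts = {p: len(bs) for p, bs in baits_of.items()}
    return bait_counts, prey_counts


def rank_candidates(counts, min_partners):
    """Proteins with at least `min_partners` partners, most connected first."""
    hits = [(name, n) for name, n in counts.items() if n >= min_partners]
    return sorted(hits, key=lambda item: (-item[1], item[0]))


# Toy bait-prey pairs, invented for illustration only.
toy_interactions = [
    ("b1", "p1"), ("b1", "p2"), ("b1", "p3"),
    ("b2", "p1"), ("b3", "p1"), ("b4", "p1"),
    ("b2", "p4"),
]
bait_counts, prey_counts = partner_counts(toy_interactions)

# Baits with many preys would appear as vertical lines (possible self activators);
# preys with many baits would appear as horizontal lines (possible sticky preys).
review_baits = rank_candidates(bait_counts, min_partners=3)
review_preys = rank_candidates(prey_counts, min_partners=3)
print(review_baits, review_preys)
```

Keeping the flagged proteins in a ranked review list, rather than deleting them, leaves room for cases like ARRB1, where a highly connected prey turned out to have biologically plausible partners.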
We plan to continue characterizing the human interactome data as more information accumulates. When combined with our data from tandem-affinity purification/mass spectrometry systems, and integrated with different types of genome, microarray, and disease pathway information, this data set will provide us with detailed protein function roadmaps.

Acknowledgement

We thank Dr. Christopher Martin for his invaluable critical comments during our manuscript preparation. We thank Dr. Manuel Rodriguez, Dr. Robert Hughes, and Alan James for their support throughout the project. We also thank Hsiao-kun Tu, Manjula Aliminati, Amit Phansalkar, Dr. Hisayoshi Zaima, and Mitsuhiro Kanazawa for their assistance in preparing the data.

References

[1] Myriad Proteomics WWW Site, http://www.myriadproteomics.com/. 2002.
[2] M. Ashburner, et al., "Gene ontology: tool for the unification of biology. The Gene Ontology Consortium", Nat. Genet., 25(1), 2000, pp. 25-9.
[3] D. Auerbach, et al., "The post-genomic era of interactive proteomics: facts and perspectives", Proteomics, 2(6), 2002, pp. 611-23.
[4] A.L. Barabasi, and R. Albert, "Emergence of scaling in random networks", Science, 286(5439), 1999, pp. 509-12.
[5] P. Bartel, and S. Fields, eds., The Yeast Two-Hybrid System, Advances in Molecular Biology, Oxford University Press, 1997.
[6] J.Y. Chen, and J.V. Carlis, "Genomic Data Modeling", Information Systems, 28(4), 2003, pp. 287-310.
[7] B.L. Drees, et al., "A protein interaction map for cell polarity development", J. Cell. Biol., 154(3), 2001, pp. 549-71.
[8] S.S. Ferguson, et al., "Molecular mechanisms of G protein-coupled receptor desensitization and resensitization", Life Sci., 62(17-18), 1998, pp. 1561-5.
[9] T. Ito, et al., "Toward a protein-protein interaction map of the budding yeast: A comprehensive system to examine two-hybrid interactions in all possible combinations between the yeast proteins", Proc. Natl. Acad. Sci. U S A, 97(3), 2000, pp. 1143-7.
[10] C.v. Mering, et al., "Comparative Assessment of Large-scale Data Sets of Protein-protein Interactions", Nature, 417, 2002, pp. 399-403.
[11] K.D. Pruitt, and D.R. Maglott, "RefSeq and LocusLink: NCBI gene-centered resources", Nucleic Acids Res., 29(1), 2001, pp. 137-40.
[12] J.C. Rain, et al., "The protein-protein interaction map of Helicobacter pylori", Nature, 409(6817), 2001, pp. 211-5.
[13] E. Ravasz, et al., "Hierarchical organization of modularity in metabolic networks", Science, 297(5586), 2002, pp. 1551-5.
[14] P. Uetz, et al., "A comprehensive analysis of protein-protein interactions in Saccharomyces cerevisiae", Nature, 403(6770), 2000, pp. 623-7.
[15] A. Valencia, and F. Pazos, "Computational methods for the prediction of protein interactions", Curr. Opin. Struct. Biol., 12(3), 2002, pp. 368-73.
[16] S.J. Wheelan, A. Marchler-Bauer, and S.H. Bryant, "Domain size distributions can predict domain boundaries", Bioinformatics, 16(7), 2000, pp. 613-8.
[17] S. Wuchty, "Scale-free behavior in protein domain networks", Mol. Biol. Evol., 18(9), 2001, pp. 1694-702.
[18] M.L. Yarmush, and A. Jayaraman, "Advances in proteomic technologies", Annu. Rev. Biomed. Eng., 4, 2002, pp. 349-73.