Data Mining of the Coffee Rust Genome
David Botero-Rozo, William Giraldo, Álvaro Gaitán, Marco Cristancho, Diego M. Riaño-Pachón & Silvia Restrepo
E-mail: do.botero29@uniandes.edu.co
URLs: http://lamfu.uniandes.edu.co/ http://bce.uniandes.edu.co/ http://bioinformatics.cenicafe.org/index.php/wiki
Nature Precedings: doi:10.1038/npre.2012.7034.1. Posted 27 Mar 2012

WHY DO WE STUDY COFFEE RUST?

Coffee leaf rust (caused by the fungus Hemileia vastatrix) is the most limiting disease wherever coffee is cultivated. In Colombia, coffee represents 16% of the country's agricultural GDP, and since 2008 a high incidence of coffee leaf rust in crops established with susceptible varieties has caused a significant reduction in yield. Genome and transcriptome sequencing and bioinformatics analysis tools have been applied to understand this organism, its interaction with the plant, and the environmental variations that result in epidemics.

WHAT DID WE DO?

We sampled isolates from field crops and collected eight different races of H. vastatrix. One isolate per race was sequenced with Illumina and 454 technologies. The data were subjected to preliminary quality control with FastQC and cleaned. After testing several assemblers with different combinations of data, we decided to do a hybrid assembly using the CLC assembler. The assembly was analyzed with MEGAN to evaluate the level of contamination and to obtain a first approximation of the biological communities associated with H. vastatrix on the coffee leaf. BLAST was used to search for H. vastatrix mitochondrial sequences by comparing our contigs against the Puccinia mitochondrial genome (Puccinia being the closest sequenced organism to H. vastatrix). The mitochondrial contigs identified were annotated with MAKER. Finally, using Trinity we assembled three transcriptomes from different races of H. vastatrix.

Figure 1. Data cleaning. The cleaning process is illustrated. Top: every cleaning step is shown. Middle and bottom: Illumina and 454 input and output reads, given in millions of reads and as a fraction of reads with reference to the last step. Note that the Illumina reads were not processed with the seq-clean tool (they are too short for low-complexity filtering), and the 454 reads were not subjected to duplicate read removal. Both data sets were subjected to low-complexity masking.

WHAT DID WE GET?

A total of 73 GB of NGS data was generated. An assembly of 396,264 contigs (N50 of 1,590 bp and mean length of 841 bp) was obtained; this assembly is highly fragmented. Nevertheless, we obtained very large contigs (the largest is 85 kb, with a coverage of 148x). After using MEGAN to filter out contigs from putative contaminants (coffee and bacterial sequences that could be present in the rust samples), we obtained 31,376 contigs that showed similarity to reported fungal sequences. Forty-four putative mitochondrial contigs were identified through BLAST homology to Puccinia genome sequences. We also assembled three transcriptomes, comprising 55,791 and 64,752 contigs for the two non-normalized libraries and 44,297 contigs for the normalized library.

MEGAN node labels and counts recovered from the taxonomy figure (reduced view): Cyanobacteria 2; Oscillatoriales 8; Firmicutes 6; Bacilli 5; Bacteria 6; Alphaproteobacteria 5; Proteobacteria 2; Gammaproteobacteria 7; Apicomplexa 9; Alveolata 1; Ciliophora 6; cellular organisms 311; Amoebozoa 1; root 199; Eukaryota 5918; Mycetozoa 21; Fungi/Metazoa group 2078; Fungi 7272; Metazoa 1578; Trichomonadida 15; stramenopiles 5.

This is the first approach to studying the genome of the causal agent of coffee rust, H. vastatrix. Further data mining will allow the identification of virulence and aggressiveness factors, which are important for characterizing new races of the pathogen and for detecting isolates that might infect resistant coffee varieties.
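The N50 and mean contig length reported for the assembly can be derived from nothing more than the list of contig lengths. The following is a minimal Python sketch of that calculation, not the authors' pipeline: the function names and the toy lengths are assumptions made for illustration, and the poster's own values were presumably reported by the assembler.

```python
def n50(lengths):
    """Return the N50 of a list of contig lengths.

    N50 is the length L such that contigs of length >= L together
    account for at least half of the total assembled bases.
    """
    if not lengths:
        raise ValueError("no contig lengths given")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


def mean_length(lengths):
    """Return the arithmetic mean of the contig lengths."""
    if not lengths:
        raise ValueError("no contig lengths given")
    return sum(lengths) / len(lengths)


# Toy data (hypothetical, not taken from the poster).
toy_lengths = [2, 2, 2, 3, 3, 4, 8, 8]
print(n50(toy_lengths), mean_length(toy_lengths))  # prints: 8 4.0
```

In a fragmented assembly with many short contigs the mean is pulled down by their number, while N50 weights contigs by the bases they contribute, which is why the two values are usually reported together.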
Figure 2. Assembly results. The results of five different assemblies are shown. The first and second hybrid assemblies differ in duplicate removal and low-complexity masking. In the third assembly we added two plates of 454 data. The last assembly shows that including 454 data reduces the number of contigs and increases the contig size. The read mapping results are shown: multi-hit reads, potential pairs, not paired and successful pairs.

Figure 4. Distribution of the number of reads in contigs. Most contigs have few reads. The assembly is very fragmented.

Figure 5. Distribution of contig size. Most contigs are very small.

Taxon labels recovered from the MEGAN expanded-view tree (Figure 8): Bacteria; Cyanobacteria; Firmicutes; Proteobacteria; Alveolata; Amoebozoa; Alternaria alternata; Aspergillus nidulans FGSC A4; Trichocomaceae; mitosporic Trichocomaceae; Penicillium; Eurotiomycetidae; Aspergillus; Penicillium chrysogenum Wisconsin 54-1255; Penicillium marneffei ATCC 18224; leotiomyceta; Pezizomycotina; Talaromyces stipitatus ATCC 10500; Onygenales; saccharomyceta; Ajellomyces capsulatus NAm1; Coccidioides immitis RS; Sclerotiniaceae; sordariomyceta; Botryotinia fuckeliana B05.10; Sclerotinia sclerotiorum; Sordariomycetidae; Magnaporthe oryzae 70-15; Podospora anserina.

Figure 7. Taxonomy distribution of data (MEGAN, reduced view). We ran BLAST with all contigs from the third assembly against the nr database. The results were loaded into MEGAN to filter out contaminants (Viridiplantae and Bacteria). We found that almost all contigs hit a fungal sequence, showing that the H. vastatrix sequence data had very little contamination from other organisms. Further node labels and counts recovered from the figure: Oomycetes 138; Chlorophyta 21; Viridiplantae 10; Streptophyta 1086; No hits 30489.
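The contaminant filtering described for Figure 7 amounts to partitioning contigs by the taxonomic lineage of their BLAST-based assignment. The sketch below shows that logic in plain Python under stated assumptions: it does not use MEGAN or any of its interfaces, and the input format (a dictionary mapping a contig identifier to its lineage as a list of taxon names), the function name and the contig identifiers are all hypothetical. The taxon names are those that appear in the figure labels.

```python
# Taxa treated as contaminants in the poster's Figure 7 caption.
CONTAMINANT_TAXA = {"Viridiplantae", "Bacteria"}


def split_by_taxonomy(assignments, contaminants=CONTAMINANT_TAXA, target="Fungi"):
    """Partition contig ids by lineage.

    assignments: dict mapping contig id -> list of taxon names on its
    lineage (hypothetical input format; an empty list means "no hits").
    Returns (kept, removed, other): contigs on the target lineage,
    contigs on a contaminant lineage, and everything else.
    """
    kept, removed, other = [], [], []
    for contig_id, lineage in assignments.items():
        taxa = set(lineage)
        if taxa & contaminants:
            removed.append(contig_id)
        elif target in taxa:
            kept.append(contig_id)
        else:
            other.append(contig_id)
    return kept, removed, other


# Hypothetical example assignments.
example = {
    "contig_1": ["root", "cellular organisms", "Eukaryota",
                 "Fungi/Metazoa group", "Fungi", "Dikarya",
                 "Basidiomycota", "Pucciniales"],
    "contig_2": ["root", "cellular organisms", "Eukaryota",
                 "Viridiplantae", "Streptophyta"],
    "contig_3": ["root", "cellular organisms", "Bacteria", "Proteobacteria"],
    "contig_4": [],  # no BLAST hit
}

kept, removed, other = split_by_taxonomy(example)
print(kept, removed, other)
```

Keeping contigs without a hit in a separate group, rather than discarding them with the contaminants, leaves open the possibility that they are H. vastatrix sequences with no close relative in the database.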
Figure 3. Contig size fraction of coverage. Coverage results for the last assembly. Low means coverage below 5x, Good is coverage between 5x and 45x, High is coverage between 45x and 100x, and Ups is coverage above 100x; these categories are quite arbitrary. The bins for contig size are 150 bp wide. The average coverage of a considerable fraction of the contigs in this assembly is acceptable (those with High and Good values in the figure).

Figure 6. Distribution of coverage per contig. Most contigs have a reasonable coverage value.

Figure 8. Taxonomy distribution of data (MEGAN, expanded view). Taxon labels recovered from the figure: Dikarya; cellular organisms; Fungi; Fungi/Metazoa group; root; Eukaryota; Ascobolus immersus; Saccharomycetales; Candida albicans SC5314; Saccharomyces cerevisiae S288c; Agaricomycetes incertae sedis; Phanerochaete chrysosporium RP-78; Postia placenta Mad-698-R; Agaricomycetes; Agaricomycotina; Moniliophthora perniciosa FA553; Agaricales; Coprinopsis cinerea okayama7#130; Schizophyllum commune H4-8; Laccaria bicolor S238N-H82; Basidiomycota; Filobasidiella; Cryptococcus neoformans var. neoformans; Filobasidiella depauperata; Cryptococcus neoformans var. neoformans B-3501A; Cryptococcus neoformans var. neoformans JEC21; Pucciniales; Melampsora lini; Phakopsora pachyrhizi; Uromyces viciae-fabae; Ustilago maydis 521; Orpinomyces sp. OUS1; Metazoa; Parabasalia; Phytophthora infestans T30-4; Viridiplantae; No hits.

Acknowledgment

We gratefully acknowledge funding from the Faculty of Sciences of the Universidad de los Andes through the program Funding Support for Assistant Graduate, and from the Ministerio de Agricultura and Colciencias, which funded the whole project.

References

Cantarel B.L., Korf I., Robb S.M.C., Parra G., Ross E., Moore B., Holt C., Sánchez Alvarado A. and Yandell M. MAKER: an easy-to-use annotation pipeline designed for emerging model organism genomes. Genome Res. 2008 January; 18(1): 188–196.

Gerencia técnica, Programa de investigación científica, Centro Nacional de Investigaciones del Café "Pedro Uribe Mejía". Recomendaciones para el manejo de la roya del cafeto en Colombia. Boletín técnico Nº 19, Cenicafé. 1999.

Huson D.H., Auch A.F., Qi J. and Schuster S.C. MEGAN analysis of metagenomic data. Genome Res. 2007 January; 17: 377–386.

Kushalappa A.C. and Eskes A.B. Coffee Rust: Epidemiology, Resistance and Management. 1989.

Schatz M.C., Delcher A.L. and Salzberg S.L. Assembly of large genomes using second-generation sequencing. Genome Res. May 27, 2010.