Abstract
Background
Recent technological developments have made genome sequencing and assembly highly accessible and widely used. However, the presence in sequenced organisms of certain genomic features such as high heterozygosity, polyploidy, aneuploidy, heterokaryosis, or extreme compositional biases can challenge current standard assembly procedures and result in highly fragmented assemblies. Hence, we hypothesized that genome databases must contain a nonnegligible fraction of low-quality assemblies that result from such type of intrinsic genomic factors.Findings
Here we present Karyon, a Python-based toolkit that uses raw sequencing data and de novo genome assembly to assess several parameters and generate informative plots to assist in the identification of nonchanonical genomic traits. Karyon includes automated de novo genome assembly and variant calling pipelines. We tested Karyon by diagnosing 35 highly fragmented publicly available assemblies from 19 different Mucorales (Fungi) species.Conclusions
Our results show that 10 (28.57%) of the assemblies presented signs of unusual genomic configurations, suggesting that these are common, at least for some lineages within the Fungi.Free full text
Karyon: a computational framework for the diagnosis of hybrids, aneuploids, and other nonstandard architectures in genome assemblies
Abstract
Background
Recent technological developments have made genome sequencing and assembly highly accessible and widely used. However, the presence in sequenced organisms of certain genomic features such as high heterozygosity, polyploidy, aneuploidy, heterokaryosis, or extreme compositional biases can challenge current standard assembly procedures and result in highly fragmented assemblies. Hence, we hypothesized that genome databases must contain a nonnegligible fraction of low-quality assemblies that result from such type of intrinsic genomic factors.
Findings
Here we present Karyon, a Python-based toolkit that uses raw sequencing data and de novo genome assembly to assess several parameters and generate informative plots to assist in the identification of nonchanonical genomic traits. Karyon includes automated de novo genome assembly and variant calling pipelines. We tested Karyon by diagnosing 35 highly fragmented publicly available assemblies from 19 different Mucorales (Fungi) species.
Conclusions
Our results show that 10 (28.57%) of the assemblies presented signs of unusual genomic configurations, suggesting that these are common, at least for some lineages within the Fungi.
Findings
• We present Karyon, a Python-based bioinformatic pipeline that integrates genome assembly and a series of structural analyses for the diagnosis of problematic genomic structures. Karyon is freely available in GitHub and as a docker container (https://github.com/Gabaldonlab/karyon).
• We applied Karyon to 35 highly fragmented, publicly available genome assemblies to identify putative undescribed deviations in genomic architecture that might have caused problems in a standard assembly process. From 35 assemblies, 10 presented features that suggested possible underlying biological factors as the likely cause of the observed assembly fragmentation. Even though our sample size is small and restricted to a single lineage (Mucoromycotina), our results suggest that the number of unreported deviations in genome architecture in Fungi is considerable. This is emphasized if we consider that most researchers that have produced low-quality assemblies are unlikely to publish their data.
Introduction
Recent developments in high-throughput sequencing and bioinformatic tools have made the process of sequencing the genome of a new organism a routine task for many laboratories, especially those working on groups with small compact genomes (prokaryotes, fungi, many parasitic lineages). The success of a genome assembly is limited by technical aspects as well as by intrinsic properties of the sequenced genome. A successful assembly depends on the quality, design, and depth of the sequencing libraries, which must typically adapt to budget limitations. Naturally, if the sequencing methodology or the computational approaches are inappropriate, the resulting assembly will be poor (i.e., highly fragmented, incomplete, or misassembled). However, additional difficulties might arise independently of the methodology employed, due to intrinsic properties of the genome that interfere with genome assembly algorithms.
Biological factors affecting genome assembly quality
The main intrinsic factors that compromise the success of a genome assembly are the genome size, the sequence heterozygosity, the abundance of low-complexity regions (i.e., highly repetitive sequences), and the presence of high or uneven ploidy, contaminating sequences or extreme nucleotide compositions (Fig. 1).
Genome size impacts computational costs, as many assembly algorithms scale nonlinearly [1–3]. Heterozygosity implies the existence of allelic differences within an individual. Standard assembly algorithms have difficulty in differentiating between highly heterozygous regions and distinct but highly similar genomic regions [4, 5]. This, in turn, results in fragmented assemblies with inflated size compared to empirical measurements, as many of these regions appear duplicated [5], often in short scaffolds. This is particularly problematic in the case of individuals or a population that derived from sexual recombination between 2 or more distinct phylogenetic lineages (Fig. 1A), as the component subgenomes often develop structural rearrangements after the split of the 2 parental lineages. Similarly, repetitive or low-complexity genomic regions (Fig. 1B) are difficult to resolve without the aid of expensive experimental approaches (e.g., genetic maps, bacterial artificial chromosomes, long read sequencing techniques, or chromosome conformation capture), particularly when they span large genomic regions. Duplicated regions introduce multiple possible solutions to the process of scaffolding, increasing assembly fragmentation and computational costs [3, 4].
Similarly, ploidy deviations can greatly affect genome assembly. The first possible ploidy deviation is polyploidy (Fig. 1C), which is the presence of more than 2 chromosomes for the majority of the genome. Polyploidy is generally associated with genome heterozygosity, as it increases the number of possible states per site [6, 7]. For a diploid site, only 2 states are possible: heterozygous or homozygous, depending on whether the 2 alleles are different or equal, respectively. For a triploid, however, there are 2 possible heterozygotic states (e.g., AAB and ABB), and differentiating between them depends on relative frequencies. Allele frequency is affected by stochastic variation, specially if depth of sequencing is low. Aneuploidy (Fig. 1D) tends to cause the same problems as polyploidy in assemblies, albeit with the effect being limited only to the aneuploid regions. Because of this, genes present in chromosomes with ploidy higher than 2 will have a higher likelihood of being unannotated. Animal and plant genomics have traditionally considered aneuploidies as rare events, due to their deleterious effects on many of these organisms, especially during embryonic development [8]. This paradigm is clearly false for many fungal [9–12]) and protist [13, 14] lineages. Eukaryotic genomics has only recently started to focus on pangenomes [15–19], but aneuploidies might be an important confounding factor for these studies. For example, genes located in aneuploid regions are more likely to be missed in annotations, which can inflate estimations of presence/absence variation.
In syncytial organisms, such as filamentous fungi or slime molds, there is the possibility of coexistence of genetically different populations of nuclei within a cytoplasmic continuum, a condition known as heterokaryosis (Fig. 1E) [20–22]. Heterokaryosis is functionally similar to ploidy, although with some important differences. First, the relative proportions between heterozygous sites do not necessarily adjust to a simple fraction, as often one population is more abundant that the other. Second, since nuclei divide independently from each other, mitotic or meiotic recombination should be rare. This independence implies that any relative chromosomal rearrangements (i.e., duplications, deletions, translocations, and inversions) between the 2 nuclear populations, either pre- or postunion, would remain in nuclear populations for long periods of time. These rearrangements introduce the aforementioned complications in genome assemblies, and some of these might be difficult to differentiate from other chromosomal aberrations. A similar phenomenon is chimerism (Fig. 1F), in which the body of an organism is composed by 2 or more populations of genetically distinct cells. Certain lineages, especially colonial species, might arise by fusion of several genetically distinct individuals [23], but very little is known regarding the effect of chimerism in genome assemblies.
The presence of sequence contamination (Fig. 1G) can greatly compromise the quality of the genome assembly [24–28]. Extraneous sequences introduce noise, create chimeric contigs, and might introduce errors in k-mer estimations. Highly diverse contaminations (e.g., from the gut microbiota) introduce sequences with highly variable level of coverage, heterozygosity, and composition. On the other hand, highly abundant contaminants (e.g., symbiotic bacteria) are typically more homogeneous in all these parameters but might still form chimeric contigs and would indirectly reduce the depth of coverage in the main genome. Contaminations reducing the signal of the main genome are particularly problematic for single-cell sequencing projects [29, 30]. This is normally prevented by methodological means, but contaminating sequences are intrinsic for certain samples or even organisms, such as the case of symbiotic organisms (e.g., Lichens).
Finally, genomes with extreme compositions, typically very high or low GC content (GC%), can be difficult to assemble (Fig. 1H). For these genomes, the information contained by any AT positions is different from the information contained by a GC, as k-mers composed of the favored nucleotide pair will appear at higher frequencies. GC% has a well-documented effect on some sequencing technologies, most notably on the quality of Illumina reads [31, 32]. Fortunately, GC% is easy to measure from raw reads, and some genome assemblers include options specially adapted for these cases [33, 34]. Low GC% is typically associated with high abundance of low-complexity regions and transposable elements, but extreme GC% is also a hallmark of certain lineages, such as several groups of early diverging Fungi [35]. Despite their effects in genome analyses, GC% in eukaryotic genomes is often ignored. For example, neither NCBI nor Mycocosm report GC% in their assembly information statistics, unlike the Genome Online Database, which has a greater focus on prokaryotic sequences.
If the presence of the factors outlined above is anticipated, specific technical approaches—both experimental and computational— can be used. Contaminating DNA can be identified easily because sequencing coverage, nucleotide composition, and phylogenetic signal are usually different from the main genome, and several programs have been developed to identify contamination [24–28]. Ploidy can be estimated with cytogenetic techniques, which has been used for animals and plants since the 19th century. Unfortunately, cytogenetic techniques are time-consuming and difficult to interpret for some groups, such as the Fungi. Computational approaches exist to estimate composition and ploidy from sequencing reads [36–38]. Similarly, hybridization can be detected based on phenotypic traits (intermediate phenotypes and hybrid vigor). Again, this is not feasible for most microbial eukaryotes due to the lack of easily identifiable phenotypes. Genomes of hybrid organisms are heterozygous, and some genome assembly software has been designed to be able to handle this situation [5, 39, 40], but proper identification of hybrid lineages cannot be done without adequate population and phylogenetic analyses.
Thus, biological factors affecting genome assembly quality increase the overall costs of a project and require expertise that might not be available. Given the difficulty of performing analyses on low-quality assemblies, it is likely that published genomes are biased in favor of organisms with genomic characteristics that make them easier to work with. In contrast, genomic projects that choose organisms with nonstandard genomic architectures are more likely to suffer methodological obstacles that delay or even prevent analyses. Our inability to work around nonstandard genomic architectures distorts our perception of biological phenomena, relegating them to mere oddities.
Results
The Karyon toolkit
To aid in the identification of these noncanonical genomic architectures, we developed Karyon, a Python-based toolkit that assesses several parameters of sequencing data and their derived assemblies that are common indicators of different intrinsic genomic features leading to poor assemblies. Karyon comprises different modules that can be used independently or sequentially. Karyon is written in Python 3 and freely available to download as a Docker build or as a standalone project in https://github.com/Gabaldonlab/karyon.
Karyon integrates Trimmomatic [41] as an optional step to eliminate low-quality positions and adapters from sequencing reads. It then uses that input to generate a de novo assembly using SPAdes v3.9.0 [33], dipSPAdes v3.9.0 [40], Platanus v1.2.4 [39], or SOAPdenovo2 v2.04-r240 [42]. Karyon then uses the de novo assembly to generate a reduced assembly using Redundans [5]. Redundans is a pipeline that collapses assembly fragments with high similarity to create an artificial haploid genome assembly. This assembly is then used as reference to map the original sequencing reads using BWA-MEM [43] and generate a variant calling file with GATK v4.1.9.0 [44]. A battery of analyses is then performed on the sequencing libraries, the assemblies, and the maps of coverage and genetic variation to generate plots that will aid in the diagnosis of the genomic structure. Fig. 2 summarizes the pipeline.
Karyon uses the K-mer analysis toolkit (KAT) [36] to provide a k-mer (all possible sequences of length k) spectrum analysis as part of its report. From this analysis, it produces frequency histograms representing coverage versus k-mer counts. These plots inform on ploidy and heterozygosity of a genome. In a haploid genome, k-mers of enough size will appear either 1 or 0 times, with unique k-mers having an average coverage roughly equal to the average global depth of coverage. Deviations from these patterns suggest alternative architectures. For instance, the presence of 2 peaks in the k-mer plot typically indicates a genome that is totally or partially a nonhomozygous diploid. To complement these analyses and provide further information on the features of the genome, Karyon assesses scaffold length distributions, relationships between scaffold length and coverage, sliding-window analysis of coverage, and genetic variation, as well as allele frequency distributions per scaffold (Fig. 2). In addition, Karyon uses nQuire [38] to estimate the likelihood of different ploidy levels in sliding windows per scaffold. Karyon also incorporates BUSCO completeness analysis [45] with automatic taxonomic assignment. Altogether, the interpretation of these analyses can be used to detect polyploidies, aneuploidies, hybridizations, heterokaryosis, large segmental duplications, unusual DNA composition, or the presence of symbiont or contaminating sequences. Karyon generates a report file that summarizes the results of these analyses and raises some warning messages in case certain metrics are problematic, such as low BUSCO completeness, low percentage of mapped reads, or extreme GC% values.
Karyon generates a series of original plots that aim to provide valuable information regarding the architecture of the problem assembly:
Scaffold length plots (Fig. 3A). These plots represent the distribution of scaffold length through a barplot, where each value represents a single scaffold versus its length, with all scaffolds sorted from shortest to longest. Karyon generates the results in linear and logarithmic distribution. Very short scaffolds (shorter than 1 Kbp) might introduce noise in the analyses and might be interesting to just filter them out.
Scaffold versus coverage (Fig. 3B). Karyon generates a scatterplot representing the average coverage versus length for each scaffold in the assembly. Quite often, short scaffolds have different coverage from most of the genome, which might be indicative of contamination or repetitive regions.
Variation versus coverage plot (Fig. 3C). This plot allows the user to observe overall patterns across the whole genome. It uses a kernel density estimation over a cloud of dots. Each dot represents the number of single-nucleotide polymorphisms (SNPs) (x-axis) versus the average coverage (y-axis) in a window of the genome, typically 1 Kb. Presence of more than 1 population of dots is indicative of genomes with dual behavior, such as aneuploidies or loss of heterozygosity.
Fair coin plot (Fig. 3D). This plot represents the proportion of alternative versus reference SNPs for the whole genome and for each individual scaffold. Vertical lines indicate expected frequencies of 0.5, 0.33, and 0.25, corresponding to ideal diploids, triploids, and tetraploids, respectively. An expected frequency is drawn, which is based on a per-site simulation of proportions assuming ideal 0.5 relative frequencies and random sampling equal to the coverage of the site. The plot is generated for the whole genome, as well as per scaffold.
nQuire per scaffold plot (Fig. 3E). Karyon will run nQuire across sliding windows of defined length (by default 1 Kbp) across different scaffolds. The plot for each scaffold contains 5 subplots. The first 3 represent the nQuire score for diploid, triploid, and tetraploid for that particular window. This allows the user to visualize patterns of aneuploidy per scaffold, especially with regards to diploid and triploid regions. The fourth and fifth subplots represent the location and coverage of SNPs across the scaffold. A color code is assigned to represent the density. Since nQuire requires information of SNPs, homozygous regions cannot be assessed and will appear as missing data.
Each of the steps is optional and can be controlled with flags in the main script. Additionally, the script uses a configuration file, which allows to define the options of each dependency program. This configuration file is automatically created during the installation and can be modified with any text editor. We encourage the user to make a copy of the original configuration file for future modification. Installation is fully automated, requiring no user input during the process.
Genomic survey in the Mucorales (Fungi)
To showcase the use of Karyon, we undertook an analysis of deposited fungal genomes in the order Mucorales. Fungi are in a particularly privileged position to assess the impact of noncanonical genomic architectures in genome assemblies. Fungi generally have small and compact genomes and can be often cultured under axenic conditions. As a result, the amount of sequenced fungal genomes is now in the order of thousands, including multiple strains for many species. Even more, comprehensive efforts to obtain a balanced coverage of the existing fungal diversity are ongoing, such as the 1,000 fungal genomes [46] and the 1,000 yeast genomes initiatives [47–50]. Thus, fungi provide an excellent system to study the incidence of different genomic accidents in evolution [10, 51, 52]. Despite this, the quality of fungal genomes is often suboptimal, and databases are rife with highly fragmented assemblies. Genomic factors such as those discussed above might complicate genome assembly and be responsible for this observed fragmentation, at least partially. Considering this, we hypothesized that genome databases must contain a fraction of low-quality assemblies from fungal organisms that are caused by intrinsic genomic factors. If that is true, reanalysis of the raw data should lead us to describe novel genomic accidents and obtain a minimum estimate of their relative abundance.
We thus applied Karyon to a set of 35 publicly deposited genomes from the fungal order Mucorales. Our results suggest that nonstandard genomic organizations are not rare and that future studies on other groups are likely to uncover many new cases. We selected the order Mucorales because this group comprises several described examples of whole-genome duplication, both ancient and recent [53, 54]. Many sequenced members of the clade come from clinical samples, an environment that is known to promote the emergence of different genomic accidents [52, 55, 56]. Additionally, several represented species included 2 or more sequenced isolates, allowing to get a glimpse at their intraspecific diversity. We obtained 35 genome assemblies from 19 different Mucorales species deposited in GenBank between 1 January 2005 and 31 December 2015 (Table 1). For 4 of the species, dipSPADes was unable to generate an assembly.
Table 1:
Species | NCBI genome size (Mbp) | NCBI number of scaffolds | GenBank accession | Genome size after Karyon (Mbp) | Number of scaffolds after Karyon | Diagnosis | Reference |
---|---|---|---|---|---|---|---|
Rhizopus microsporus ATCC 62 417 | 49.6 | 1,386 | GCA_900000135.1 | 40.1 | 5,521 | Contamination | [57] |
Rhizopus microsporus CBS_344.29 | 49.2 | 1,554 | GCA_000825725.1 | 32.1 | 3,037 | Contamination | [57] |
Rhizopus microsporus B9738 | 75.1 | 5,266 | GCA_000697275.1 | 71.6 | 12,789 | Misidentification | [58] |
Rhizopus microsporus var. rhizopodiformis B7455 | 48.7 | 4,658 | GCA_000738565.1 | 21.8 | 2,176 | Contamination | [58] |
Rhizopus delemar type I NRRL 21789 | 42.0 | 3,921 | GCA_000697155.1 | 33.4 | 4,824 | Unknown | [58] |
Rhizopus delemar type II NRRL 21446 | 38.9 | 1,156 | GCA_000738605.1 | 33.7 | 5,071 | Unknown | [58] |
Rhizopus delemar type II NRRL 21447 | 38.7 | 1,177 | GCA_000738595.1 | 28.9 | 6,683 | Unknown | [58] |
Rhizopus delemar type II NRRL 21477 | 40.8 | 1,808 | GCA_000738585.1 | None | None | Unknown | [58] |
Rhizopus oryzae 99–892 | 39.1 | 1,168 | GCA_000697725.1 | 29.6 | 1,875 | Low GC% (below 35%) | [58] |
Rhizopus oryzae HUMC02 | 40.3 | 2,313 | GCA_000697605.1 | None | None | Unknown | [58] |
Rhizopus oryzae B7407 | 43.3 | 4,683 | GCA_000696915.1 | 34.7 | 3,720 | Low GC% (below 35%) | [58] |
Rhizopus oryzae type I NRRL 13440 | 43.4 | 5,022 | GCA_000697075.1 | None | None | Unknown | [58] |
Rhizopus oryzae type I NRRL 18148 | 47.5 | 14,653 | GCA_000697095.1 | None | None | Unknown | [58] |
Rhizopus oryzae type I NRRL 21396 | 42.8 | 4,445 | GCA_000697115.1 | 34.2 | 4,115 | Unknown | [58] |
Rhizopus oryzae 99–133 | 41.5 | 4,317 | GCA_000697135.1 | 27.2 | 1,332 | Unknown | [58] |
Rhizopus oryzae 97–1192 | 42.9 | 4,566 | GCA_000697195.1 | None | None | Unknown | [58] |
Rhizopus stolonifer B9770 | 38 | 5,567 | GCA_000697035.1 | 30.1 | 6,406 | Unknown | [58] |
Mucor circinelloides B8987 | 36.7 | 2,210 | GCA_000696935.1 | 29.9 | 4,864 | Unknown | [58] |
Mucor indicus B7402 | 39.8 | 3,117 | GCA_000697295.1 | 32.1 | 691 | Unknown | [58] |
Mucor racemosus B9645 | 65.5 | 6,360 | GCA_000697255.1 | 46.0 | 4,444 | Hemidiploid, low GC% (below 35%) | [58] |
Mucor velutinosus B5328 | 35.9 | 2,411 | GCA_000696895.1 | 28.2 | 2,743 | Unknown | [58] |
Lichtheimia corymbifera 008–049 | 36.6 | 1,629 | GCA_000697175.1 | 42.8 | 3,575 | Unknown | [58] |
Lichtheimia corymbifera B2541 | 36.6 | 1,176 | GCA_000697475.1 | 13.2 | 3,575 | Unknown | [58] |
Lichtheimia ramosa B5399 | 45.6 | 3,968 | GCA_000738555.1 | 26.6 | 861 | Aneuploid, hybrid | [58] |
Saksenaea oblongisporus B3353 | 40.8 | 1,702 | GCA_000697495.1 | 29.7 | 622 | Unknown | [58] |
Saksenaea vasiformis B4078 | 42.5 | 2,417 | GCA_000697055.1 | 32.7 | 1,506 | Unknown | [58] |
Cokeromyces recurvatus B5483 | 29.3 | 2,637 | GCA_000697235.1 | 26.6 | 5,213 | Low GC% (below 35%) | [58] |
Syncephalastrum monosporum B8922 | 29.6 | 1,284 | GCA_000697355.1 | 24.1 | 5,271 | Unknown | [58] |
Syncephalastrum racemosum B6101 | 29.6 | 1,035 | GCA_000696955.1 | 23.3 | 311 | Unknown | [58] |
Cunninghamella elegans B9769 | 31.7 | 1,380 | GCA_000697015.1 | 30.8 | 5,465 | Low GC% (below 35%) | [58] |
Apophysomyces elegans B7760 | 38.5 | 1,528 | GCA_000696995.1 | 29.3 | 1,293 | Unknown | [58] |
Apophysomyces trapeziformis B9324 | 35.8 | 1,400 | GCA_000696975.1 | 30.1 | 898 | Unknown | [58] |
Thermomucor indicae-seudaticae HACC 243 | 29.6 | 1,958 | GCA_000787465.1 | 25.7 | 4,118 | Unknown | Busk et al., unpublished. Genome submitted in 2014. |
Parasitella parasitica CBS 412.66 isolate NGI315 ade- mutant | 44.9 | 15,637 | GCA_000938895.1 | 23.5 | 3,295 | Unknown | [59] |
Karyon was run using the complete default pipeline. Most of the analyzed genomes (27, 79.4%) presented very low levels of heterozygosity and a relatively homogeneous coverage across the genome, suggesting that those strains are haploid or, if presenting higher ploidy, extremely homozygous. Fragmentation in these cases might be caused by insufficient coverage, presence of repetitive regions, or some other methodological constraints. However, our pipeline uncovered cases that produced anomalous results in the different Karyon tests. Many zygomycetous fungi exhibit low or very low GC%. In our dataset, 5 species showed GC% below our threshold of 35%, with several others approximating that value. Additionally, some of the analyzed genomes show signs of ploidy anomalies or contamination. Below we describe these cases and propose a plausible scenario to explain each of the obtained results based on the data obtained from the Karyon pipeline.
Rhizopus microsporus species complex
At the time of this study, the NCBI database had deposited sequences for 8 Rhizopus microsporus strains. Interestingly, 3 of them presented a genome size estimated around 25 Mbp, 4 had a genome size close to 50 Mbp, and 1 presented a genome size of 75 Mbp. Only the 3 strains with a genome size of 25 Mbp had sufficiently good assemblies considering they were based on short reads, with a scaffold number below 1,000, and thus were not selected for further analyses. Additionally, the raw libraries for one of the strains presenting a 50-Mbp genome assembly size (R. microsporus var. chinensis CCTCC M201021) were not publicly available and thus could not be part of the survey. For the remaining 3 strains with genome size close to 50 Mbp (ATCC62417, CBS344.29, and var rhizopodiformis B7455), our de novo assembly pipeline recovered a genome size of approximately 40 Mb, which is smaller than the assemblies deposited in NCBI (Table 1). The heterozygosity distribution in these assemblies shows that most of the genome presented a relatively uniform behavior with low heterozygosity. In all 3 cases, though, a considerable proportion of the genome appears with a highly variable coverage and increased heterozygosity (Fig. 4). For these 3 strains, BlobTools [25] shows widespread bacterial contamination (Fig. 4B), and thus we conclude that contamination might be responsible for the observed assembly fragmentation.
The remaining strain, B9738, showed a surprisingly large genome size in both the assembly deposited in NCBI (75 Mbp) and the one reconstructed here (71 Mbp). The genome of R. microsporus B9738 presents an extremely low level of heterozygosity and a very homogeneous coverage. The k-mer spectrum also shows just 1 very clear peak. All in all, all this suggests that B9738 is haploid (or a highly homozygous diploid), despite presenting a 3-fold increase in genome size as compared to other strains of the same species (Fig. 5). Augustus gene prediction returned a total of 21,300 gene models, which is an unusually large number for a filamentous fungus. As a reference, the 7 genomes in the Rhizopodaceae, to which Rhizopus belongs, available in Mycocosm range from 25 to 46 Mbp and from 10,781 to 17,676 annotated genes. Contamination analysis does not suggest the presence of widespread contamination that could explain such an overinflated genome (Fig. 5). For this reason, we suggest that B9738 might be a misidentified strain that does not belong to the R. microsporus species complex. Indeed, phylogenomic analyses recover B9738 as sister to a clade containing Mucor and Parasitella, rather than allied with the rest of the R. microsporus species clade (Fig. 6), thus supporting a misidentification. It is noteworthy that no sequenced species of either Mucor or Parasitella have genomes above 49 Mbp or with more than 15,000 genes, at least from the available genomes in Mycocosm.
Mucor racemosus B9645
Analyses on Mucor racemosus B9645 depicted a genome with a dual behavior. The distribution of heterozygosity and coverage showed 2 peaks with very low heterozygosity but with different coverage (Fig. 7B). This was further confirmed by the k-mer spectrum analysis, which revealed 2 clear peaks (Fig. 7A). The genome available in NCBI is 65.5 Mbp long, noticeably larger than the 45.9 Mbp we recovered in our analyses (Table 1). The reduction step of Redundans cannot explain this difference, as the assembly size prior to this step is already 46.8 Mbp, very close to the final result. Our analyses suggest that contaminating sequences are very minor and do not explain the observed pattern (Fig. 7). We hypothesize that M. racemosus B9645 is a hemidiploid, which presents a portion of its genome in a haploid state and the other portion in a highly homozygous diploid state. Due to the low heterozygosity exhibited by this strain, the observed genome architecture might have arisen by either autopolyploidization followed by chromosome loss or by chromosomal duplications. Additionally, GC% for this species was only 32.6%.
Lichtheimia ramosa B5399
The Karyon assembly for this genome was only 26.6 Mbp, much smaller than the NCBI assembly (45.6 Mpb long, Table 1). Unlike other genomes, our assembly presented a considerable improved quality, going from 3,968 scaffolds and N50 of 33,650 in the NCBI assembly to 861 scaffolds and N50 of 133,635 in our own assembly. L. ramosa presents a heterozygosity level around 3% in its diploid peak (Fig. 8). All considered, we propose that L. ramosa B5399 is a mix of haploid and diploid with high heterozygosity, likely resulting from mating between 2 distantly related strains followed by genomic aneuploidization.
Methods
Sequencing data
We downloaded raw data from libraries deposited at the Short Read Archive (SRA) of those species in the Mucorales with a highly fragmented assembly (>1,000 scaffolds), which included at least 1 paired-end Illumina library larger than 1 Gb after quality filtering (Table 1), to ensure at least a decent coverage. Since most of our genomes have typical assembly sizes around 40 Mbp, this measure ensures a bare minimum average coverage of 20. All available sequencing libraries were used for all the analyses.
De novo gene annotation
We used Augustus v3.1.0. [60] to obtain a de novo gene prediction using the Rhizopus oryzae generalized hidden Markov model included in the default installation of Augustus.
Contamination detection
For each of the conflictive assemblies, we generated an Augustus prediction. Then, we used Blastp [61] to query the whole proteome against Uniref100 [62]. Since the genomes come from public databases, their own proteins should appear as hits, and thus we retrieved the 10 best hits. We have used these hits to assign a taxonomic profile. Additionally, we have used the predicted Augustus CDS to map sequencing reads with GATK. With both the taxonomic profile and the variant calling file, we have run BlobTools [25] to identify the presence of widespread contamination in the sequencing libraries.
Phylogenomic analyses
In order to identify the phylogenetic position of R. microsporus B9738, we used the Augustus gene prediction and the proteome of 24 other zygomycetes to run OrthoFinder v.2.3.3 [63] with the flags -S blast and -m msa.
Discussion
As genome sequencing has moved away from model organisms, it has become apparent that many possible genomic architectures are possible, and many do exist in a wide range of organisms. Most of these genomic accidents are difficult to identify from sequencing data alone. As far as we know, Karyon is the first software developed with the intention of performing reference-free analyses for the presence of a wide array of genomic factors affecting the quality of de novo genome assembly. We have designed this software to be easy to install and use, with the possibility of installation from both GitHub and Docker.
Despite the success in the implemented strategy, we consider our software has several limitations. Karyon requires an assembly step and variant calling protocol, for which some default options are included. However, the included programs might not suit every need. For example, extremely large genomes might require alternative assemblers that are not included in our pipeline, or some users might prefer a different set of programs for the variant calling protocol. For those cases, Karyon can still be used as independent steps (Fig. 2). Karyon is designed to work without any preexisting data, which limits the information it can predict. Comparing different genome assemblies, especially if at least one of them has good quality, can help detect many of these alternative genomic architectures and some others that are outside the capabilities of Karyon. If other reference genomes are available, tools like QUAST [64] can generate similar analyses to Karyon with higher accuracy and speed.
Despite the increasing use of long-read technologies for assembly purposes, a large amount of genome assemblies available in public databases have been generated exclusively from short reads. As of October 2021, NCBI SRA contains 173,087 DNA libraries for Fungi, of which 157,144 are Illumina short reads, and only 6,163 are long reads (4,700 PacBio and 1,463 Nanopore). At this moment, the pipeline assumes the use of at least 1 Illumina paired-end sequencing library. Because of this, we recommend the use of other genome assemblers if other sequencing technologies (i.e., Nanopore or PacBio long reads) are to be used, and the same goes for variant calling protocols.
We provided a practical example of the usage of Karyon on a publicly available set of fungal genomes from the order Mucorales. While most analyzed assemblies show no sign of any of the considered biological conditions, we were able to effectively find underlying nonstandard genomic architectures that had been previously unnoticed in these assemblies. These results suggest that many authors do not take into consideration these kinds of genomic accidents, which in turn greatly hampers the results that might be obtained from them.
How common are these nonstandard genomic architectures? Our results suggest that they might be quite abundant, although so far they are restricted to a limited selection of species within a narrow clade of Fungi. As such, these genomic anomalies might, or might not, be common in other lineages. However, we consider that there are 3 important arguments in favor for considering our dataset an underestimation of the abundance of unorthodox fungal genomes, even within the limited taxonomic range we have selected. The first one is the fact that fungal biomass used for DNA extraction and subsequent sequencing typically comes from cultures. This implies an important ecological step in which the fungus grows at optimal speed and in the absence of most stressors. Aneuploidies, polyploidies, and other similar genomic rearrangements are common in the presence of stressors [9, 10, 52, 65] but seem to be outcompeted by euploid cells under optimal growth conditions [24, 66, 67]. Hence, isolates growing in rich medium will be selected to lose most chromosomal aberrations they might present. Analogously, many of these chromosomal aberrations might exist in nature but are unable to grow on optimal medium. The advance of environmental sequencing and single-cell-based technologies might cast some light in this matter in coming years. Supporting this argument, Ahrendt et al. [68] sequenced several environmental isolates of zoosporic and zygomycetous microfungi using these techniques and found several aneuploids and polyploids. The frequency of unconventional genomic architectures is very likely lineage dependent. While some of these are well known, such as the dikaryotic phase in Agaricomycetes or the macro- and micronuclei of ciliates, strange genomic architectures might be common in more obscure lineages. This not only represents a yet-to-know facet of the biology of these organisms but could also potentially complicate their study. The third factor to consider is purely human. The datasets we have analyzed were uploaded by researchers who considered they were good enough to be uploaded to a public repository. Thus, it is to be expected that many more low-quality assemblies would have never been never deposited and sit forgotten in the disks of laboratory computers, if not discarded completely.
Even if we consider these possible biases as negligible, our results recover a significant fraction of publicly available genomes with unorthodox genomic configurations. These have been correlated in many fungal groups with adaptation to novel environments [69–71], resistance to antifungals [72, 73], pathogenic capabilities toward both animals [55, 74–76] and plants [77, 78], and adaptation to industrial settings [47, 79–83]. Beyond that, contamination in sequencing libraries is a problem that can affect any assembly project and might mislead downstream inferences if left unaddressed. Validation of published results goes far beyond the interest of discovering overlooked findings. Comparative genomic studies are limited in their scope and reliability by the quality of assembly and annotation of the genomes, factors that can be greatly compromised by these biological factors. Comparative studies commonly require the use of flagship genomes that represent a given taxon. Often, this generates a chronology of comparisons versus the reference that shapes the perspective on the group. As such, artifacts and errors in strategic genome assemblies, such as reference strains or strains in groups with few represented species, might have a domino effect, impacting future studies. Long-read sequencing technologies, which are increasingly being used for genome assembly projects, hold the promise of providing much more information that could be used to resolve many of these unorthodox genomic architectures. However, these approaches require novel computational approaches to fully employ their potential.
Availability of supporting source code and requirements
Project name: Karyon
Project homepage: https://github.com/Gabaldonlab/karyon
Operating system(s): e.g., Linux, any with the Docker image
Programming language: Python, Bash
Other requirements: Check the installation
License: GNU GPL v3
RRID: SCR_022544
biotools ID: karyon
Data Availability
All data used for this study were downloaded from NCBI SRA. Table 1 contains accession numbers for all the data. An archival copy of the code and test data is available via the GigaScience database GigaDB [84].
Abbreviations
BUSCO: Benchmarking Universal Single-Copy Orthologs; NCBI: The National Center for Biotechnology Information; SNP: single-nucleotide polymorphism; SRA: Short Read Archive.
Funding
Supported by the Spanish Ministry of Science and Innovation (grant PGC2018-099921-B-I00), cofounded by the European Regional Development Fund (ERDF); the Catalan Research Agency (AGAUR), SGR423; the European Union's Horizon 2020 research and innovation program (ERC-2016–724173); the Gordon and Betty Moore Foundation (grant GBMF9742); and the Instituto de Salud Carlos III (IMPACT grant IMP/00019 and CIBERINFEC CB21/13/00061-ISCIII-SGEFI/ERDF).
Supplementary Material
giac088_GIGA-D-21-00155_Original_Submission
giac088_GIGA-D-21-00155_Revision_1
giac088_GIGA-D-21-00155_Revision_2
giac088_GIGA-D-21-00155_Revision_3
giac088_GIGA-D-21-00155_Revision_4
giac088_Response_to_Reviewer_Comments_Original_Submission
giac088_Response_to_Reviewer_Comments_Revision_1
giac088_Response_to_Reviewer_Comments_Revision_2
giac088_Response_to_Reviewer_Comments_Revision_3
giac088_Reviewer_1_Report_Original_Submission
Zhong Wang -- 6/29/2021 Reviewed
giac088_Reviewer_1_Report_Revision_1
Zhong Wang -- 12/2/2021 Reviewed
giac088_Reviewer_2_Report_Original_Submission
Michael F. Seidl -- 7/4/2021 Reviewed
giac088_Reviewer_2_Report_Revision_1
Michael F. Seidl -- 11/29/2021 Reviewed
giac088_Reviewer_3_Report_Original_Submission
Kamil S. Jaron -- 7/10/2021 Reviewed
giac088_Reviewer_3_Report_Revision_1
Kamil S. Jaron -- 1/10/202 Reviewed
giac088_Reviewer_3_Report_Revision_2
Kamil S. Jaron -- 5/17/2022 Reviewed
Contributor Information
Miguel A Naranjo-Ortiz, Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Barcelona 08003, Spain. Health and Life Sciences, Universitat Pompeu Fabra (UPF), Barcelona 08003, Spain. Biology Department, Clark University, Worcester, MA 01610, USA. Naturhistoriskmuseum, University of Oslo, Oslo 0562, Norway.
Manu Molina, Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Barcelona 08003, Spain. Health and Life Sciences, Universitat Pompeu Fabra (UPF), Barcelona 08003, Spain. Life Sciences Department, Barcelona Supercomputing Centre (BSC-CNS), Barcelona 08034, Spain.
Diego Fuentes, Life Sciences Department, Barcelona Supercomputing Centre (BSC-CNS), Barcelona 08034, Spain. Institute for Research in Biomedicine (IRB Barcelona), The Barcelona Institute of Science and Technology, Barcelona 08028, Spain.
Verónica Mixão, Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Barcelona 08003, Spain. Health and Life Sciences, Universitat Pompeu Fabra (UPF), Barcelona 08003, Spain. Life Sciences Department, Barcelona Supercomputing Centre (BSC-CNS), Barcelona 08034, Spain. Institute for Research in Biomedicine (IRB Barcelona), The Barcelona Institute of Science and Technology, Barcelona 08028, Spain.
Toni Gabaldón, Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Barcelona 08003, Spain. Health and Life Sciences, Universitat Pompeu Fabra (UPF), Barcelona 08003, Spain. Life Sciences Department, Barcelona Supercomputing Centre (BSC-CNS), Barcelona 08034, Spain. Institute for Research in Biomedicine (IRB Barcelona), The Barcelona Institute of Science and Technology, Barcelona 08028, Spain. ICREA, Pg. Lluís Companys 23, Barcelona 08010, Spain. Centro de Investigación Biomédica en Red de Enfermedades Infecciosas, Barcelona 28029, Spain.
References
Articles from GigaScience are provided here courtesy of Oxford University Press
Full text links
Read article at publisher's site: https://doi.org/10.1093/gigascience/giac088
Read article for free, from open access legal sources, via Unpaywall: https://academic.oup.com/gigascience/article-pdf/doi/10.1093/gigascience/giac088/46362307/giac088.pdf
Citations & impact
Impact metrics
Alternative metrics
Discover the attention surrounding your research
https://www.altmetric.com/details/136901647
Article citations
Karyon: a computational framework for the diagnosis of hybrids, aneuploids, and other nonstandard architectures in genome assemblies.
Gigascience, 11:giac088, 01 Oct 2022
Cited by: 1 article | PMID: 36205401 | PMCID: PMC9540331
Data
Data behind the article
This data has been text mined from the article, or deposited into data resources.
Data Citations
- (1 citation) DOI - 10.5524/102242
GCA - NCBI genaome assembly (Showing 34 of 34)
- (1 citation) GCA - GCA_000696995.1
- (1 citation) GCA - GCA_000696975.1
- (1 citation) GCA - GCA_000738605.1
- (1 citation) GCA - GCA_000696955.1
- (1 citation) GCA - GCA_000738585.1
- (1 citation) GCA - GCA_000697355.1
- (1 citation) GCA - GCA_000696895.1
- (1 citation) GCA - GCA_000697235.1
- (1 citation) GCA - GCA_000738565.1
- (1 citation) GCA - GCA_000697295.1
- (1 citation) GCA - GCA_000938895.1
- (1 citation) GCA - GCA_000697255.1
- (1 citation) GCA - GCA_000697275.1
- (1 citation) GCA - GCA_000697135.1
- (1 citation) GCA - GCA_000825725.1
- (1 citation) GCA - GCA_000697115.1
- (1 citation) GCA - GCA_000738595.1
- (1 citation) GCA - GCA_000697175.1
- (1 citation) GCA - GCA_000697155.1
- (1 citation) GCA - GCA_000697195.1
- (1 citation) GCA - GCA_000697725.1
- (1 citation) GCA - GCA_000697475.1
- (1 citation) GCA - GCA_000697495.1
- (1 citation) GCA - GCA_000697015.1
- (1 citation) GCA - GCA_900000135.1
- (1 citation) GCA - GCA_000696915.1
- (1 citation) GCA - GCA_000738555.1
- (1 citation) GCA - GCA_000696935.1
- (1 citation) GCA - GCA_000787465.1
- (1 citation) GCA - GCA_000697055.1
- (1 citation) GCA - GCA_000697035.1
- (1 citation) GCA - GCA_000697075.1
- (1 citation) GCA - GCA_000697095.1
- (1 citation) GCA - GCA_000697605.1
Show less
Similar Articles
To arrive at the top five similar articles we use a word-weighted algorithm to compare words from the Title and Abstract of each citation.
Sequencing smart: De novo sequencing and assembly approaches for a non-model mammal.
Gigascience, 9(5):giaa045, 01 May 2020
Cited by: 9 articles | PMID: 32396200 | PMCID: PMC7216774
Coverage-Versus-Length Plots, a Simple Quality Control Step for de Novo Yeast Genome Sequence Assemblies.
G3 (Bethesda), 9(3):879-887, 07 Mar 2019
Cited by: 17 articles | PMID: 30674538 | PMCID: PMC6404606
Single-Molecule Real-Time Sequencing Combined with Optical Mapping Yields Completely Finished Fungal Genome.
mBio, 6(4):e00936-15, 18 Aug 2015
Cited by: 86 articles | PMID: 26286689 | PMCID: PMC4542186
Mind the gap; seven reasons to close fragmented genome assemblies.
Fungal Genet Biol, 90:24-30, 02 Sep 2015
Cited by: 53 articles | PMID: 26342853
Review
Genetic variation and the de novo assembly of human genomes.
Nat Rev Genet, 16(11):627-640, 07 Oct 2015
Cited by: 181 articles | PMID: 26442640 | PMCID: PMC4745987
Review Free full text in Europe PMC
Funding
Funders who supported this work.
European Regional Development Fund
Gordon and Betty Moore Foundation (1)
Grant ID: GBMF9742
Horizon 2020 Framework Programme
Spanish Ministry of Science and Innovation (1)
Grant ID: PGC2018-099921-B-I00