

6. Small Variant (SNP/MNP) Analysis


6.1 Introduction

Most organisms within a particular species differ very little in their genomic structure. A single nucleotide difference between two homologous DNA sequences at a specific locus is referred to as a SNP, or Single Nucleotide Polymorphism, and is the most common variation observed within a species. This variation could be due to insertion, deletion, or substitution of a nucleotide. The variations are referred to as allele changes at the specific chromosomal co-ordinates. Typically, SNPs commonly observed in a population exhibit two alleles: a major allele, which is more prevalent, and a relatively rarely occurring minor allele. Such variations, if they occur in genic regions, can result in changed phenotypes and may even cause disease.

Sometimes, SNPs occur together. A set of adjacent SNPs is termed a Multiple Nucleotide Polymorphism or MNP. Thus MNP alleles have multiple nucleotides.

Microarrays for SNP detection use probes with specific targets while searching for SNPs. Next-generation sequencing (NGS) allows SNP identification without prior target information. The high coverage possible in NGS also facilitates discovery of rare alleles within population studies.

SNP detection algorithms compare the nucleotides present on aligned reads against the reference, at each position. Based on the distribution of As, Ts, Gs, and Cs at that position, and the likelihood of error, a judgement is made as to the existence of a SNP.

Some issues that must be handled by SNP detection algorithms are mentioned below:

  • Quality of base-calls

    A sequencer outputs a sequence of nucleotides corresponding to each read. It also assigns a quality value based on the confidence with which a particular base is called. Clearly, SNP detection algorithms must put greater weight on bases called with higher confidence.

  • Mapping quality of reads

    Most alignment algorithms assign quality scores to a read based on how well the read aligned with the reference. These scores are relevant in SNP detection because they measure the likelihood of a read originating from the suggested position on the reference. Even if the individual bases on a read are called with high quality values, the read may align imperfectly with the reference. The mapping quality score takes into account the inserts, deletes, and substitutions necessary for alignment at a particular position.

  • Depth of coverage

The number of reads ``covering'' a position also determines how confidently a SNP can be called: greater sequencing depths lead to higher SNP-calling accuracy.

  • Homopolymer

    A homopolymer is a repetitive sequence of a single nucleotide, e.g., AAAAAA. Sequencers often exhibit inaccurate representations of homopolymers and their immediate neighbors due to limitations in the next-generation sequencing technology. Such regions need to be handled carefully by SNP detection algorithms.

  • Ploidy

    Ploidy is the number of sets of chromosomes in a cell. Haploid organisms have one set of chromosomes per cell, while diploid organisms like humans have two. Polyploidy, the state where all cells have more than two sets of chromosomes, is chiefly observed in plants. SNP detection must take the ploidy into account while calling SNPs. For example, in a sample from a haploid organism, each position can correspond to only one nucleotide. Reads must ideally agree at each position, and any deviation is easily detected as a sequencing error. Diploid organisms inherit one set of chromosomes from each parent, so reads can show two nucleotides at each position, one from each set.

    Currently, SNP analysis in Strand NGS assumes that the ploidy is two. In this case a SNP can be called under two circumstances:

    1. Homozygous SNP

      When the consensus reached from the sample reads shows a single nucleotide at the location and this consensus differs from the reference nucleotide. In Fig. 6.1, the highlighted position shows a T in the reference genome but a consensus nucleotide of C among the reads. The nucleotide G in one of the reads is attributed to sequencing error.

      Figure 6.1: Homozygous SNP

    2. Heterozygous SNP

      Alternatively, the sample reads may show considerable presence of two different nucleotides; typically one of the two agrees with the reference nucleotide. The two alleles should ideally show close to 50% presence each. In Fig. 6.2, the reference genome shows a T at the locus of interest, whereas the reads show a combination of T and C bases.

      Figure 6.2: Heterozygous SNP

In the following we describe the two main steps in the SNP analysis: i) SNP Detection, and ii) SNP Effect Analysis. SNP Detection identifies the SNPs by comparing the given sample's genome with the reference genome. SNP Effect Analysis takes the list of SNPs detected in the first step and generates a report indicating the effect that these SNPs have on the genes in a given annotation.

6.2 Local Realignment

6.2.1 Background and Motivation

Since alignment algorithms map each read independently to the reference genome, they may produce alignment artefacts: SNPs or indels can be improperly placed with respect to their true location. This can happen because of 1) sequencing errors, 2) several equally likely positions for the variant, and 3) adjacent variants or nearby errors [40]. In particular, insertions and deletions, especially towards the ends of reads, are difficult to anchor and resolve without the use of multiple reads.

These misalignments can lead to false positive SNP calls as well as skewed recalibrated base quality scores, and can thereby critically impact downstream results. The most important objective of realigning reads around indels is therefore to reduce the number of false positive variant calls.

Consider, for example, the case shown in Fig. 6.3, which demonstrates such alignment artefacts. If we shift the 6 reads having C,T as mismatched bases to the left, treating the C as an insertion, then the T bases cease to be mismatches and the inserted sequence matches the other reads that already had an insertion at that locus.

Figure 6.3: Example showing alignment artefacts in a potential genomic region for local realignment.

As another example, consider the case shown in Fig. 6.4. In the reads that show mismatches, if we shift TG to the right to allow for 3 deletions, there are no mismatches left, and the percentage of reads supporting the deletion increases.

Figure 6.4: Example showing alignment artefacts in a potential genomic region for local realignment.

6.2.2 Local Realignment Approach

Some of the popular approaches for local realignment include GATK [42], Dindel [38] and SRMA [40]. GATK and Dindel, barring a few differences, use conceptually the same method, and we have adopted ideas from both of these approaches in Strand NGS. SRMA, on the other hand, follows a different, assembly-inspired approach coupled with short read alignment: it builds a variant graph for each region from the initial alignments and then realigns all reads according to the variant graph. We do not discuss the SRMA approach here, as we have mostly adopted the ideas from GATK and Dindel.

Below are the four basic steps that are used in our implementation:

  • Identify regions of reads to be realigned
  • Generate candidate haplotypes in each identified region
  • Infer most likely haplotypes
  • Realign reads to most likely haplotypes

In the following we will discuss each of the above four steps in more detail:

6.2.2.1 Identification of candidate regions for local realignment

The genome is first divided into windows of non-overlapping reads, and then each window is considered separately to identify candidate regions for local realignment. In each window, we look at all the indels and evaluate the local region around each one of them. Specifically, we check the following three things for each indel:
  • Minimum indel coverage: Unlike the approach adopted by GATK, which considers all indels that are supported by at least one read, we require every indel to be supported by at least 5 reads. Indels that do not satisfy this criterion are discarded.
  • Mismatches in the neighbourhood of the indel: For every indel that satisfies the `minimum coverage' criterion, we check if it has at least one mismatch in its neighbourhood ($\pm 20$ bp).
  • Possibility of merging nearby indels: Indels that satisfy the `minimum coverage' requirement and have mismatches in their neighbourhood are further evaluated to assess the possibility of merging them into a single candidate region. If two indels are within a fixed distance $ l_{m}$ (called the indel merging window width) of each other, they are considered as a single common region for realignment.

To output the final candidate regions for local realignment, we use a parameter called the `indel effect boundary' window of length $ l_{e}$, and determine the left and right effect boundaries for each region identified in the previous step (a region may comprise either a single indel or multiple nearby indels). For a single-indel region, the length of the candidate region equals $ l_{e}$, with the indel occurring in the middle of the effect boundary window. For a merged indel region, the left effect boundary is governed by the leftmost indel (a distance of $ l_{e}/2$ to its left) and the right effect boundary by the rightmost indel (a distance of $ l_{e}/2$ to its right). A sketch of this region-finding logic follows the example below.

To further illustrate the steps involved in the identification of candidate regions for local realignment, consider the example shown in Fig. 6.5.

Figure 6.5: Identification of candidate regions for local realignment.

This figure shows 5 indels (indicated in red) in the region. For each indel we check 1) minimum indel coverage; 2) mismatches in the local neighborhood; and 3) the possibility of merging indels. Only those indels are retained that satisfy both criteria 1) and 2). Of the five indels considered, the 4th does not satisfy the minimum coverage criterion and is hence discarded; the 5th has no mismatches in its local neighborhood and is hence discarded too. For the remaining three indels, we check the merging criterion. As shown in the figure, based on the window size used, the first two indels can be merged, while the third stands by itself. Finally, to obtain the candidate regions, the left and right effect boundaries of each region (whether merged or single) are determined using the `effect boundary' window, as shown in the figure.
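A minimal sketch of this region-finding logic is given below, assuming indels arrive as simple position/support records; the helper `has_nearby_mismatch` and the values chosen here for $ l_{m}$ and $ l_{e}$ are illustrative, not the actual Strand NGS defaults.

```python
# Sketch: identify candidate regions for local realignment.
# Each indel is a dict with 'pos' (coordinate) and 'support'
# (number of reads containing the indel). has_nearby_mismatch(pos, w)
# is assumed to report whether any read mismatches the reference
# within +/- w bp of pos.

MIN_INDEL_COVERAGE = 5     # minimum supporting reads per indel
NEIGHBOURHOOD = 20         # mismatch search window, +/- bp
L_MERGE = 50               # indel merging window width l_m (assumed value)
L_EFFECT = 100             # indel effect boundary window l_e (assumed value)

def candidate_regions(indels, has_nearby_mismatch):
    # Steps 1 and 2: keep indels with enough support and >= 1 nearby mismatch.
    kept = [i for i in sorted(indels, key=lambda x: x['pos'])
            if i['support'] >= MIN_INDEL_COVERAGE
            and has_nearby_mismatch(i['pos'], NEIGHBOURHOOD)]

    # Step 3: merge indels that lie within l_m of each other.
    groups = []
    for indel in kept:
        if groups and indel['pos'] - groups[-1][-1]['pos'] <= L_MERGE:
            groups[-1].append(indel)
        else:
            groups.append([indel])

    # Effect boundaries: l_e/2 beyond the leftmost and rightmost indel.
    return [(g[0]['pos'] - L_EFFECT // 2, g[-1]['pos'] + L_EFFECT // 2)
            for g in groups]
```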

6.2.2.2 Generation of candidate haplotypes

Once we have the candidate regions for local realignment as identified in the previous step, we generate candidate haplotypes in each of the candidate regions. The steps are depicted in Fig. 6.6. First, a partition set is created using the start and end positions of all the reads aligned to the region. Then, in each partition or sub-block, we count all possible haplotypes and record their frequencies (this step is similar to the Dindel approach [38]). This is repeated for all partitions, and eventually the n block haplotypes with the highest empirical frequency are obtained. Using these n block haplotypes, we have at most $ 2^{n}$ candidate haplotypes. In addition, we allow known variants to be provided from outside and consider at most $ k$ variants per candidate region. In that case too we consider all possible combinations, giving at most $ 2^{k} \times 2^{n}$ candidate haplotypes to be evaluated using the read data (a sketch of this enumeration follows Fig. 6.6).
Figure 6.6: Candidate Haplotype Generation (figure taken from Albers et al., Genome Research 2010).
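A minimal sketch of this enumeration, assuming for simplicity that variants are single-base substitutions given as (position, reference allele, alternate allele) tuples; real block haplotypes also carry indels.

```python
# Sketch: enumerate candidate haplotypes as all subsets of a small
# variant set. Variants are (pos, ref_allele, alt_allele) substitutions
# for simplicity.
from itertools import combinations

def apply_variants(reference, variants):
    seq = list(reference)
    for pos, ref, alt in variants:
        assert seq[pos] == ref, "variant does not match the reference"
        seq[pos] = alt
    return ''.join(seq)

def candidate_haplotypes(reference, variants):
    # All 2^n subsets of n variants, each giving one candidate haplotype.
    haplotypes = set()
    for r in range(len(variants) + 1):
        for subset in combinations(variants, r):
            haplotypes.add(apply_variants(reference, subset))
    return haplotypes

# Two variants in a region yield up to 2^2 = 4 candidate haplotypes.
print(candidate_haplotypes("ACGTACGT", [(2, 'G', 'T'), (5, 'C', 'A')]))
```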

6.2.2.3 Inferring most likely haplotypes

Once we have obtained the candidate haplotypes, they are evaluated using all the reads. To do this, each read is aligned to each haplotype in a gapless manner and a score is calculated as shown in Eqn. 6.1. If $ R_{j}$ is the $ j^{th}$ read and $ H_{i}$ is the $ i^{th}$ haplotype, we define the score $ L$ as follows:

$\displaystyle L(R_{j}\vert H_{i}) = \prod_{k} L(R_{j,k}\vert H_{i,k+s_{ji}})$ (6.1)

$\displaystyle L(R_{j,k}\vert H_{i,k+s_{ji}}) = \begin{cases}1-\epsilon_{j,k}, & \text{if $R_{j,k} = H_{i,k+s_{ji}}$}\\ \epsilon_{j,k}, & \text{if $R_{j,k} \neq H_{i,k+s_{ji}}$}\end{cases}$ (6.2)

Here, $ R_{j,k}$ is the $ k^{th}$ base of read $ R_{j}$, $ \epsilon_{j,k}$ is the error rate corresponding to the declared quality score for that base, and $ s_{ji}$ is the offset in the gapless alignment of $ R_{j}$ to $ H_{i}$.

The $ L$-scores are combined across all the reads to get the score for each haplotype, as shown in Eqn. 6.3. This is similar to the approach followed in GATK. However, GATK chooses only the one alternate haplotype that yields the maximum score. In our approach, we evaluate all the haplotypes, including the reference haplotype, but choose the best two based on their scores. The selected two may or may not include the reference haplotype.

$\displaystyle L(H_{i}) = \prod_{j} L(R_{j} \vert H_{i})$ (6.3)

6.2.2.4 Realigning reads using most likely haplotypes

Finally, to realign the reads, we select the two best haplotypes $ H_{i_{1}}$ and $ H_{i_{2}}$ (as mentioned earlier, unlike GATK, these may or may not include the reference haplotype) based on their scores. However, reads are realigned to the two-haplotype model instead of the single reference-haplotype model only if the log odds ratio of the two-haplotype model over the single-haplotype model is better by at least 5 units, as shown in Eqn. 6.4.

$\displaystyle \frac{L(H_{i_{1}},H_{i_{2}})}{L(H_{o})} = \frac{\prod_{j} max(L(R_{j}\vert H_{i_{1}}),L(R_{j}\vert H_{i_{2}}))}{ \prod_{j} L(R_{j}\vert H_{o})}$ (6.4)
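A sketch putting Eqns. 6.1-6.4 together is given below; it assumes the gapless-alignment offsets $ s_{ji}$ are precomputed and base qualities are on the Phred scale, and is an illustration rather than the exact Strand NGS implementation.

```python
# Sketch of Eqns. 6.1-6.4: gapless read-vs-haplotype likelihoods
# (in log10 space to avoid underflow), haplotype scores, and the
# two-haplotype-model log-odds test.
import math

def log10_read_likelihood(read, quals, haplotype, offset):
    # log10 L(R_j | H_i): (1 - eps) for a match, eps for a mismatch (Eqn 6.2).
    total = 0.0
    for k, base in enumerate(read):
        eps = 10 ** (-quals[k] / 10)          # error rate from base quality
        match = haplotype[k + offset] == base
        total += math.log10(1 - eps if match else eps)
    return total

def log10_haplotype_score(reads, haplotype, offsets):
    # log10 L(H_i) = sum over reads of log10 L(R_j | H_i)  (Eqn 6.3).
    return sum(log10_read_likelihood(r, q, haplotype, o)
               for (r, q), o in zip(reads, offsets))

def two_haplotype_log_odds(reads, h1, o1, h2, o2, href, oref):
    # log10 of Eqn 6.4: per-read max over {H_i1, H_i2} vs the reference H_0.
    ratio = 0.0
    for (r, q), a, b, c in zip(reads, o1, o2, oref):
        best = max(log10_read_likelihood(r, q, h1, a),
                   log10_read_likelihood(r, q, h2, b))
        ratio += best - log10_read_likelihood(r, q, href, c)
    return ratio   # realign only if this exceeds the threshold (5 units)
```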

6.3 Split Read Realignment

Split read realignment (SRR) aims to increase support for low frequency structural variants (SV) by splitting and realigning poorly aligned reads.

6.3.1 Preprocessing

The primary requirement for running split read realignment is that the alignment is done in Strand NGS with the Split Alignment option enabled. This is because preprocessing involves formation of cluster pairs (based on existing completely split reads) which are used to realign poorly aligned reads. Intersecting clusters are discarded. The cluster generation is done on the input read list and the output from this step is a region list containing the cluster coordinates. The details of the algorithm used for cluster generation can be found in section 7.4.

6.3.2 Realignment

Every cluster region produced by the preprocessing step contains information about its partner cluster, specifically the chromosome and the start and end coordinates. It also contains the breakpoints of both itself and the partner cluster. Based on this information, for each cluster we identify the candidate reads for realignment. These reads can be classified into two types:

  1. Partially Aligned Reads: A read is partially aligned when only one segment of the read aligns. The other segment might not have aligned due to no matches or too many matches.
  2. Noisy Reads: A read is called noisy if, from either end of the read, there are more than 4 events in a neighborhood of 10 contiguous bases.

Furthermore, a partially-aligned read in a cluster is considered only if the direction of the clipped end of the read is the same as the direction of the breakpoint for that particular cluster and if the last base of the clipped end is not more than 25 bases away from the breakpoint. Similarly a noisy read is considered only if the end with the mismatches is in the same direction as the breakpoint of that cluster.

For every cluster pair, the reference sequence corresponding to each cluster's chromosomal coordinates is fused to the other with a defined padding. This padding is applied on both sides of the cluster, and its length depends on the read (refer to Fig. 6.7). For a partially aligned read, the padding length is equal to the clipped length of the read. For a noisy read, the padding length is equal to the length of the noisy part. The orientation of the individual clusters is taken into account while fusing the reference fragments. Once a fused reference is generated, the candidate read is split-aligned against it. A read is said to be realigned if the combined alignment score of the two fragments of the read crosses the threshold specified for split alignment under Tools $\rightarrow$ Options $\rightarrow$ Alignment $\rightarrow$ Split Alignment. After split read realignment, if the realigned clipped/noisy segment is less than 5 bases, the realignment is discarded.
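A rough sketch of the fused-reference construction and the acceptance test, with the split aligner itself and cluster-orientation handling abstracted away (all names and signatures are illustrative):

```python
# Sketch: fuse the reference fragments of a cluster pair with
# read-dependent padding, then test a candidate read against it.
# genome: dict chromosome -> sequence; clusters: (chrom, start, end).

def fused_reference(genome, cluster_a, cluster_b, padding):
    def padded(chrom, start, end):
        seq = genome[chrom]
        return seq[max(0, start - padding) : min(len(seq), end + padding)]
    # Orientation handling (reverse-complementing one fragment when the
    # clusters disagree in direction) is omitted for brevity.
    return padded(*cluster_a) + padded(*cluster_b)

def try_realign(read, pad_len, genome, cluster_a, cluster_b,
                split_align, score_threshold, min_segment=5):
    # pad_len = clipped length (partially aligned read) or noisy length
    # (noisy read); the caller passes the appropriate value.
    ref = fused_reference(genome, cluster_a, cluster_b, padding=pad_len)
    score, realigned_len = split_align(read, ref)   # user-supplied aligner
    # Accept only if the combined score passes the split-alignment
    # threshold and the realigned segment is at least 5 bases long.
    return score >= score_threshold and realigned_len >= min_segment
```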

Figure 6.7: Split Read Realignment

6.4 Base Quality Recalibration

6.4.1 Background and Motivation

The base quality score is the confidence of the sequencer in calling a base. In other words, base quality reflects the probability of the sequencer making an error and calling an incorrect base. Typically, the base quality score is reported on a Phred scale, as defined below:

Base Quality Score$\displaystyle = -10 \log_{10} (p)$ (6.5)

For example, a base quality score of 20 on the Phred scale means an error probability of 1 in 100, a score of 30 means an error probability of 1 in 1000, and so on. Sequencing errors may result either from random sequencer noise or from systematic biases. The systematic biases could be machine-, run-, or sample-specific. A Bayesian variant caller, like the one used in Strand NGS, uses the base qualities in calling variants. Therefore, inaccurate or biased base qualities may result in spurious variant calls. If base qualities are over-estimated (under-estimated), the result may be false positive (false negative) variant calls. This problem may not be severe for high coverage samples; however, for low coverage samples, recalibrating the base quality scores may help reduce spurious variant calls.
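A two-line sketch of these conversions (Eqn. 6.5 and its inverse):

```python
# Phred-scale conversions (Eqn 6.5): quality <-> error probability.
import math

def phred_to_error(q):
    return 10 ** (-q / 10)

def error_to_phred(p):
    return -10 * math.log10(p)

print(phred_to_error(20))            # 0.01  (1 error in 100)
print(round(error_to_phred(0.001)))  # 30
```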

6.4.2 Types of Approaches

There are two types of approaches for base quality score recalibration.
  • Data driven empirical approach: This is presented in the paper by DePristo et al. [42] and implemented in the software suite GATK. In this approach, a recalibration table is built, which can then be used to recalibrate base quality scores. To populate the table, this approach uses all the data. In fact, if multiple samples are available, it is recommended to pool the data.
  • Machine learning based logistic regression approach: This is presented in the paper by Cabanski et al. [41] and implemented in an R package called ``ReQON''. In this approach, a logistic regression model is learned and used to recalibrate the quality scores. For model training, ReQON can train on a smaller set of data. In fact, for large training data, the logistic regression model can be built on separate data chunks and the median regression coefficients used.
Based on the above, it is apparent that the ReQON model has relatively solid theoretical underpinnings and can be used with smaller training data. However, the ease of implementation, usability, and popularity of GATK's empirical approach motivated us to adopt the empirical model in Strand NGS for base quality score recalibration. Below we describe the empirical approach in more detail.

6.4.3 Empirical Approach for Base Quality Score Recalibration

The empirical method for base quality score recalibration, as proposed in [42], identifies the following four covariates on which the base quality score depends.
  • Machine Cycle: Depending on the sequencer, the machine cycle can be represented either as a flow cycle (e.g., 454 and Ion Torrent) or as a discrete cycle (e.g., Illumina and PacBio). In a flow cycle, each flow grabs all the bases it can in the TACG order within a single cycle. Consider, for example, the sequence AAACCCCGAAATTTTTACTC: the flow cycle is 1 for the initial A's, C's, and G, since they occur in the TACG order. The next set of A's forms a cycle by itself. The subsequent T's, A, and C are captured in the 3rd cycle, followed by TC in the 4th cycle. (A sketch of this computation follows this list.) In a discrete cycle, the machine cycle is simply the position in the read, counted in the reverse direction for a negative strand read.
  • Nucleotide Context: Another covariate of interest is the nucleotide context, which captures sequence chemistry effects. The fundamental question here is: what is the likelihood of mis-calling a base A if it is preceded by A or T or G or C? In other words, is the error probability independent of the preceding base(s) or dependent upon them? It turns out that the preceding base does make a difference to the sequencing error of the called base. Although this effect can be studied for one or more preceding bases, we consider only the di-nucleotide context, i.e., the effect of one preceding base on the likelihood of a sequencing error in the called base. For homopolymer stretches, however, we use a longer context of up to 8 bases. This gives a total of 40 combinations (16 for dinucleotides and 24 for homopolymer stretches).
  • Read Group: Another covariate that can affect the base quality scores is the read group. The sample ID is typically taken as the read group. For Illumina, which has a lane concept, (sample ID + lane ID) can be used as the read group.
  • Original Base Quality: Finally, the reported original base quality is used as another covariate, since the recalibrated base quality score, in addition to depending upon the other covariates discussed above, of course depends upon this as well.
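The flow-cycle computation described above can be sketched as follows, assuming the TACG flow order; this illustrates the covariate only and is not the sequencer's actual logic.

```python
# Sketch: assign a flow-cycle number to each base of a read, given a
# flow order such as "TACG".

def flow_cycles(read, flow_order="TACG"):
    cycles, cycle, flow_idx = [], 1, 0
    for base in read:
        # Advance flows until the current flow matches this base; wrapping
        # past the end of the flow order starts a new cycle.
        while flow_order[flow_idx] != base:
            flow_idx += 1
            if flow_idx == len(flow_order):
                flow_idx, cycle = 0, cycle + 1
        cycles.append(cycle)
    return cycles

# AAACCCCG -> cycle 1, next AAA -> cycle 2, TTTTT A C -> cycle 3, TC -> cycle 4
print(flow_cycles("AAACCCCGAAATTTTTACTC"))
```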

In Strand NGS, by default, the three covariates machine cycle, nucleotide context, and original base quality are used. Once the covariates are identified, the empirical base quality score recalibration method involves two steps:

  • Empirical model training (i.e., generating the recalibration table): To generate the recalibration table from the data, the basic assumption is to consider all mismatches as sequencing errors, except for mismatches at locations known to vary, e.g., those in dbSNP. This is done to avoid true variations being counted as sequencing errors. For the remaining locations, we look at every base and categorise it into one of the bins defined by the covariates described above, i.e., original reported base quality, machine cycle, and nucleotide context. For three covariates, visualising these bins yields a cube in which each cell represents one bin, with each bin containing several data points (base locations in this case). Finally, to generate and populate the recalibration table, for each cell we compute the number of mismatching bases and the total number of bases from these data points. One can now easily appreciate why the availability of more data helps in generating a more accurate recalibration table. (A sketch of both steps is given after this list.)

    $\displaystyle mismatches(R, C, D) = \sum_{r \in R} \sum_{c \in C} \sum_{d \in D} b_{r,c,d} \neq b_{ref}$ (6.6)

    $\displaystyle bases(R, C, D) = \sum_{r \in R} \sum_{c \in C} \sum_{d \in D} \vert b_{r,c,d} \vert$ (6.7)

    Here capital letters $ R$, $ C$, and $ D$ denote the set of all reported base quality scores, machine cycle, and nucleotide context respectively. On the other hand, small letters $ r$, $ c$, and $ d$ denote specific instances or values of reported base quality scores, machine cycle, and nucleotide context respectively. $ b$ is used to denote the base call, and $ ref$ is used to denote the reference base.

    Once we have determined the number of mismatches and the total number of bases in each cell, we can compute the empirical base quality scores: the empirical error rate is the ratio of mismatches to total bases, corrected using the Yates correction, and the empirical quality is this error rate expressed on the Phred scale. The Yates correction is applied to avoid division by zero: we start with two bases in each bin and assume that one was a mismatch and the other was not, so 1 is added to the mismatch count and 2 is added to the count of total bases.

    $\displaystyle Qempirical(R, C, D) = \frac{mismatches(R, C, D) + 1}{bases(R, C, D) + 2}$ (6.8)

  • Recalibrate base quality scores using the generated recalibration table: After generating the recalibration table as explained above, the quality score of each base in every read can be recalibrated using the equation below:

    $\displaystyle recal(r, c, d) = Qempirical(r, C, D) + \left[Qempirical(r, c, D) - Qempirical(r, C, D)\right] + \left[Qempirical(r, C, d) - Qempirical(r, C, D)\right]$ (6.9)

    Here $ recal(r, c, d)$ denotes the recalibrated base quality score for a specific value of the original reported quality $ r$, machine cycle $ c$, and nucleotide context $ d$. The right-hand side terms are the empirical qualities obtained in the first step as shown in Eqn. 6.8; they are simply the projections of the cube onto 2- or 1-dimensional spaces. Capital letters indicate summation or marginalization over that covariate, and small letters indicate specific values of the variable.

    It is also interesting to note, from the way this equation is represented, that it is fairly easy to incorporate additional covariates.
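A compact sketch of both steps is given below. It assumes observations at known variant sites were excluded upstream, uses (reported quality, machine cycle, context) bins, and expresses empirical qualities on the Phred scale; the delta decomposition follows the description of Eqn. 6.9. All names are illustrative.

```python
# Sketch: build the recalibration table (Eqns 6.6-6.8) and recalibrate
# a base (Eqn 6.9). Bins are keyed by (reported quality r, machine
# cycle c, nucleotide context d).
import math
from collections import defaultdict

counts = defaultdict(lambda: [0, 0])    # (r, c, d) -> [mismatches, bases]

def observe(r, c, d, is_mismatch):
    cell = counts[(r, c, d)]
    cell[0] += int(is_mismatch)
    cell[1] += 1

def q_empirical(keys):
    # Empirical quality over a set of cells, with the Yates correction
    # (+1 mismatch, +2 bases) to avoid division by zero.
    mm = sum(counts[k][0] for k in keys) + 1
    n = sum(counts[k][1] for k in keys) + 2
    return -10 * math.log10(mm / n)

def recalibrate(r, c, d):
    by_r = [k for k in counts if k[0] == r]    # marginalise over C and D
    by_rc = [k for k in by_r if k[1] == c]     # projection onto (r, c)
    by_rd = [k for k in by_r if k[2] == d]     # projection onto (r, d)
    q_r = q_empirical(by_r)
    # Qemp for r plus the machine-cycle and context deltas (Eqn 6.9).
    return q_r + (q_empirical(by_rc) - q_r) + (q_empirical(by_rd) - q_r)
```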


6.5 SNP Detection

The SNP detection algorithm in Strand NGS is capable of detecting three types of variants, namely substitutions (or mutations), deletions, and insertions. While reporting, these variants are categorized into four events, namely substitution, insertion, deletion, and complex. A substitution event consists of one or more substitutions occurring at consecutive locations; similarly, deletion and insertion events comprise one or more deletions or insertions, as the case may be, occurring together. A deletion is modeled as a mutation from a nucleotide value $ \{$A, T, G, C$ \}$ to ``gap''. Complex events are reported when a mixture of multiple types of variants occurs together at consecutive locations.

The SNP detection algorithm works on a per-sample basis. For each sample, data for each chromosome is analyzed, with a capability for parallel computation per chromosome. The computation is optimized for memory usage by segregating the data into windows based on maximum read size.


6.5.1 Pre-processing: Selection of Potential SNP Sites

SNP identification is only performed at sites determined by pre-processing to be likely SNP locations. Locations are considered potential SNP sites only if they satisfy all of the following requirements (a sketch of these checks is given at the end of this list):

  • Read coverage threshold: The number of reads covering the location must exceed a user-defined threshold.
  • Variant coverage threshold: The number of reads covering the location and differing from the reference at the location, must also exceed a user-defined threshold.
  • Homopolymer effects: In some technologies, reads covering homopolymers and adjacent nucleotides are more prone to errors in sequencing and alignment.
    • Homopolymer threshold: Homopolymers are determined by sequences repeating a single nucleotide value more than a threshold number of times; this is set by the user-defined homopolymer threshold. Currently, SNPs are not called at locations within a homopolymer stretch.
      Figure 6.8: SNP calling close to a homopolymer
      In Fig. 6.8, the column shaded in pink lies at the end of a homopolymer of length 7 bases. At this position, both the Read Coverage (11) and Variant coverage (3 Ts and 2 gaps = 5 variants) satisfy the related default thresholds of 10 and 2 respectively. However, as the position lies within a homopolymer it will not be considered for SNP calls.

    • Spillover threshold: If the length of the homopolymer exceeds a certain user-defined spillover threshold, bases at locations immediately adjacent to the homopolymer that are the same as the homopolymer base are discarded.
      Figure 6.9: Heterozygous SNP adjacent to a homopolymer
      The examples given in Figures 6.8 and 6.9 illustrate this point. The column shaded blue in each case is adjacent to a homopolymer of length 7. This position is not within a homopolymer, and has sufficient read coverage as well. However, in Fig. 6.8, the variant coverage is only 1, contributed by the gap. The two reads with nucleotide G do not contribute to the variant count, as they are considered due to the spill-over effect of the adjacent homopolymer with 7 consecutive Gs. On the other hand, in Fig. 6.9, the variant coverage is 5, caused by 1 gap and 4 Cs.
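A minimal sketch of these filters, assuming a pileup column is available as a base $\rightarrow$ count map; the homopolymer threshold value used here is assumed, and spillover handling is omitted.

```python
# Sketch of the pre-processing filters. A pileup column is a dict
# base -> count ('-' counts gaps); thresholds follow the defaults in
# the text.

READ_COVERAGE_MIN = 10      # default read coverage threshold
VARIANT_COVERAGE_MIN = 2    # default variant coverage threshold
HOMOPOLYMER_MIN = 5         # assumed homopolymer threshold

def in_homopolymer(reference, pos, min_len=HOMOPOLYMER_MIN):
    # True if pos lies inside a single-nucleotide run of length >= min_len.
    left = right = pos
    while left > 0 and reference[left - 1] == reference[pos]:
        left -= 1
    while right + 1 < len(reference) and reference[right + 1] == reference[pos]:
        right += 1
    return right - left + 1 >= min_len

def is_potential_snp_site(column, ref_base, reference, pos):
    # Spillover filtering of bases adjacent to long homopolymers is
    # omitted here for brevity.
    coverage = sum(column.values())
    variants = coverage - column.get(ref_base, 0)
    return (coverage > READ_COVERAGE_MIN
            and variants > VARIANT_COVERAGE_MIN
            and not in_homopolymer(reference, pos))
```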


6.5.2 Bayesian SNP Calling Algorithm

At every location declared significant in the pre-processing step, the Bayesian SNP calling algorithm is applied to identify the genotype at that location, and report the SNP if detected. The algorithm considers the nucleotide or the base taken by each read covering the location, as well as its associated base quality and finds the consensus genotype. The consensus genotype is the one most likely to have caused the observed data and is computed using Bayes principle.

If the observed data is collectively denoted by $ \cal D$, then the task is to infer the consensus genotype from $ \cal D$. SNPs are then detected by comparison to the reference sequence. The ploidy of the sample determines the number of nucleotide inferences necessary to infer the underlying genotype $ \hat {\cal G}$. As Strand NGS currently only handles diploid organisms, the number of nucleotides to be concluded (the length of vector $ \hat {\cal G}$) is two.

The algorithm selects the genotype $ \hat {\cal G}$ that maximizes the posterior probability $ P(\cal G \vert\cal D)$, defined as the probability of a genotype $ \cal G$ given the observed data. In other words, the genotype concluded is the one most likely to have caused the particular observations:

$\displaystyle \hat {\cal G} = \mathop{\mathrm{argmax}}_{\cal G} P({\cal G} \vert {\cal D}).$

This is easily computed under Bayes principle as
$\displaystyle P({\cal G} \vert {\cal D}) = \frac{P({\cal D} \vert {\cal G}) \cdot P({\cal G})}{P({\cal D})}$

where $ P(\cal G) $ is the prior probability of genotype $ \cal G$ and $ P(\cal D)$ is the likelihood of data, calculated over all possible genotypes,
$\displaystyle P({\cal D}) = \sum_{\forall {\cal G}_i}P({\cal D}\vert{\cal G}_i)\cdot P({\cal G}_i).$      

The conditional probability $ P (\cal D \vert \cal G)$ represents the probability of observing the data $ \cal D$ under the given genotype $ \cal G$. (See below for the details of the calculation of the prior and the conditional probabilities.)

Combining the above-mentioned equations, we arrive at the consensus genotype, given by

$\displaystyle \hat {\cal G} = \mathop{\mathrm{argmax}}_{\cal G} \frac{P({\cal D} \vert {\cal G}) \cdot P({\cal G})}{\sum_{\forall {\cal G}_i} P({\cal D} \vert {\cal G}_i) \cdot P({\cal G}_i)}$

Strand NGS specifies a score to represent the confidence with which a SNP is called at a locus. This score is a function of the posterior probability that the locus has the reference genotype $ RR$ given the observed data, and is defined as

$\displaystyle Score = -10 \log_{10} (P({\cal G} = RR \vert {\cal D})).$

If the consensus genotype is not the reference genotype, and if the above score is greater than the user-specified cut-off, then the location is reported as a variant location and the alleles corresponding to the consensus genotype which are different from the reference are reported as the variant alleles.

Calculation of the prior probabilities $ P(\cal G) $:
We calculate the prior probabilities of different genotypes at a given location by taking into account the following parameters. The approach is similar to that in [43].

  • The reference base at the location
  • Heterozygosity: This is representative of the prior expectation of a position being heterozygous in the organism. The default value is set to $ 0.001$ which correlates with the population SNP rate for the human genome.
  • Homozygous to heterozygous ratio: This is the prior expectation of the ratio of the number of homozygous mutations to the number of heterozygous mutations. It has been observed that heterozygous variants are twice as frequent as homozygous variants in any genome. Hence the default value is set to $ 0.5$.
  • Indel to substitution ratio: This is the prior expectation of the ratio of the number of indels to the number of substitutions. Based on the information from the literature, the default value is set to $ 0.125$.
  • Ti/Tv ratio: This is the prior expectation of the ratio of the number of transitions to the number of transversions. A transition is a point mutation that changes a purine nucleotide (A or G) to another purine, or a pyrimidine nucleotide (T or C) to another pyrimidine, whereas a transversion refers to the substitution of a purine for a pyrimidine or vice versa. Although there are twice as many possible transversions, transition mutations are generated at a higher frequency than transversions because of the molecular mechanisms by which they arise. The expected value of this ratio for genome-wide variations is in the range 2.0-2.1, whereas it is in the range 3.0-3.3 for exonic variations. The default value for this parameter is set to 2.6 to make it suitable for both whole genome and exome data analysis.

The tables below give the prior probabilities computed for the default values of the parameters mentioned above. Table 6.1 gives the probabilities assuming the reference base is $ A$, whereas Table 6.2 gives the same when the reference base is a gap (this corresponds to an insertion in the read). In these tables, the rows and the columns correspond to the two alleles of the genotype. Thus, for example, the prior probability for the genotype $ AT$ is given in row $ A$ and column $ T$. Of course, the same value appears in row $ T$ and column $ A$ as well.


Table 6.1: Prior probability of genotypes of a diploid genome when Ref = A
  A C G T -
           
A 9.984 $ \times 10^{-1}$ 2.174 $ \times 10^{-4}$ 5.652 $ \times 10^{-4}$ 2.174 $ \times 10^{-4}$ 6.250 $ \times 10^{-5}$
C 2.174 $ \times 10^{-4}$ 1.087 $ \times 10^{-4}$ 1.229 $ \times 10^{-7}$ 4.726 $ \times 10^{-8}$ 1.359 $ \times 10^{-8}$
G 5.652 $ \times 10^{-4}$ 1.229 $ \times 10^{-7}$ 2.826 $ \times 10^{-4}$ 1.229 $ \times 10^{-7}$ 3.533 $ \times 10^{-8}$
T 2.174 $ \times 10^{-4}$ 4.726 $ \times 10^{-8}$ 1.229 $ \times 10^{-7}$ 1.087 $ \times 10^{-4}$ 1.359 $ \times 10^{-8}$
- 6.250 $ \times 10^{-5}$ 1.359 $ \times 10^{-8}$ 3.533 $ \times 10^{-8}$ 1.359 $ \times 10^{-8}$ 3.125 $ \times 10^{-5}$



Table 6.2: Prior probability of genotypes of a diploid genome when Ref = -
  A C G T -
           
A 7.813 $ \times 10^{-6}$ 2.441 $ \times 10^{-10}$ 2.441 $ \times 10^{-10}$ 2.441 $ \times 10^{-10}$ 1.563 $ \times 10^{-5}$
C 2.441 $ \times 10^{-10}$ 7.813 $ \times 10^{-6}$ 2.441 $ \times 10^{-10}$ 2.441 $ \times 10^{-10}$ 1.563 $ \times 10^{-5}$
G 2.441 $ \times 10^{-10}$ 2.441 $ \times 10^{-10}$ 7.813 $ \times 10^{-6}$ 2.441 $ \times 10^{-10}$ 1.563 $ \times 10^{-5}$
T 2.441 $ \times 10^{-10}$ 2.441 $ \times 10^{-10}$ 2.441 $ \times 10^{-10}$ 7.813 $ \times 10^{-6}$ 1.563 $ \times 10^{-5}$
- 1.563 $ \times 10^{-5}$ 1.563 $ \times 10^{-5}$ 1.563 $ \times 10^{-5}$ 1.563 $ \times 10^{-5}$ 9.999 $ \times 10^{-1}$


Computation of the conditional probabilities $ P({\cal D}\vert{\cal G})$:
The computation of $ P({\cal D}\vert{\cal G})$ takes into account the probability of errors observed in the data under each genotype $ \cal G$. It is assumed that the error in the observation of any base is independent of observation error in any other base. In most cases, error rates are derived from the associated base qualities as follows.

$\displaystyle \epsilon = 10^{-(\text{base quality})/10}$    

The error rates may be adjusted by taking the mapping quality into consideration. A mis-aligned read creates unreliable results irrespective of the base-qualities. Therefore, Strand NGS has an option to take into account the mapping qualities whenever provided by the aligner and uses the minimum of the read mapping quality and individual base quality in place of base quality for its computations.

In the absence of error rates, hard-coded error rates (that are technology-specific) are used. If error rates for a particular technology are not present in Strand NGS, it assumes a quality score of 30 for the base, which is interpreted as an error rate of 0.001.

If the numbers of $ A$s, $ T$s, $ C$s, $ G$s, and $ -$s in the reads at the locus being considered are $ n_{a}, n_{t}, n_{c}, n_{g},$ and $ n_{-}$, respectively, then the conditional probability $ P({\cal D}\vert{\cal G})$ can be broken down as follows.


$\displaystyle P({\cal D}\vert{\cal G}) = k \left[P(B=A \vert {\cal G})\right]^{n_{a}} \left[P(B=T \vert {\cal G})\right]^{n_{t}} \left[P(B=C \vert {\cal G})\right]^{n_{c}} \left[P(B=G \vert {\cal G})\right]^{n_{g}} \left[P(B=- \vert {\cal G})\right]^{n_{-}}$

where $ B$ is the read base at the locus under consideration, and $ k$ is the number of distinct permutations of the different read bases at that locus, equal to $ \frac{(n_{a}+n_{t}+n_{c}+n_{g}+n_{-})!}{(n_{a}!)(n_{t}!)(n_{c}!)(n_{g}!)(n_{-}!)}$. Note that this factor appears in both the numerator and the denominator of $ P({\cal G}\vert{\cal D})$, so the final result does not require it to be calculated.

If $ {\cal G}$ is homozygous, the probability of observing each of the bases can be calculated as follows.

$\displaystyle P(B=b \vert {\cal G} = <g,g>) = \begin{cases}1-\epsilon, & \text{if $b$ is present in ${\cal G}$}\\ \frac{\epsilon}{3}, & \text{if $b$ is not present in ${\cal G}$}\end{cases}$

where $ b$ is the observed base.

For heterozygous genotypes $ {\cal G} = <g_1,g_2>$, the probability is calculated as follows.


$\displaystyle P(B=b \vert {\cal G} = <g_1,g_2>) = \frac{1}{2} \times P(B=b \vert \text{read came from chromosome 1}) + \frac{1}{2} \times P(B=b \vert \text{read came from chromosome 2})$

$\displaystyle = \begin{cases}\frac{1}{2}\left(1-\epsilon\right) + \frac{1}{2}\left(\frac{\epsilon}{3}\right), & \text{if $b$ is present in ${\cal G}$}\\ \frac{\epsilon}{3}, & \text{if $b$ is not present in ${\cal G}$}\end{cases}$

6.5.3 Further Notes on SNP Detection

  • Calling out Multi Base Variants: MNP identification currently does not use a different algorithm; adjacent SNPs are grouped together to arrive at the locations, and the underlying MNP patterns are then detected from the reads spanning these locations. Note that the associated score is the minimum of the scores of all the individual SNPs comprising the MNP.

  • Calling Single-deletions: Single-deletions are also output as part of the SNP algorithm. This is because aligners introduce ``gaps'', representing deletions, into reads for better mapping. Thus at each position a read may take one of 5 values: A, T, G, C, - (gap). The gap is treated exactly like a nucleotide value, and in case of a preponderance of gaps at a location, a deletion is called. Since gaps do not have a quality score associated with them, the average of the surrounding base qualities is used for the computation.

  • Calling Insertions: The problem with calling insertions is that insertions can be of any length, and there may be more than four different variant types at a single locus (which is always between two reference bases), unlike in the case of substitutions. Hence the normal method of variant detection cannot be applied in this case.

    Here is what is done: all the insert sequences are grouped into a single type, and the probability that the location is a variant is calculated. If the score computed from this probability is above the specified cut-off, a variant is declared. Then the top two insert sequences are taken, and the final genotype is the combination of these insert sequences that gives the best score.

6.6 Low Frequency SNP Detection

For detecting low frequency variants, we use an adaptation of the standard binomial SNP calling method that takes the base qualities into account. Similar to the Bayesian approach, this method considers the bases and their associated qualities (denoted collectively by D) to assess the presence of a variant at a particular location. This is done by computing the probability, p, of observing data D' with the same total number of bases and at least as many variant bases as D, under the assumption that the location has the reference genotype and that the variants are due to sequencing error only. Only the variant allele with the highest frequency of occurrence is considered in the computation of the probability p; the other variant alleles are ignored. This probability is then converted into a score s as

$\displaystyle s = -10 \log_{10} p$

If this score is greater than 50, then the location is reported as a variant location.

The standard binomial test for calling SNPs requires one overall error rate and cannot use the qualities of all the individual bases. One way to arrive at an overall error rate from the base qualities in D is to use the error rate corresponding to the lowest base quality in D. However, this makes the results very sensitive to any outlier bases with very low quality, which may lead to false negatives. To avoid this problem, we ignore all bases with qualities below a threshold $ Q_{t}$, and run the binomial test with the error rate corresponding to the lowest quality among the remaining bases. The binomial test is calculated as

$\displaystyle -10 \log_{10} \left( \sum_{k=n}^{m+n} \binom{m+n}{k} \epsilon^{k} \left(1-\epsilon\right)^{m+n-k} \right) \geq s$

where, for a particular location at a certain $ Q_{t}$, m is the number of reference bases, n is the number of variant bases, $ \epsilon$ is the error rate, and s is the specified cut-off score.

To make the test more robust, this process is repeated for multiple values of $ Q_{t}$, and a variant is reported if it is called for any of the thresholds $ Q_{t}$. Typically $ Q_{t}$ is varied from 20 to 30, and if a variant is called in any of these tests, the method reports the highest score and the corresponding $ Q_{t}$. If different variants are called at the same locus in these multiple tests, only the one with the highest score is reported.
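A sketch of this quality-aware binomial test over multiple thresholds $ Q_{t}$ (illustrative; tie-breaking and cut-off handling in the product may differ):

```python
# Sketch: quality-aware binomial test for low frequency variants.
# bases: list of (base, quality) restricted to the reference allele and
# the most frequent variant allele at the locus.
import math

def binomial_tail(n_total, n_var, eps):
    # p = P(>= n_var errors among n_total bases | error rate eps)
    return sum(math.comb(n_total, k) * eps ** k * (1 - eps) ** (n_total - k)
               for k in range(n_var, n_total + 1))

def low_frequency_score(bases, ref_base, q_thresholds=range(20, 31)):
    best = None
    for qt in q_thresholds:
        kept = [(b, q) for b, q in bases if q >= qt]   # drop low-quality bases
        n_var = sum(b != ref_base for b, _ in kept)
        if not kept or n_var == 0:
            continue
        eps = 10 ** (-min(q for _, q in kept) / 10)    # worst remaining base
        p = binomial_tail(len(kept), n_var, eps)
        score = -10 * math.log10(p) if p > 0 else float('inf')
        if best is None or score > best[0]:
            best = (score, qt)
    return best    # (score, Q_t); report a variant if score > 50
```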


6.7 SNP Effect Analysis

SNP Effect Analysis investigates a list of SNPs and generates a report indicating the effect that these SNPs have on the genes in a given annotation.

A transcript model has physical units called exons, separated by introns. The transcript also has logical units called the $ 5^{'}$ UTR, the coding region (Cd), and the $ 3^{'}$ UTR. The final protein is created from the coding region alone, with no contribution from the UTRs (untranslated regions). The first base of the $ 5^{'}$ UTR is called the Transcription Start Site (TSS). The last base of the $ 3^{'}$ UTR is called the Transcription End Site (TES).

Figure 6.10: Transcript structure
The first two and last two bases of each intron are very important for splicing. These sites are called Donor and Acceptor sites. Fig. 6.10 shows a transcript structure with the important components.

The coding region is a sequence of codons, each of which is 3 nucleotides long; the length of the coding region is therefore a multiple of 3. The first codon in the coding region is called the start codon and the last is called the stop codon. Each codon is translated into an amino acid.

Classifying mutations according to annotation
Mutations could occur anywhere in the chromosome. As their functional impact depends on how they affect protein formation, they may be classified accordingly.

  • INTERGENIC Mutation: A mutation that does not fall within the neighborhood of any gene in the annotation. (A neighborhood is defined based on a chosen value of $ \delta$.)

  • UPSTREAM Mutation: Mutation occurring upstream of the transcript, i.e., within co-ordinates $ [TSS-\delta,TSS]$ (for the +ve strand) or $ [TSS, TSS+ \delta]$ (for the negative strand).
  • DOWNSTREAM Mutation: Mutation occurring downstream of the transcript, i.e., within genomic co-ordinates $ [TES,TES+\delta]$ (for the +ve strand) or $ [TES-\delta,TES]$ (for the negative strand).
  • INTRONIC Mutation: Mutations occurring in intronic regions.
    • ESSENTIAL SPLICE SITE Mutation: Mutations to the donor and acceptor sites of the intron (i.e. two bases into the intron from either side)
    • SPLICE SITE Mutation: Mutations to locations of splicing signals (i.e. 3-8 bases into the intron from either side, 1-3 bases into neighbouring exon).
  • 5 PRIME UTR Mutation: Mutation in the $ 5^{'}$ region.
  • 3 PRIME UTR Mutation: Mutation in the $ 3^{'}$ region.
  • Coding Region Mutation: Mutations in the coding region are of two types.
    • SYNONYMOUS CODING Mutations: Mutations that have no effect on the final amino acid sequence (e.g., a change of the TTT codon to TTC still results in the Phenylalanine (F) amino acid). These are called synonymous changes.
    • Mutations that are non-synonymous have an effect on the amino acid sequence and are classified accordingly.
Classifying non-synonymous mutations
Fig. 6.11 shows a sequence from the coding region of a transcript and its comparison against a reference. The sequence has been split into codons for ease of explanation. Above the reference sequence codons are written the corresponding amino acids from the codon translation table. Similarly amino acids corresponding to the sample sequence codons are written directly below it.

Figure 6.11: Mutations in a coding region

The SNP detection algorithm correctly identifies SNPs at positions 6, 9 and 12, all involving a change from C to G. One can see that each of the C $ \rightarrow$ G SNPs has a different effect on the amino acid sequence. The first change, at position 6, causes no change in the amino acid sequence and so is termed SYNONYMOUS CODING. The second mutation is termed NON SYNONYMOUS CODING, since the amino acid changes from H to Q. The final change is termed STOP GAINED, since a stop codon has been introduced before the original stop codon in the reference.

Similarly, a change that mutated the reference stop/start codon would be classified as STOP LOST/START LOST.
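The classification of single-codon changes can be sketched as follows; the codon table shown is deliberately partial and covers only the codons used in the examples.

```python
# Sketch: classify a single-codon change as SYNONYMOUS CODING,
# NON SYNONYMOUS CODING, STOP GAINED, or STOP LOST.
# The codon table below is partial (illustration only); a full
# standard codon table would be used in practice.

CODON_TABLE = {
    'TTT': 'F', 'TTC': 'F',              # Phenylalanine
    'CAT': 'H', 'CAC': 'H',              # Histidine
    'CAA': 'Q', 'CAG': 'Q',              # Glutamine
    'TAC': 'Y',                          # Tyrosine
    'TAA': '*', 'TAG': '*', 'TGA': '*',  # stop codons
}

def classify_codon_change(ref_codon, alt_codon):
    ref_aa, alt_aa = CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        return 'SYNONYMOUS CODING'
    if alt_aa == '*':
        return 'STOP GAINED'
    if ref_aa == '*':
        return 'STOP LOST'
    return 'NON SYNONYMOUS CODING'

print(classify_codon_change('TTT', 'TTC'))  # SYNONYMOUS CODING
print(classify_codon_change('CAC', 'CAG'))  # NON SYNONYMOUS CODING (H -> Q)
print(classify_codon_change('TAC', 'TAG'))  # STOP GAINED
```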


6.7.1 Challenges in SNP Effect Analysis

In this section we discuss a few of the challenges that present themselves during SNP Effect Analysis.
  • Sequential processing of SNP reports: Effect Analysis typically processes SNPs sequentially, one by one. As long as the SNPs are single base pair mutations that do not occur too close to each other, the effect of each SNP can be individually considered. However, insertions, deletions, and multiple changes within a codon complicate the analysis.
    • Multiple SNPs close to each other: This can cause erroneous classification of mutations. For example, two SNPs in a codon taken together may result in a NON SYNONYMOUS CODING classification. However, when processed one at a time, each SNP may be considered as being a SYNONYMOUS CODING mutation.
    • Indels: Any indel whose length is not a multiple of 3 is marked FRAMESHIFT CODING, as it causes a cascading change in the amino acid sequence. Indels that are multiples of 3 are considered NON SYNONYMOUS CODING mutations. However, if such an indel appears as two separate entries in the SNP report, sequential processing could cause it to be erroneously reported as a FRAMESHIFT CODING mutation.
    • Several types of variation occurring in a single MNP: If an MNP simultaneously exhibits more than one type of mutation, for example, an insertion and two SNPs, effect analysis becomes complicated. In this case, the effect is simply termed COMPLEX VARIATION.
    • Boundary region issues: If the locations involved in an MNP overlap boundary regions, so that part of the MNP would be classified in one way and part in another, the effect is again called COMPLEX VARIATION.


Table 6.3: Mutation classification based on region
Mutation Location Classification
Coding region START LOST
  STOP GAINED
  STOP LOST
  FRAMESHIFT CODING
  NON SYNONYMOUS CODING
  SYNONYMOUS CODING
  SPLICE SITE
Intronic ESSENTIAL SPLICE SITE
  INTRONIC
  SPLICE SITE
Genic regions 5PRIME UTR
  3PRIME UTR
  SPLICE SITE
Non-genic regions UPSTREAM
  DOWNSTREAM
  INTERGENIC
Miscellaneous EXONIC (Coding region information absent)
  NEAR GENE (Strand information for a gene absent)
  GENIC (Transcript information for a gene absent)
  COMPLEX VARIATION (MNP overlaps a boundary region OR two or more variations occur concurrently in an MNP)


  • Ploidy: Further complications arise from the fact that a single chromosome is not being sequenced. It is therefore possible to have two SNPs, one from each copy of the chromosome. Considering the combined effect of the two SNPs would be erroneous in this case.
  • Missing information: If transcript annotations do not specify the coding region, the $ 5^{'}$ UTR, $ 3^{'}$ UTR, and Cds are not known. The only calls that can be made are EXONIC (if the variant falls in an exon), INTRONIC, ESSENTIAL SPLICE SITE, SPLICE SITE, UPSTREAM, DOWNSTREAM, and INTERGENIC.

    If strand information is not available for a particular gene, then UPSTREAM/DOWNSTREAM calls also cannot be made. Instead, these calls are termed NEAR GENE mutations.

    In case a gene is identified by the annotation but all transcript information for the gene is missing, intronic and exonic regions are not known. In this case, the call is GENIC.

Table 6.3 gives a list of mutation classifications based on the location of the mutation.
