Analysis of the Computational Intelligence Techniques for RNA Sequence Data Analysis


University:
Faculty:
Department:
Course:
Author:
Tutor:
Year:

A dissertation submitted in partial fulfillment of the requirements for the BSc (Hons) at the Faculty of Science and Technology, Deakin University

2016
Declaration
I certify that the attached work is entirely my own (or where submitted to meet the requirements of an approved group assignment is the work of the group), except where work quoted or paraphrased is acknowledged in the text. I also certify that it has not been submitted for assessment in any other unit or course.
I agree that Deakin University may make and retain copies of this work for the purposes of marking and review, and may submit this work to an external plagiarism-detection service who may retain a copy for future plagiarism detection but will not release it or use it for any other purpose.

Name: ………………………………………
Reg. No.: ……………………………………
Signature: ………………………………….
Date: ……………………………………….

Table of contents

Declaration
Table of contents
1.0 Introduction
1.1 Background of the Study
1.2 Statement of the problem
1.3 Objectives
2.0 Literature review
2.1 Computational methods
2.2 Computational intelligence techniques for RNA sequence data analysis
3.0 Methodology
3.1 Research design and data collection
3.2 Parameter choice
3.3 Data sets
4.0 Results and discussions
4.1 Discrimination between DE and non-DE genes
References

1.0 Introduction
1.1 Background of the Study
Computational intelligence techniques are algorithm-driven processes used to analyze patterns in ribonucleic acid (RNA) sequence data. Artificial intelligence is applied so that these algorithms can analyze patterns in RNA sequence data and produce predictive results. Many computational intelligence techniques are available for the analysis of RNA sequencing data. The algorithms are structured to support the application of heuristics by identifying patterns in a step-by-step formulation: classification, grouping and clustering, narrowing the results using a defined similarity score, and finally deriving the results.
RNA sequencing is a difficult task that involves complex functional formulation and stepwise analysis. The accuracy of the analysis depends on several factors, beginning with three steps: extraction of the RNA, fragmentation of the extracted RNA, and sequencing. These three steps are essentially non-computational and are carried out one after another to produce the sequenced RNA, which the algorithms then analyze. Computational intelligence is applied to the sequenced RNA patterns to ensure that they meet the required quality. Further analyses are then carried out to obtain the final result.
Computational biology, on the other hand, is an interdisciplinary area devoted to interpreting and analyzing biological information through computational techniques. It draws on extensive research in biology, computer science, mathematics and statistics. This combined approach is applied step by step in the form of computer algorithms that sequence biological data, arrange genome content and predict the structure of macromolecules such as RNA. New techniques and tools emerge regularly in the application of computational intelligence to biological data analysis. Advances in biological data collection have given algorithm designers both the opportunity and the challenge of analyzing high-volume, complex data. Traditional computing methods are limited in functional scope when faced with such complex, large, multi-dimensional and noisy data, and as a result they cannot provide accurate results. Traditional algorithms and methods are also manual and time-consuming.
This dissertation focuses on the challenges faced by the traditional methods and experiments used to analyze the behavior of about ten thousand genes under a given condition. The research aims to answer the following questions: How can computational intelligence techniques or tools be used in biology and science? Which computational techniques are used in RNA sequence data analysis? How do computational intelligence techniques benefit the RNA sequencing approach? What types of tasks do computational intelligence techniques or tools perform in the process of RNA sequencing?
1.2 Statement of the problem
Over the years, various traditional methods have been used to analyze RNA sequencing data. However, these methods have proved insufficient, time-consuming, tedious and inaccurate. This dissertation reports on the computational intelligence techniques used to reduce the complexity of the tasks involved in RNA sequencing analysis, and focuses on how these techniques work.
1.3 Objectives
The main objectives of this research project are:
i. To identify the various computational or artificial intelligence technologies that can be used in biology or in RNA sequencing data analysis.
ii. To analyze the different computational intelligence techniques or tools used in the process of RNA sequencing.
iii. To understand the effect of artificial or computational intelligence techniques on the process of RNA sequencing.
iv. To gather information about the advantages of using computational techniques in RNA sequencing.
v. To compare the various computational intelligence techniques for RNA sequence data analysis.

2.0 Literature review
2.1 Computational methods
Computational intelligence mechanisms are automated techniques that combine elements of learning, adaptation, evolution, logic and predicates to devise analysis procedures. Complex problems are formulated as algorithmic steps, and biological data is taken directly from input sources such as sensors and other input systems. These techniques offer flexible information processing and can handle large volumes of real-life data containing noise, ambiguity and missing values. Problem solving in bioinformatics generally involves searching for useful regularities or patterns in huge amounts of multi-dimensional data. This is the reason behind the development of advanced pattern analysis approaches, as traditional methods often become intractable in such situations.
2.2 Computational intelligence techniques for RNA sequence data analysis
According to Shanrong et al. (2016, p. 15), next-generation sequencing decreases the cost of sequencing, but it also raises the problem of handling the massive amount of data generated by large-scale RNA sequencing. They devised a technique that uses multiple computational algorithms, and the tools that run the algorithms in sequence are automated. The tools associated with the algorithms are open source, which makes the technique easier for others to extend (Shanrong et al. 2016, p. 15). RNA-seq data analysis combined with advanced Web 2.0 technologies improves the functional efficiency of the technique. The implemented tool is QuickRNASeq, a pipeline for large-scale RNA sequence data analysis and visualization. The tool analyzes RNA sequences in three steps. In the first step, each individual sample is processed in a computation-intensive fashion. In the second step, the results of the individual samples are brought together and a report is generated. In the third step, the final RNA sequence analysis results are interpreted and presented (Shanrong et al. 2016, p. 15).
Wagle et al. (2015, p. 8) proposed QuickNGS, a computational intelligence tool for next-generation sequencing (NGS) in molecular biology. The tool applies multiple algorithmic approaches to basic data analysis and can analyze data from multiple NGS projects at the same time. It uses parallel computing resources together with a back-end database. A comprehensive analysis of 10 RNA sequence samples was completed within a few minutes. The tool can also take a large number of samples from multiple projects at the same time and analyze them to produce the RNA sequence report (Wagle et al. 2015, p. 8).
Sun et al. (2016, p. 11) proposed a computational intelligence software tool called RED (RNA Editing Site Detector), which identifies RNA editing sites by integrating multiple rule-based and statistical filters. RNA editing sites are visualized at the genome level and at the site level in a graphical user interface (GUI) display window (Sun et al. 2016, p. 11). The tool improves performance by integrating the MySQL database engine for high database throughput and query processing. It can identify the presence or absence of experimentally validated C-to-U RNA editing sites, and it is compared with REDItools, a command-line tool for high-throughput investigation of RNA editing. RED also offers better sensitivity, is easy to use, is platform-independent Java software, and can be applied to RNA sequence data without DNA sequence data (Sun et al. 2016, p. 11). Another RNA sequence tool, 4SALE, was proposed by Wolf et al. (2014, p. 145). 4SALE is a tool for synchronous sequence and secondary structure alignment and editing. It enables RNA sequences and their individual secondary structures to be aligned synchronously and automatically. The authors later introduced a scaled-down GUI version of 4SALE for big data analysis. The tool is widely accepted for phylogenetic information discovery.
The TreeSeq RNA sequence tool was developed by Wintermans et al. (2015, p. 11). It uses a quaternary tree search structure for the analysis of biological sequence data. Its main strength is a rapid search for sequences of interest across large data sets. The tool was used to screen a gut microbiota metagenomic dataset and a whole genome sequencing (WGS) dataset of a strain of Klebsiella pneumoniae for antibiotic resistance. It is reported to be thirty times faster than BLAST while giving accurate results (Wintermans et al. 2015, p. 11).
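The idea of a quaternary tree search structure can be illustrated with a small four-way trie, in which each node has at most one child per nucleotide. This is a minimal sketch under stated assumptions and is not TreeSeq's actual implementation: the class and method names are hypothetical, and RNA input is handled by mapping U to T.

```python
class QuaternaryTrieNode:
    """One node of a 4-ary tree; each child corresponds to one nucleotide."""
    __slots__ = ("children", "is_end")

    def __init__(self):
        self.children = {}   # keys are limited to "A", "C", "G", "T"
        self.is_end = False  # True if a stored sequence ends at this node


class QuaternaryTrie:
    """Hypothetical sequence index with exact-match lookup."""
    ALPHABET = frozenset("ACGT")

    def __init__(self):
        self.root = QuaternaryTrieNode()

    @staticmethod
    def _normalize(sequence):
        # Assumption: RNA and DNA are indexed together by treating U as T.
        return sequence.upper().replace("U", "T")

    def insert(self, sequence):
        node = self.root
        for base in self._normalize(sequence):
            if base not in self.ALPHABET:
                raise ValueError(f"unsupported base: {base!r}")
            node = node.children.setdefault(base, QuaternaryTrieNode())
        node.is_end = True

    def contains(self, sequence):
        node = self.root
        for base in self._normalize(sequence):
            node = node.children.get(base)
            if node is None:
                return False
        return node.is_end
```

A lookup costs one step per base in the query, independent of how many sequences are stored, which is the property that makes tree-based search attractive for large data sets.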
Zyprych-Walczak et al. (2015, p. 5) proposed a technique for high-throughput RNA sequencing that employs statistical and computational methods to handle the analysis and management of biological sequence data. They provide a comprehensive comparison of five normalization methods related to sequencing depth and suggest a common workflow that can be applied to select the optimal normalization procedure for any type of dataset (Zyprych-Walczak et al. 2015, p. 5). Statistically, the computational algorithms in this technique calculate the bias and variance values for control genes. This identifies the normalization method best suited to the studied data set and finally determines which methods can be used interchangeably.
An R package known as miRComb has been developed by Vila-Casadesús et al. (2016, p. 17). It combines miRNA expression data with hybridization information to find potential miRNA-mRNA interactions. A pipeline is constructed to produce the main output, and the output can be used to generate a large number of testable hypotheses by other authors in this domain. The computational steps first filter the large number of miRNA-mRNA interactions obtained from existing miRNA target prediction databases and then present the results in a standardized form such as a PDF report. Yejun et al. (2015, p. 14) proposed a method combined with empirical tests to determine transcript features associated with transcriptional start sites (TSSs), transcriptional termination sites (TTSs) and operon organization. They obtained 2764 TSSs and 1467 TTSs for 1331 and 844 different genes respectively. Their results show that directional RNA sequencing can be used to detect transcriptional borders at an acceptable resolution. The computational algorithms in their method are based on transcript border detection, statistical models and an operon organization pipeline. The technique can be applied to study TSSs, TTSs, operons, promoters and untranslated regions in other bacteria (Yejun et al. 2015, p. 14).

Boley et al. (2014, p. 342) described an automated pipeline for genome annotation that integrates RNA sequence and gene boundary data sets. The tool is called the Generalized RNA Integration Tool (GRIT). It analyzes the gene expression and site sequence data collected. The annotation method is optimized by pipelining the steps of the analysis, and the report is produced automatically once the analysis is complete (Boley et al. 2014, p. 342). Irla et al. (2015, p. 22) applied a technique that uses two different cDNA library preparation methods. One method characterizes the whole transcriptome and the other enriches primary transcript 5' ends. The computational algorithms work from two different baselines: one estimates the whole transcriptome and the other follows primary transcripts only. The exact TSS positions were used to determine the conserved sequence motifs for translation start sites. Finally, the analysis yields the operon structure (Irla et al. 2015, p. 22).
Sturgill et al. (2013, p. 32) proposed a technique that applies a series of quantitative and qualitative filters through computer algorithms. Diagnosed errors are eliminated, and the RNA sequence data are then applied to simulation. The method is commonly used on RNA sequence data to identify known alternative splicing events (Sturgill et al. 2013, p. 32). The software package based on their method is called the Splicing Analysis Kit (Spanki). The package is readily available for download from various web portals. Its main advantage is that it helps to better understand the error profiles in RNA sequence data and so improves inference from this new technology (Sturgill et al. 2013, p. 32). An et al. (2014, p. 275) developed a user-friendly plant miRNA tool called miRPlant, which was evaluated on 16 plant miRNA datasets from four different plant species and gave results 10 percent more accurate than miRDeep-P, one of the most popular plant miRNA prediction tools. miRPlant provides a graphical user interface for data input and output, which gives users a more interactive experience. The visual output is also clear, with color-based display of the sequence pattern data.
Trapnell et al. (2009, p. 1105) describe RNA-seq, a protocol for sequencing the messenger RNA in a cell that generates millions of short sequence fragments in a single run. These fragments are used to measure levels of gene expression and to identify novel splice variants of genes. Existing software for aligning RNA-seq data to a genome relied on known splice junctions and could not identify novel ones. Their read-mapping algorithm, TopHat, is designed to align reads from an RNA-seq experiment to a reference genome without relying on known splice sites (Trapnell et al. 2009, p. 1105). The pipeline is much faster than previous systems for RNA sequence data analysis, and a standard desktop computer can be used to run it. The software is freely available and open source. Goecks et al. (2010, p. 86) proposed a web-based software platform for genomic research. The platform automatically tracks and manages data provenance and provides support for capturing the context and intent of computational methods. Its web pages are interactive and hold supporting documentation (Goecks et al. 2010, p. 86). These documents also describe the complete computational analysis process carried out on the platform.
According to Langmead et al. (2009, p. 25), an ultrafast, memory-efficient program can be used to align short DNA sequence reads to large genomes. The tool aligns about 25 million reads per CPU hour with a memory footprint of approximately 1.3 gigabytes. The program uses a quality-aware backtracking algorithm that permits mismatches. Multiple processor cores can be used simultaneously to achieve greater alignment speed. The program is also open source. Bullard et al. (2010, p. 56) studied sequencing technologies such as the Illumina Genome Analyzer, which are used to investigate a wide range of biological and medical problems. Their work integrates statistical and computational approaches to draw meaningful and accurate conclusions from the massive and complex datasets generated by the sequencers. The test strategies begin with gene counts (Bullard et al. 2010, p. 56). The results are affected by features of the sequencing platform such as varying gene length, the base-calling calibration method and flow preparation. They also propose quantile-based normalization procedures and demonstrate an improvement in detection. Because of the lack of characterization of genes for RNA sequencing, they suggest further research to advance their approach (Bullard et al. 2010, p. 56).
Maji et al. (2014, p. 26) found that TopHat v2.0.8 gives accurate results and performs the computational analysis quickly. They examined CPU usage, memory footprint and execution time during spliced alignment. Their design, PVT, a pipelined version of TopHat, removes redundant computational steps during spliced alignment and then breaks the job into a pipeline with multiple stages to improve resource utilization. As a result, the tool reduces execution and processing time while maintaining functional efficiency and accuracy. Torres et al. (2014, p. 2204) developed PRADA (a pipeline for RNA-sequencing data analysis), a tool that is flexible, modular and highly scalable. It is a software platform that provides several types of information through multi-faceted analysis starting from raw paired-end RNA-seq data, including gene expression levels, quality metrics, and unsupervised and supervised detection of fusion transcripts. The algorithms implemented in PRADA use a dual-mapping strategy that increases sensitivity and refines the analytical endpoints.
According to Bacci et al. (2014, p. 45), StreamingTrim, an RNA read trimming tool written in Java, is an appropriate tool for this task. It can analyze the quality of RNA sequences in FASTQ files and search for low-quality zones in a very conservative way. The tool was developed to trim amplicon library data while retaining as much taxonomic information as possible (Bacci et al. 2014, p. 45). It is equipped with a graphical user interface that makes it user friendly (Bacci et al. 2014, p. 45). StreamingTrim reads and analyzes sequences one by one from the input FASTQ file without keeping anything in memory. It can run on desktop and laptop computers. The trimmed sequences are written to an output file, which makes the results easier to reuse later (Bacci et al. 2014, p. 45).
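The one-record-at-a-time processing described above can be sketched with a Python generator. This is an illustration only and not StreamingTrim's algorithm: the function name, the Phred+33 quality encoding, the threshold of 20 and the rule of cutting at the first low-quality base are all assumptions made for the example.

```python
import io


def trim_fastq_stream(handle, min_quality=20, offset=33):
    """Yield (header, sequence, quality) trimmed at the first low-quality base.

    Records are read four lines at a time and nothing is accumulated,
    so memory use does not grow with the size of the input file.
    """
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # end of input
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        quality = handle.readline().rstrip("\n")

        cut = len(sequence)
        for i, char in enumerate(quality):
            if ord(char) - offset < min_quality:
                cut = i
                break
        if cut > 0:  # drop reads with no usable bases
            yield header, sequence[:cut], quality[:cut]


if __name__ == "__main__":
    demo = "@read1\nACGTAC\n+\nIIII#I\n"
    for record in trim_fastq_stream(io.StringIO(demo)):
        print(record)
```

Because the function is a generator, the trimmed records can be written to an output file as they are produced, which mirrors the streaming design described in the text.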
Sigurgeirsson et al. (2014, p. 631) proposed a way to compare tools for RNA sequence analysis. They defined criteria such as automated execution, accuracy and computational time for the various software tools. Illumina HiSeq 2000, TopHat2, STAR, RiboMinus and RiboZero are compared on the properties of their computational and functional procedures. Their study covers the sequence analysis procedures, input and output procedures, accuracy, time consumption, stranded and non-stranded criteria during run optimization, quality metrics of the results, availability and source code characteristics. Ujjawal et al. (2010, p. 23) elaborate on recent advances in the integration of computational intelligence and pattern analysis techniques. The techniques are either independent or hybrid, mixing domains such as computing, mathematics, statistics and probability to analyze biological sequences such as RNA. Biological data such as sequence, structure and microarray data are typically complex and therefore require advanced methods. The algorithms and the methods that use them are defined in terms of their inputs and outputs (Ujjawal et al. 2010, p. 23).
Finally, Stefan et al. (2012, p. 76) reviewed the computational methods and techniques used with noncoding RNAs. The computational intelligence employed ranges from basic to advanced techniques for predicting RNA structure, annotating noncoding RNAs in genomic data, mining RNA sequence data for novel transcripts, and predicting transcript structure by searching and matching against database resources.

3.0 Methodology
3.1 Research design and data collection
This work uses a descriptive research design and a survey technique to collect data and examine the research questions and the problem. The analysis was carried out using different RNA-seq techniques. The results were then used to test the accuracy of the data under each technique and to determine the relationship between technique and accuracy. The validity and reliability of the data were ensured during collection. The data was collected using secondary data collection methods, that is, from existing sources. Questionnaires were administered to five biological RNA research firms in the United States, in which respondents were asked to provide data on the RNA-seq technique preferred in their firm. In each firm, five respondents were selected at random and given the questionnaire. Some of the techniques for which respondents provided data are listed below. Of the data provided, 20% was used for RNA sequencing analysis in this work. The questionnaires mentioned various computational intelligence techniques and data as having been used in various tasks in the analysis of RNA sequences. The computational intelligence methods used in this work are:
· Bagging support vector machine: helps to classify RNA-seq data. It uses the bootstrap technique for this purpose.
· DESeq2: a package used for the analysis of RNA-seq data. It detects genes that show different behavior, using various models or methods for the analysis.
· MLSeq: a package that uses techniques such as normalization for the analysis of RNA-seq data. It provides two normalization methods and several algorithms for analyzing RNA-seq data.
· Transcriptator: an automated computational pipeline that is also used in the RNA-seq approach.
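The bootstrap step behind bagging, mentioned in the list above, can be sketched as follows. This is a minimal illustration and not the implementation used in this work: a nearest-centroid classifier stands in for the support vector machine so that the example needs only the standard library, and all function names are hypothetical.

```python
import random
from collections import Counter


def train_centroid(samples, labels):
    """Stand-in base learner: store the mean feature vector of each class."""
    sums, counts = {}, Counter(labels)
    for x, y in zip(samples, labels):
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, value in enumerate(x):
            acc[i] += value
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}


def predict_centroid(model, x):
    """Return the class whose centroid is closest in squared distance."""
    return min(model, key=lambda y: sum((a - b) ** 2 for a, b in zip(model[y], x)))


def bagging_fit(samples, labels, n_models=11, seed=0):
    """Train one base learner per bootstrap resample (drawn with replacement)."""
    rng = random.Random(seed)
    n = len(samples)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]
        models.append(train_centroid([samples[i] for i in idx],
                                     [labels[i] for i in idx]))
    return models


def bagging_predict(models, x):
    """Aggregate the base learners by majority vote."""
    votes = Counter(predict_centroid(m, x) for m in models)
    return votes.most_common(1)[0][0]
```

Replacing `train_centroid` and `predict_centroid` with an SVM from a machine learning library would give a bagging SVM in the sense of the list item; the resampling and voting logic would stay the same.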
This section gives brief overviews of the differential expression analysis methods that were evaluated and compared. The methods take as their starting point a count matrix containing the number of reads that map to each gene in each sample in the experiment. Most of the methods work directly on the count data, whereas two methods transform the counts and feed the transformed values into the R package limma, originally developed for differential expression analysis of microarray data. The methods that work directly on the counts can be broadly divided into parametric methods (baySeq, EBSeq, ShrinkSeq, edgeR, DESeq, NBPSeq and TSPM) and non-parametric methods (NOISeq and SAMseq) (Bacci et al. 2014, p. 45). The two-stage Poisson model (TSPM) is based on a Poisson model for the counts, extended via a quasi-likelihood approach to allow for overdispersion. The first step is therefore to test each gene separately for evidence of overdispersion, which decides which model to use for the differential expression analysis. The tests are based on asymptotic statistics, which implies that the total count for each gene across the samples must not be too small. It is therefore recommended that genes with a total count of less than 10 be removed from the analysis. For the analysis to work well, it is also important that there are some genes for which there is no overdispersion. The other parametric methods (baySeq, DESeq, edgeR, NBPSeq and EBSeq) instead use a negative binomial (NB) model to account for overdispersion, while ShrinkSeq allows the user to choose among different distributions, including the NB and a zero-inflated NB distribution. DESeq, edgeR and NBPSeq take a classical hypothesis testing approach, whereas baySeq, EBSeq and ShrinkSeq are cast within a Bayesian framework.
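The filtering rule and the notion of overdispersion described above can be made concrete with a short sketch. It assumes the count matrix is held as a dictionary from gene name to a list of per-sample counts, and the function names are hypothetical. The overdispersion check is a crude variance-versus-mean comparison (under a Poisson model the two are equal), not the formal test used by TSPM.

```python
from statistics import mean, variance


def filter_low_count_genes(counts, min_total=10):
    """Keep genes whose total count across all samples is at least min_total."""
    return {gene: row for gene, row in counts.items() if sum(row) >= min_total}


def looks_overdispersed(row):
    """Crude check: flag a gene when its sample variance exceeds its mean."""
    return variance(row) > mean(row)


if __name__ == "__main__":
    counts = {
        "geneA": [0, 1, 2, 1],      # total 4, removed by the filter
        "geneB": [10, 30, 5, 55],   # variance far above the mean
        "geneC": [5, 5, 5, 5],      # no variation at all
    }
    kept = filter_low_count_genes(counts)
    for gene, row in kept.items():
        print(gene, looks_overdispersed(row))
```

Under the negative binomial model used by the other parametric methods, the variance of a count with mean μ is μ + φμ², where φ is the dispersion, so φ = 0 recovers the Poisson case.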
The crucial part of the inference methods is obtaining a reliable estimate of the dispersion parameter for each gene, and considerable effort has therefore gone into this estimation. Because of the small sample sizes in most RNA-seq experiments, it is difficult to estimate the gene-wise dispersion parameters reliably. This motivates information sharing across the genes in the data set. DESeq, edgeR and NBPSeq all incorporate information sharing in the dispersion estimation, and the way in which this sharing is done accounts for the main differences between the three methods (Bacci et al. 2014, p. 45).
In edgeR, the first idea was to assume that all genes have the same dispersion parameter, which is estimated from the data using a conditional maximum likelihood approach. A common dispersion for all genes is too restrictive an assumption, however, so the procedure was developed further to allow gene-wise dispersion estimates, in which the individual estimates are squeezed towards the common one using a weighted likelihood approach. In contrast, DESeq and NBPSeq obtain a dispersion estimate by modeling the observed mean-variance relationship for the genes in the data set using either local regression or parametric regression. After obtaining the fitted values, DESeq takes a conservative approach in which each gene is assigned the larger of the fitted value and its individual dispersion estimate (Sigurgeirsson et al. 2014, p. 631).
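The two ways of combining gene-wise and shared dispersion information described above can be sketched in a few lines. This is a simplified illustration with hypothetical function names: the gene-wise estimate uses a method-of-moments formula rather than conditional maximum likelihood, and the squeezing step is a plain weighted average standing in for the weighted likelihood approach.

```python
from statistics import mean, variance


def moment_dispersion(row):
    """Method-of-moments NB dispersion: (variance - mean) / mean^2, floored at 0."""
    m = mean(row)
    if m == 0:
        return 0.0
    return max((variance(row) - m) / (m * m), 0.0)


def conservative_dispersion(gene_wise, fitted):
    """Conservative choice described for DESeq: the larger of the two values."""
    return max(gene_wise, fitted)


def squeezed_dispersion(gene_wise, common, weight=0.5):
    """Simplified stand-in for shrinkage: pull the gene-wise value toward the common one."""
    return weight * common + (1.0 - weight) * gene_wise


if __name__ == "__main__":
    phi = moment_dispersion([10, 30, 5, 55])
    print(round(phi, 4))
    print(conservative_dispersion(phi, fitted=0.3))
    print(squeezed_dispersion(phi, common=0.3))
```

The formula in `moment_dispersion` follows from solving the NB variance relation μ + φμ² for φ; the `weight` parameter is an assumption of this sketch and controls how strongly an individual estimate is pulled toward the shared value.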
NBPSeq does not take the same conservative approach as DESeq and instead uses the fitted dispersion values only. After obtaining estimates of the mean and dispersion parameters for the genes, edgeR, DESeq and NBPSeq test for significant differential expression using either a variant of an exact test or a generalized linear model, which allows for more complex experimental designs. baySeq and EBSeq are similar to these three methods in their use of the NB model, but they differ in the inference procedure. In baySeq, for example, the user defines a collection of models, each of which is essentially a partitioning of the samples into groups. Samples in the same group are assumed to share the same parameters of the underlying distribution (Sigurgeirsson et al. 2014, p. 631). Within a Bayesian framework, baySeq estimates the probability of each model for each gene in the data set. Information from the whole set of genes is used to form an empirical prior distribution for the parameters, which is shared between all the genes and estimated from the data. ShrinkSeq supports several count models, including the NB and a zero-inflated NB. It provides shrinkage of the dispersion parameter and also of other parameters, including the regression coefficients that are of interest for inference. It also incorporates a step for refining the priors and the posteriors non-parametrically after the model has been fitted for each feature.
The two non-parametric methods evaluated here, NOISeq and SAMseq, do not assume any particular distribution for the data. SAMseq is based on the Wilcoxon statistic, averaged over several resamplings of the data, and uses a sample permutation strategy to estimate a false discovery rate for different cutoff values of this statistic. These estimates are then used to define a q-value for each gene. NOISeq explores the distribution of fold changes and expression differences between the two contrasted conditions in the observed data, and compares this distribution with the corresponding distribution obtained by comparing pairs of samples belonging to the same condition.
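The combination of a rank-based statistic with sample permutation can be illustrated for a single gene. The sketch below is not SAMseq's procedure, which resamples the counts and estimates a false discovery rate across all genes; it only shows the Wilcoxon rank-sum statistic and a permutation p-value. The function names, the number of permutations and the seed are assumptions.

```python
import random


def rank_sum(group_a, group_b):
    """Wilcoxon rank-sum statistic of group_a, using mid-ranks for ties."""
    pooled = sorted(list(group_a) + list(group_b))
    ranks = {}
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j] == pooled[i]:
            j += 1
        ranks[pooled[i]] = (i + 1 + j) / 2.0  # average of ranks i+1 .. j
        i = j
    return sum(ranks[value] for value in group_a)


def permutation_p_value(group_a, group_b, n_permutations=2000, seed=0):
    """Two-sided p-value obtained by shuffling the sample labels."""
    rng = random.Random(seed)
    n_a = len(group_a)
    pooled = list(group_a) + list(group_b)
    expected = n_a * (len(pooled) + 1) / 2.0  # rank sum under no difference
    observed = abs(rank_sum(group_a, group_b) - expected)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        stat = abs(rank_sum(pooled[:n_a], pooled[n_a:]) - expected)
        if stat >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_permutations + 1)
```

Because the statistic depends only on ranks, no distribution is assumed for the counts, which is the defining property of the non-parametric methods discussed above.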
We have focused on two-group comparisons only, because this is the most common situation in practice. However, most of the evaluated methods also support more complex experimental designs. Several of the methods used in this study, including edgeR, DESeq, NBPSeq and TSPM, achieve this through a generalized linear model framework in which the user can specify the desired contrasts to test (Sigurgeirsson et al. 2014, p. 631).
3.2 Parameter choice
Most of the techniques compared in this work allow the user to select the values of some parameters that can affect the results. In this work, we have used the default values provided in the implementations. We have also provided a comparison of the performance of different choices of these values. This section therefore summarizes the parameter values used in the evaluation.
For edgeR, this work used the TMM method to calculate normalization factors between samples. Tagwise dispersion estimates were squeezed towards a trended estimate computed with the moving average approach. An exact test was performed to find genes that were differentially expressed between the two conditions. For DESeq, a pooled estimate of the dispersion parameter was computed for each gene. Local regression was used to fit the mean-variance relationship, and the conservative approach of selecting the larger of the fitted value and the individual estimate for each gene was applied. The exact test provided by the implementation was used to find the DE genes. The local regression approach was also employed in the variance-stabilizing transformation provided by the DESeq package (Ujjawal et al. 2010, p. 23).
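The idea behind TMM (trimmed mean of M-values) normalization can be sketched as follows. This is a deliberately simplified version and not the edgeR implementation: it trims only on the log-ratios, uses an unweighted mean and takes the reference sample as given. The function name and the trimming fraction are assumptions.

```python
import math


def simple_tmm_factor(sample, reference, trim=0.3):
    """Scaling factor from an unweighted trimmed mean of log2 expression ratios."""
    n_s, n_r = sum(sample), sum(reference)
    # M-values: log2 ratio of library-size-scaled counts, for genes seen in both.
    log_ratios = sorted(
        math.log2((s / n_s) / (r / n_r))
        for s, r in zip(sample, reference)
        if s > 0 and r > 0
    )
    if not log_ratios:
        return 1.0
    k = int(len(log_ratios) * trim)
    kept = log_ratios[k:len(log_ratios) - k] or log_ratios
    return 2 ** (sum(kept) / len(kept))


if __name__ == "__main__":
    reference = [10, 10, 10, 10, 10]
    sample = [10, 10, 10, 10, 1000]  # one gene dominates the library
    print(simple_tmm_factor(sample, reference))
```

Trimming the most extreme log-ratios is what keeps a handful of very highly expressed genes from dominating the estimate, on the assumption that most genes are not differentially expressed.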
For the TSPM, baySeq, voom and NBPSeq, the TMM technique was used to compute normalization factors. In the case of NOISeq, the counts were normalized using the TMM method before the data was fed into the computational differential expression analysis. The NBPSeq the NBP parametrization of the negative binomial distribution was conducted. For the baySeq, a negative binomial distribution was assumed and used the quasi-likelihood approach in the estimation of priors. A sample size of 5,000 was used in the estimation of the priors. An equal dispersion distribution was assumed for the genes in the two sample groups.
The BIC option for the prior re-estimation step was conducted on the data set. Before the application of the ShrinkSeq, the counts were normalized using TMM normalization factors. A Zero-inflated Negative Binomial distribution and shrinkage to the computational parameter as well as the computational regression coefficient of interest in the inference procedure was conducted. To make the results of the methods used comparable, a non-zero fold change was not imposed when estimating the false discovery rates (Ujjawal et al. 2010, p. 23).
3.3 Data sets

The evaluations in this work are based on synthetic data, for which it was possible to control the settings and the true differential expression status of each gene. The counts for each gene were generated from a negative binomial distribution with mean and dispersion parameters estimated from real RNA-seq data. Let $G = \{g_1, \ldots, g_{|G|}\}$ represent the set of genes in the data set; in the synthetic data sets, $|G| = 12{,}500$. Let $S = \{s_1, \ldots, s_{|S|}\}$ denote the set of samples, which is assumed to be partitioned into two subsets $S_1$ and $S_2$. In the analysis, we let $|S_1| = |S_2|$ and take $S_1$ as the "control" group of samples and $S_2$ as a group of samples having an abnormal phenotype (Bullard et al. 2010, p. 56).

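Table 1: Summary of the parameters used to generate the synthetic data sets

| Sim. study | No. of up-regulated DE genes | No. of down-regulated DE genes | No. of Poisson genes ($\phi_g = 0$) | 'Single' outlier fraction | 'Random' outlier fraction |
|---|---|---|---|---|---|
| $B_{0}^{0}$ | 0 | 0 | 0 | 0 | 0 |
| $B_{0}^{1250}$ | 1250 | 0 | 0 | 0 | 0 |
| $B_{625}^{625}$ | 625 | 625 | 0 | 0 | 0 |
| $B_{0}^{4000}$ | 4000 | 0 | 0 | 0 | 0 |
| $B_{2000}^{2000}$ | 2000 | 2000 | 0 | 0 | 0 |
| $P_{0}^{0}$ | 0 | 0 | 6250 | 0 | 0 |
| $P_{625}^{625}$ | 625 | 625 | 6250 | 0 | 0 |
| $S_{0}^{0}$ | 0 | 0 | 0 | 10% | 0 |
| $S_{625}^{625}$ | 625 | 625 | 0 | 10% | 0 |
| $R_{0}^{0}$ | 0 | 0 | 0 | 0 | 5% |
| $R_{625}^{625}$ | 625 | 625 | 0 | 0 | 5% |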

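We also let $G_{DE}^{Up} \subseteq G$ and $G_{DE}^{Down} \subseteq G$ denote the sets of genes that are differentially expressed between the two sample groups: $G_{DE}^{Up}$ is the set of genes up-regulated, and $G_{DE}^{Down}$ the set of genes down-regulated, in $S_2$ compared to $S_1$. The random variable representing the count for gene $g$ in sample $s$ is denoted $Y_{gs}$. The data were then modeled with a negative binomial distribution as follows:

$$Y_{gs} \sim NB\left(\mathrm{mean} = \mu_{gs},\ \mathrm{var} = \mu_{gs}\,(1 + \mu_{gs}\,\phi_{gs})\right)$$

In this analysis, $\phi_{gs}$ denotes the dispersion parameter controlling the level of overdispersion. Also, $\mu_{gs}$ is given by:

$$\mu_{gs} = E(Y_{gs}) = \frac{\lambda_{g\,c(s)}}{\sum_{g' \in G} \lambda_{g'\,c(s)}}\, M_s$$

where $M_s$ is the sequencing depth for sample $s$, defined as $M_s = 10^7\, U_s$ for $U_s \sim \mathrm{Unif}(0.7, 1.4)$, whereas $c(s) \in \{S_1, S_2\}$ represents the condition for sample $s$. The dispersion parameter $\phi_{gs}$ was assumed to be the same in the two sample groups (Bullard et al. 2010, p. 56).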
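The relationship between relative expression level, sequencing depth, mean and variance can be made concrete with a short sketch. It assumes the mean of a gene in a sample is its share of the relative expression levels multiplied by the sequencing depth, and that the negative binomial variance is mean × (1 + mean × dispersion). The function names and the toy values are hypothetical.

```python
def expected_counts(lambdas, depth):
    """Mean count per gene: the gene's share of total expression times depth."""
    total = sum(lambdas)
    return [lam / total * depth for lam in lambdas]


def nb_variance(mu, phi):
    """Negative binomial variance: mu * (1 + mu * phi)."""
    return mu * (1.0 + mu * phi)


# Hypothetical relative expression levels for four genes in one condition.
lambdas = [1.0, 3.0, 6.0, 10.0]
# Sequencing depth 10^7 * U, with U fixed at 1.0 here instead of being
# drawn from Unif(0.7, 1.4), to keep the example deterministic.
depth = 1e7 * 1.0
mu = expected_counts(lambdas, depth)
variances = [nb_variance(m, 0.1) for m in mu]
```

With a dispersion of zero the variance equals the mean, which is the Poisson case used for the non-overdispersed genes in the 'P' simulation studies.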

4.0 Results and Discussions
Eleven methods for differential expression analysis of RNA-seq data were evaluated in this study. Nine of the techniques work on the count data directly: DESeq, edgeR, NBPSeq, TSPM, baySeq, EBSeq, NOISeq, SAMseq and ShrinkSeq. The other two methods combine a data transformation with limma and are therefore referred to in this dissertation as voom(+limma) and vst(+limma). A detailed description of the techniques is given in the methodology section. The methods were evaluated on synthetic data, for which the settings were controlled. Details regarding the simulations are also given in the methodology section (Maji et al. 2014, p. 26).
In the baseline simulation studies (denoted 'B'), all counts were simulated from negative binomial distributions with mean and dispersion parameters estimated from real data. The dispersions in the two conditions were assumed to be identical. The robustness of the methods against variations in the distribution of the input data was evaluated either by imposing a Poisson distribution on the counts for some of the genes (simulation studies denoted 'P'), or by including outliers with abnormally high counts (denoted 'S' and 'R'). The outliers were introduced in two different ways. In the 'single' outlier simulation studies ('S'), 10% of the genes were selected, and for each of these genes a single sample was chosen whose observed count was multiplied by a randomly generated factor between 5 and 20. In the 'random' outlier simulation studies ('R'), each observed count was considered independently and, with probability 0.05, multiplied by a randomly generated factor between 5 and 20 (the outlier fractions are those listed in Table 1).
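The two ways of introducing outliers can be sketched as follows. This is an illustration under stated assumptions rather than the code used in the study: the multiplication factor is assumed to be drawn uniformly between 5 and 20, inflated counts are rounded to integers, and the function names and toy data are hypothetical. The fraction and probability are passed as parameters; the example calls use the values listed in Table 1.

```python
import random


def add_single_outliers(counts, gene_fraction, rng):
    """'Single' outliers: for a random subset of genes, inflate one sample."""
    result = [row[:] for row in counts]  # copy so the input stays untouched
    n_genes = len(result)
    chosen = rng.sample(range(n_genes), round(gene_fraction * n_genes))
    for g in chosen:
        s = rng.randrange(len(result[g]))
        result[g][s] = round(result[g][s] * rng.uniform(5, 20))
    return result


def add_random_outliers(counts, probability, rng):
    """'Random' outliers: inflate each count independently with a probability."""
    result = [row[:] for row in counts]
    for row in result:
        for s in range(len(row)):
            if rng.random() < probability:
                row[s] = round(row[s] * rng.uniform(5, 20))
    return result


rng = random.Random(1)  # fixed seed so the toy example is reproducible
counts = [[100] * 4 for _ in range(20)]  # 20 genes x 4 samples, toy data
single = add_single_outliers(counts, 0.10, rng)
scattered = add_random_outliers(counts, 0.05, rng)
```

In the 'single' scheme an affected gene has exactly one inflated sample, whereas in the 'random' scheme a gene may have none or several.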
The number of genes in each simulated data set was 12,500, whereas the number of differentially expressed (DE) genes was set to 0, 1,250 or 4,000. The composition of the DE genes was also varied, that is, the proportion of the DE genes that were up- and down-regulated in one condition compared to the other. Finally, the effect of increasing the sample size from 2 to 5 or 20 samples per condition was evaluated. These sample sizes were taken to reflect a range of experimental settings. Since most current RNA-seq experiments have small sample sizes, and the choice in the experimental design is often between two and three samples per condition, comparisons were also performed with 3 samples per condition (Maji et al. 2014, p. 26). These results were then contrasted with those obtained with 2 to 20 samples per condition. In addition to the simulated data, we compared the techniques based on their performance on real RNA-seq data. Using the simulated data, the following aspects of the methods were studied under different conditions.
The ability to rank truly DE genes ahead of non-DE genes was assessed in terms of the area under a Receiver Operating Characteristic (ROC) curve (AUC), and also in terms of false discovery curves, which show the number of false detections encountered while going through the list of genes ordered according to the evidence for differential expression. The ability to control the false discovery rate and the type I error rate at an imposed level was assessed by computing the true false discovery rate and the observed type I error rate, respectively, among the genes called DE at given significance levels (Torres et al. 2014, p. 2204). The computational time required to run the differential expression analysis was also measured. For the real RNA-seq data, the collections of DE genes found by the different techniques were studied, both in terms of their cardinalities and their overlaps. The concordance of the gene rankings obtained by the different techniques was also studied.
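The two error measures can be sketched in a few lines. The sketch assumes a list of p-values (adjusted p-values when assessing FDR, nominal p-values when assessing type I error) and a list of flags giving the simulated truth; the function names are hypothetical. When no gene is called DE the FDR is undefined, which is represented here by `None`.

```python
def call_de(p_values, threshold=0.05):
    """Flag a gene as DE when its p-value is below the threshold."""
    return [p < threshold for p in p_values]


def true_fdr(called, truly_de):
    """Fraction of called genes that are truly non-DE (None if nothing called)."""
    n_called = sum(called)
    if n_called == 0:
        return None
    false_pos = sum(1 for c, t in zip(called, truly_de) if c and not t)
    return false_pos / n_called


def type_one_error(called, truly_de):
    """Fraction of truly non-DE genes that were called DE (None if none exist)."""
    n_null = sum(1 for t in truly_de if not t)
    if n_null == 0:
        return None
    false_pos = sum(1 for c, t in zip(called, truly_de) if c and not t)
    return false_pos / n_null


# Toy example with four genes (illustrative values only).
truth = [True, False, False, True]
calls = call_de([0.01, 0.02, 0.2, 0.03])
fdr = true_fdr(calls, truth)
alpha = type_one_error(calls, truth)
```

Both measures count the same false positives but divide by different totals: the FDR by the number of genes called DE, the type I error rate by the number of truly non-DE genes.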
4.1 Discrimination between DE and non-DE genes
The eleven methods were first evaluated on their ability to discriminate between truly DE genes and truly non-DE genes. For each method, a score was computed for each gene, which allowed the genes to be ranked in order of significance or evidence for differential expression between the two conditions. For the six methods which provide nominal p-values (edgeR, NBPSeq, DESeq, voom+limma, TSPM, vst+limma), the score was defined as $1 - p_{nom}$. For SAMseq, the absolute value of the averaged Wilcoxon statistic was used as the ranking score, whereas for EBSeq, ShrinkSeq and baySeq, the score was the estimated posterior probability of differential expression, which is equivalent to 1 - BFDR, where BFDR denotes the estimated Bayesian False Discovery Rate (Torres et al. 2014, p. 2204).
For NOISeq, the NOISeq statistic was used, as described in the methodology section of this dissertation. All the scores are two-sided, that is, they are not affected by the direction of differential expression between the two conditions. Given a threshold value for such a score, one may choose to call all genes whose scores exceed the threshold DE, and all genes with scores below the threshold non-DE.
Considering the genes simulated to be DE as the true positive group, the false positive rate and the true positive rate were computed for all possible score thresholds, and a Receiver Operating Characteristic (ROC) curve was constructed for each method. The area under the curve (AUC) was used as a measure of the discriminative performance of a method, reflecting its ability to rank the truly DE genes ahead of truly non-DE ones (Torres et al. 2014, p. 2204).
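The AUC has a direct interpretation that can be sketched without constructing the curve itself: it equals the probability that a randomly chosen truly DE gene receives a higher score than a randomly chosen truly non-DE gene, with ties counted as one half. The sketch below uses this pairwise definition; the function name is hypothetical, and the quadratic loop is chosen for clarity rather than speed.

```python
def roc_auc(scores, truly_de):
    """AUC as the fraction of (DE, non-DE) pairs ranked correctly."""
    positives = [s for s, t in zip(scores, truly_de) if t]
    negatives = [s for s, t in zip(scores, truly_de) if not t]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one DE and one non-DE gene")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # a tie counts as half a correct ranking
    return wins / (len(positives) * len(negatives))


# Toy scores of the form 1 - p for four genes (illustrative values only).
auc = roc_auc([0.99, 0.60, 0.95, 0.10], [True, False, True, False])
```

An AUC of 1 means every truly DE gene is ranked ahead of every non-DE gene, and 0.5 corresponds to a random ordering.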

When all the DE genes were up-regulated in condition $S_2$ compared to condition $S_1$, as shown in Figures 1A and 1C, there was high variability in the results obtained using baySeq. The variability was reduced when the DE genes were regulated in different directions, as shown in Figures 1B and 1D. The simulation setting shown in Figure 1B was chosen as the basis for evaluating the effect of introducing non-overdispersed genes or outliers. When the fraction of genes following a Poisson distribution was increased to 50% (the 'P' simulation studies), the AUC increased, most markedly for the smallest sample size. Outliers with abnormally high counts reduced the AUC for all methods, but less so for the transformation-based methods. The reduction was also smaller for SAMseq than for the other methods (Bacci et al. 2014, p. 45).

Figure 1: Area under the ROC curve (AUC) for the eleven methods in simulation studies $B_{0}^{1250}$ (fig. 1A), $B_{625}^{625}$ (fig. 1B), $B_{0}^{4000}$ (fig. 1C), $B_{2000}^{2000}$ (fig. 1D), $S_{625}^{625}$ (fig. 1E) and $R_{625}^{625}$ (fig. 1F), where the superscript gives the number of up-regulated and the subscript the number of down-regulated DE genes. The boxplots summarize the AUCs obtained across the 10 simulated instances of each study. Each panel shows the AUCs across the sample sizes 2, 5 and 10. The techniques are ordered according to their median AUC for the largest sample size. When all the DE genes were regulated in the same direction, increasing the number of DE genes from 1,250 (A) to 4,000 (C) impaired the performance of all techniques. In contrast, when the DE genes were regulated in different directions (B and D), the number of DE genes had much less impact. The variability of the performance of baySeq is higher when all the DE genes are regulated in the same direction (A and C) than when they are regulated in different directions (B and D). Panels E and F show that the introduction of outliers decreased the AUC for most of the methods, but less so for the transformation-based techniques.
Figure 2 gives representative false discovery curves for the same simulation studies that were shown in Fig. 1. Since we are interested in the genes showing the greatest evidence of differential expression, the analysis was confined to the 1,500 top-ranked genes for each technique. Although NBPSeq was among the best techniques in terms of overall ranking, as shown by its high AUC in Fig. 1, it had problems with false discoveries among the top-ranked genes under many simulation settings (Bacci et al. 2014, p. 45). In fact, while its total number of false discoveries among the 1,500 top-ranked genes was on par with that of the other techniques, some of the false discoveries were ranked very near the top by NBPSeq. TSPM and NOISeq also showed a tendency to rank some truly non-DE genes near the top.
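A false discovery curve of this kind can be computed by sorting the genes by decreasing score and keeping a running count of truly non-DE genes. The following is a minimal sketch with a hypothetical function name. Ties are broken by input order here, which is a simplification: genes sharing the same score would really enter the list together.

```python
def false_discovery_curve(scores, truly_de, top=1500):
    """Number of false positives among the T top-ranked genes, T = 1..top."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    curve = []
    false_count = 0
    for i in order[:top]:
        if not truly_de[i]:
            false_count += 1
        curve.append(false_count)
    return curve


# Toy example with four genes (illustrative values only).
curve = false_discovery_curve([0.9, 0.8, 0.7, 0.1], [True, False, True, False])
```

A method that places a truly non-DE gene at the very top produces a curve that rises at T = 1, which is the behaviour described above for NBPSeq.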

For the simulation study in which half of the genes were generated according to a Poisson distribution, the performance of TSPM improved and fewer non-DE genes were ranked near the top. The best performance, in terms of ranking the true positives at the top, was obtained with the transformation-based techniques, voom+limma and vst+limma, and with ShrinkSeq. SAMseq also performed well, although it assigned the same score to a number of genes, both truly DE and truly non-DE. Larger sample sizes resulted in markedly fewer false positives among the top-ranked genes (Bacci et al. 2014, p. 45).

Figure 2: False discovery curves depicting the number of false positives found among the T top-ranked genes for the eleven methods, with 5 samples per condition, in six simulation studies (fig. 2A to fig. 2F). Some of the curves do not pass through the origin because several genes obtained the same ranking score.
In addition to the synthetic data sets, an RNA-seq data set from 21 mice was analyzed, 11 of the DBA/2J strain and 10 of the C57BL/6J strain. After filtering out the genes whose total count across the 21 mice was less than 10, the data set contained 11,870 genes. The 11 methods were applied to find genes that were differentially expressed between the two mouse strains. Genes found to be DE at an adjusted p-value, FDR or Bayesian FDR threshold of 0.05 were considered significantly DE. It was not clear how to fix a threshold for the q-value returned by NOISeq so that it would match the FDR or adjusted p-value from the other techniques, and hence NOISeq was excluded from most of the subsequent analysis.
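The filtering step can be sketched as follows, assuming the counts are held in a dictionary that maps a gene identifier to its list of per-sample counts. The function name, the gene identifiers and the toy counts are hypothetical.

```python
def filter_low_counts(count_table, min_total=10):
    """Keep genes whose summed count over all samples is at least min_total."""
    return {
        gene: row
        for gene, row in count_table.items()
        if sum(row) >= min_total
    }


# Toy table with three genes and three samples (illustrative values only).
table = {
    "geneA": [0, 1, 2],     # total 3: removed
    "geneB": [5, 5, 0],     # total 10: kept
    "geneC": [120, 98, 143],
}
kept = filter_low_counts(table)
```

Genes with very low total counts carry little information about differential expression, so removing them reduces the number of tests.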

Figure 3: True false discovery rates (FDR) observed at an imposed FDR threshold of 0.05, for the nine techniques returning adjusted p-values, in six simulation studies (fig. 3A to fig. 3F). With two samples per condition, three of the techniques (vst+limma, voom+limma and SAMseq) did not call any DE genes, and the FDR was therefore undefined.

To begin with, the number of DE genes found by each technique was compared (Fig. 4A). The highest number of DE genes was obtained using ShrinkSeq, while baySeq returned relatively few. As can be seen in Figure 4A, TSPM, NBPSeq, edgeR and the two transformation-based techniques found about the same number of DE genes (Bacci et al. 2014, p. 45).

Figure 4: Analysis of the Bottomly data set. (A) Number of genes found to be significantly DE between the two mouse strains. (D) The average number of genes found to be significantly DE when contrasting two subsets of mice of the same strain.
To further evaluate the performance of these techniques, they were applied to a data set comprising only the mice of the C57BL/6J strain, in which two arbitrary sample classes of 5 samples each were defined. The analysis was repeated five times with different random divisions. Under these conditions, no genes are expected to be truly DE. However, most of the techniques found differentially expressed genes in at least one instance. TSPM found by far the largest number of DE genes (Fig. 4D), which is consistent with our previous observation that this technique may be too liberal. By studying the DE genes in the five instances, it was noted that the DE genes found by edgeR overlapped with those found by NBPSeq. Only a few of the DE genes called by TSPM overlapped with the DE genes found by the other techniques. EBSeq also showed a tendency to call unique genes that were not found by any of the other techniques. The lack of agreement among the DE genes found by the different approaches may be an indication that they are false positives, and that the different techniques favor different types of patterns (Bacci et al. 2014, p. 45).

Table 2: Summary of observations

Technique
observation

DESeq
· Conservative with default settings.
· Is more conservative when outliers are introduced.
· Poor FDR control with 2 samples/condition
· Good FDR control for larger sample sizes with outliers.
· Generally low TPR.
· Medium computational time requirement which increases significantly with sample size.

edgeR
· Liberal for small sample sizes with default settings.
· More liberal when outliers are introduced.
· Poor FDR control in many cases, and worse with outliers.
· Medium computational time requirement which is largely independent of sample size.
· Generally high TPR.

NBPSeq
· Liberal for all sample sizes. It is more liberal when outliers are introduced.
· Poor FDR control, worse with outliers. Often truly
· Medium TPR.
· Medium computational time requirement, increases with sample size.

TSPM
· Highly sample-size dependent performance
· Very poor FDR control for small sample sizes
· Liberal for small sample sizes and unaffected by outliers.
· Many truly non-DE genes have smallest p-values
· Medium computational time requirement and is independent of sample size.

baySeq
· Highly variable results when all the DE genes are regulated in the same direction.
· Less variability when all the DE genes are regulated in different directions.
· Poor FDR control with 2 samples/condition, good for larger sample sizes in the absence of outliers.
· Low TPR. Largely unaffected by outliers.
· Poor FDR control in the presence of outliers.
· Computationally slow, but allows parallelization.

EBSeq
· Has TPR which is relatively independent of both presence of outliers and sample size
· Poor FDR control in most situations which is moderately unaffected by outliers.
· Medium computational time requirement which increases with sample size.

NOISeq
· Performs well, in terms of false discovery curves
· Computational time requirement is dependent on sample size.

SAMseq
· Low power for small sample sizes.
· Largely unaffected by introduction of outliers.
· High TPR for large enough sample sizes.
· Performs well also for simulation study with
· Computational time requirement is dependent on sample size

ShrinkSeq
· FDR control, but allows the user to use a fold change threshold.
· High TPR.
· Computationally slow, but allows parallelization.

5.0 Conclusion
In this dissertation, eleven computational techniques for differential expression analysis of RNA-seq data have been evaluated and compared. A summary of the findings and observations is given in Table 2. No single technique among those evaluated is optimal under all circumstances, and the technique of choice therefore depends on the experimental conditions. In this work, the techniques based on a variance-stabilizing transformation in combination with limma performed well under a wide range of conditions and were relatively unaffected by outliers. These techniques were also computationally fast, although they required at least three samples per condition to have sufficient power to detect DE genes. The non-parametric technique SAMseq, which was among the top performing techniques for data sets with large sample sizes, required at least four to five samples per condition to have adequate power to find DE genes. The same was true for ShrinkSeq, which has an option for imposing a fold change requirement in the inference procedure. For the parametric methods, the problems at small sample sizes are due to inaccuracies in the estimation of the dispersion and mean parameters. In this study, TSPM emerged as the method most affected by the sample size, which was attributed to its use of asymptotic statistics. These results suggest that differentially expressed genes found between small collections of samples should be interpreted with caution, and that the true FDR may be higher than the selected FDR threshold. DESeq, edgeR and NBPSeq are based on similar principles and showed relatively similar accuracy with regard to gene ranking. Nevertheless, for large sample sizes, DESeq was overly conservative, while edgeR and NBPSeq were too liberal, as they called a larger number of both true and false DE genes.
References
Zhao Shanrong, Xi Li, Quan Jie, Xi Hualin, Zhang Ying, von Schack David, Vincent Michael and Zhang Baohong 2016, QuickRNASeq lifts large-scale RNA-seq data analyses to the next level of automation and interactive visualization, BMC Genomics, vol. 17, pp. 1-15.
Wagle Prerana, Nikolić Miloš and Frommolt Peter 2015, QuickNGS elevates Next-Generation Sequencing data analysis to a new level of automation, BMC Genomics, vol. 16, no. 1, pp. 1-8.
Sun Yongmei, Li Xing, Wu Di, Pan Qi, Ji Yuefeng, Ren Hong and Ding Keyue 2016, RED: A Java-MySQL Software for Identifying and Visualizing RNA Editing Sites Using Rule-Based and Statistical Filters, PLoS ONE, vol. 11, no. 3, pp. 1-11.
Wolf Matthias, Koetschan Christian and Müller Tobias 2014, Review: ITS2, 18S, 16S or any other RNA - simply aligning sequences and their individual secondary structures simultaneously by an automatic approach, Gene, vol. 546, no. 2, pp. 145-149.
Wintermans Bastiaan, Brandt Bernd, Vandenbroucke-Grauls Christina and Budding Andries 2015, TreeSeq, a Fast and Intuitive Tool for Analysis of Whole Genome and Metagenomic Sequence Data, PLoS ONE, vol. 10, no. 5, pp. 1-11.
Zyprych-Walczak J, Szabelska A, Handschuh L, Górczak K, Klamecka K, Figlerowicz M and Siatkowski I 2015, The Impact of Normalization Methods on RNA-Seq Data Analysis, BioMed Research International, vol. 2015, pp. 1-10.
Vila-Casadesús Maria, Gironella Meritxel and Lozano Juan Jose 2016, MiRComb: An R Package to Analyse miRNA-mRNA Interactions. Examples across Five Digestive Cancers, PLoS ONE, vol. 11, no. 3, pp. 1-18.
Wang Yejun, MacKenzie Keith D and White Aaron P 2015, An empirical strategy to detect bacterial transcript structure from directional RNA-seq transcriptome data, BMC Genomics, vol. 16, no. 1, pp. 1-15.
Boley Nathan, Stoiber Marcus H, Booth Benjamin W, Wan Kenneth H, Hoskins Roger A, Bickel Peter J, Celniker Susan E and Brown James 2014, Genome-guided transcript assembly by integrative analysis of RNA sequence data, Nature Biotechnology, vol. 32, no. 4, pp. 341-346.
Irla Marta, Neshat Armin, Brautaset Trygve, Ruckert Christian, Kalinowski Jorn and Wendisch Volker F 2015, Transcriptome analysis of thermophilic methylotrophic Bacillus methanolicus MGA3 using RNA-sequencing provides detailed insights into its previously uncharted transcriptional landscape, BMC Genomics, vol. 16, no. 1, pp. 1-22.
Sturgill David, Malone John H, Sun Xia, Smith Harold E, Rabinow Leonard, Samson Marie-Laure and Oliver Brian 2013, Design of RNA splicing analysis null models for post hoc filtering of Drosophila head RNA-Seq data with the splicing analysis kit (Spanki), BMC Bioinformatics, vol. 14, no. 1, pp. 1-32.
An J, Lai J, Sajjanhar A, Lehman ML and Nelson CC 2014, miRPlant: an integrated tool for identification of plant miRNA from RNA sequencing data, BMC Bioinformatics, vol. 15, p. 275.
Trapnell C, Pachter L and Salzberg SL 2009, TopHat: discovering splice junctions with RNA-Seq, Bioinformatics, vol. 25, no. 9, pp. 1105-1111.
Goecks J, Nekrutenko A, Taylor J and Galaxy Team 2010, Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences, Genome Biology, vol. 11, no. 8, p. R86.
Langmead B, Trapnell C, Pop M and Salzberg SL 2009, Ultrafast and memory-efficient alignment of short DNA sequences to the human genome, Genome Biology, vol. 10, no. 3, p. R25.
Bullard JH, Purdom E, Hansen KD and Dudoit S 2010, Evaluation of statistical methods for normalization and differential expression in mRNA-Seq experiments, BMC Bioinformatics, vol. 11, p. 94.
Maji Ranjan Kumar, Sarkar Arijita, Khatua Sunirmal, Dasgupta Subhasis and Ghosh Zhumu 2014, PVT: an efficient computational procedure to speed up next-generation sequence analysis, BMC Bioinformatics, vol. 15, no. 1, pp. 1-26.
Torres García W, Zheng S, Sivachenko A, Vegesna R, Wang Q et al. 2014, PRADA: pipeline for RNA sequencing data analysis, Bioinformatics, vol. 30, no. 15, pp. 2224-2226.
Bacci G, Bazzicalupo M, Benedetti A and Mengoni A 2014, StreamingTrim 1.0: a Java software for dynamic trimming of 16S rRNA sequence data from metagenetic studies. Accessed 22 April 2016.
Sigurgeirsson B, Emanuelsson O and Lundeberg J 2014, Analysis of stranded information using an automated procedure for strand specific RNA sequencing, BMC Genomics, vol. 15, p. 631.
Ujjawal Maulik, Sanghmitra Bandhopadhyay and Jason T.L. Wang 2010, Computational Intelligence and Pattern Analysis in Biological Informatics, New Jersey, John Wiley & Sons.
Stefan Washietl, Sebastian Will, David A. Hendrix, Loyal A. Goff, John L. Rinn, Bonnie Berger and Manolis Kellis 2012, Computational analysis of noncoding RNAs, John Wiley & Sons, Ltd.

