Computational Methods for Next Generation Sequencing Data Analysis - Ion Mandoiu - ebook

Ion Mandoiu

439,99 zł

Description

Introduces readers to core algorithmic techniques for next-generation sequencing (NGS) data analysis and discusses a wide range of computational techniques and applications.

This book provides an in-depth survey of some of the recent developments in NGS and discusses mathematical and computational challenges in various application areas of NGS technologies. The 18 chapters featured in this book have been authored by bioinformatics experts and represent the latest work in leading labs actively contributing to the fast-growing field of NGS.

The book is divided into four parts:

Part I focuses on computing and experimental infrastructure for NGS analysis, including chapters on cloud computing, modular pipelines for metabolic pathway reconstruction, pooling strategies for massive viral sequencing, and high-fidelity sequencing protocols.

Part II concentrates on analysis of DNA sequencing data, covering the classic scaffolding problem, detection of genomic variants, including insertions and deletions, and analysis of DNA methylation sequencing data.

Part III is devoted to analysis of RNA-seq data. This part discusses algorithms and compares software tools for transcriptome assembly along with methods for detection of alternative splicing and tools for transcriptome quantification and differential expression analysis.

Part IV explores computational tools for NGS applications in microbiomics, including a discussion on error correction of NGS reads from viral populations, methods for viral quasispecies reconstruction, and a survey of state-of-the-art methods and future trends in microbiome analysis.
Computational Methods for Next Generation Sequencing Data Analysis:

* Reviews computational techniques such as new combinatorial optimization methods, data structures, high performance computing, machine learning, and inference algorithms
* Discusses the mathematical and computational challenges in NGS technologies
* Covers NGS error correction, de novo genome and transcriptome assembly, variant detection from NGS reads, and more

This text is a reference for biomedical professionals interested in expanding their knowledge of computational techniques for NGS data analysis. The book is also useful for graduate and post-graduate students in bioinformatics.

Number of pages: 820




Table of Contents

Cover

Title Page

Copyright

Contributors

Preface

About the Companion Website

Part I: Computing and Experimental Infrastructure for NGS

Chapter 1: Cloud Computing for Next-Generation Sequencing Data Analysis

1.1 Introduction

1.2 Challenges for NGS Data Analysis

1.3 Background for Cloud Computing and its Programming Models

1.4 Cloud Computing Services for NGS Data Analysis

1.5 Conclusions and Future Directions

References

Chapter 2: Introduction to the Analysis of Environmental Sequence Information Using MetaPathways

2.1 Introduction & Overview

2.2 Background

2.3 MetaPathways Processes

2.4 Big Data Processing

2.5 Downstream Analyses

2.6 Conclusions

References

Chapter 3: Pooling Strategy for Massive Viral Sequencing

3.1 Introduction

3.2 Design of Pools for Big Viral Data

3.3 Deconvolution of Viral Samples from Pools

3.4 Performance of Pooling Methods on Simulated Data

3.5 Experimental Validation of Pooling Strategy

3.6 Conclusion

References

Chapter 4: Applications of High-Fidelity Sequencing Protocol to RNA Viruses

4.1 Introduction

4.2 High-Fidelity Sequencing Protocol

4.3 Assembly of High-Fidelity Sequencing Data

4.4 Performance of VGA on Simulated Data

4.5 Performance of Existing Viral Assemblers on Simulated Consensus Error-Corrected Reads

4.6 Performance of VGA on Real HIV Data

4.7 Comparison of Alignment on Error-Corrected Reads

4.8 Evaluation of Error Correction Tools Based on High-Fidelity Sequencing Reads

Acknowledgment

References

Part II: Genomics and Epigenomics

Chapter 5: Scaffolding Algorithms

5.1 Scaffolding

5.2 State-of-the-Art Scaffolding Tools

5.3 Recent Scaffolding Tools

5.4 Scaffolding Software Evaluation

References

Chapter 6: Genomic Variants Detection and Genotyping

6.1 Introduction

6.2 Methods for Detection and Genotyping of SNP and Small Indels

6.3 Methods for Detection and Genotyping of CNVs

6.4 Putting Everything Together

References

Chapter 7: Discovering and Genotyping Twilight Zone Deletions

7.1 Introduction

7.2 Notation

7.3 Non-Twilight-Zone Deletion Discovery

7.4 Discovering “Twilight Zone” Deletions: New Solutions

7.5 Genotyping “Twilight Zone” Deletions

7.6 Results

7.7 Discussion

7.8 Availability

Acknowledgments

References

Chapter 8: Computational Approaches for Finding Long Insertions and Deletions with NGS Data

8.1 Background

8.2 Methods

8.3 Applications

8.4 Conclusions and Future Directions

Acknowledgment

References

Chapter 9: Computational Approaches in Next-Generation Sequencing Data Analysis for Genome-Wide DNA Methylation Studies

9.1 Introduction

9.2 Enrichment-Based Approaches

9.3 Bisulfite Treatment-Based Approaches

9.4 Conclusion

References

Chapter 10: Bisulfite-Conversion-Based Methods for DNA Methylation Sequencing Data Analysis

10.1 Introduction

10.2 The Problem of Mapping BS-Treated Reads

10.3 Algorithmic Approaches to the Problem of Mapping BS-Treated Reads

10.4 Methylation Estimation

10.5 Possible Biases in Estimation of Methylation Level

10.6 Bisulfite Conversion Rate

10.7 Reduced Representation Bisulfite Sequencing

10.8 Accuracy as a Performance Measurement

References

Part III: Transcriptomics

Chapter 11: Computational Methods for Transcript Assembly from RNA-Seq Reads

11.1 Introduction

11.2 De Novo Assembly

11.3 Genome-Based Assembly

11.4 Conclusions

Acknowledgment

References

Chapter 12: An Overview and Comparison of Tools for RNA-Seq Assembly

12.1 Quality Assessment

12.2 Experimental Considerations

12.3 Assembly

12.4 Experiment

12.5 Comparison

12.6 Results

12.7 Summary and Conclusion

Acknowledgments

References

Chapter 13: Computational Approaches for Studying Alternative Splicing in Nonmodel Organisms from RNA-Seq Data

13.1 Introduction

13.2 Representation of Alternative Splicing

13.3 Comparison to Model Organisms

13.4 Accuracy of Algorithms

13.5 Discussion

References

Chapter 14: Transcriptome Quantification and Differential Expression from NGS Data

14.1 Introduction

14.2 Overview of the State-of-the-Art Methods

14.3 Recent Algorithms

14.4 Experimental Setup

14.5 Evaluation

Acknowledgments

References

Part IV: Microbiomics

Chapter 15: Error Correction of NGS Reads from Viral Populations

15.1 Next-Generation Sequencing of Heterogeneous Viral Populations and Sequencing Errors

15.2 Methods and Algorithms for the NGS Error Correction in Viral Data

15.3 Algorithm Comparison

References

Chapter 16: Probabilistic Viral Quasispecies Assembly

16.1 Intra-Host Virus Populations

16.2 Next-Generation Sequencing for Viral Genomics

16.3 Probabilistic Reconstruction Methods

16.4 Conclusion

References

Chapter 17: Reconstruction of Infectious Bronchitis Virus Quasispecies from NGS Data

17.1 Introduction

17.2 Background

17.3 Methods

17.4 Results and Discussion

Acknowledgments

References

Chapter 18: Microbiome Analysis: State of the Art and Future Trends

18.1 Introduction

18.2 The Metagenomics Analysis Pipeline

18.3 Data Limitations and Sources of Errors

18.4 Diversity and Richness Measures

18.5 Correlations and Association Rules

18.6 Microbial Functional Profiles

18.7 Microbial Social Interactions and Visualizations

18.8 Bayesian Inferences

18.9 Conclusion

References

Index

End User License Agreement

List of Illustrations

Chapter 1: Cloud Computing for Next-Generation Sequencing Data Analysis

Figure 1.1 The old genome informatics ecosystem prior to the advent of next-generation sequencing technologies (3).

Figure 1.2 Historical trends in storage prices versus DNA sequencing costs (7).

Figure 1.3 Traditional computing versus physical machine with virtualization.

Figure 1.4 The workflow of MapReduce programming model (32).

Figure 1.5 Three MapReduce programming models.

Figure 1.6 The architecture of Hadoop (32).

Figure 1.7 The task programming model (33).

Figure 1.8 The architecture of Microsoft Azure (33).

Figure 1.9 The workflow of AzureBlast (53).

Figure 1.10 Workflow of RSD using the MapReduce framework on the EC2 (38).

Figure 1.11 The overview of CloudBurst (39).

Figure 1.12 The workflow of CloudAligner (41).

Chapter 2: Introduction to the Analysis of Environmental Sequence Information Using MetaPathways

Figure 2.1 Beginning with nucleotide sequences as input, the MetaPathways pipeline processes input sequences in five operational stages: (i) QC & ORF prediction, where minor quality control and open reading frame (ORF) prediction are performed via Prodigal; (ii) predicted ORFs are annotated using a seed-and-extend algorithm against a collection of reference databases (e.g., KEGG, COG, MetaCyc, and RefSeq); (iii) secondary taxonomic analyses include MEGAN's lowest common ancestor (LCA) taxonomy, tRNA scan, and MLTreeMap; (iv) sequences and annotations are combined into a Pathway Tools compatible format and ePGDBs are constructed; and (v) genes, reactions, and pathways are extracted for downstream analysis (39).

Figure 2.2 Short-read ORF prediction. (a) Three classic sequence features used to define an ORF are start codons or transcription initiation sites (TIS), for example, ATG, in-frame stop codons (e.g., TGA), and the presence of a 5′ upstream ribosomal binding site (RBS). (b) Current short-read sequencing length is smaller than most bacterial genes, thus many ORF sequences will truncate outside of the sequenced short read window; ORFs in short reads have a number of possible incomplete signals: (i) multiple TIS sites with upstream in-frame stop codon; (ii) in-frame stop codon absent; (iii) no valid TIS site present; (iv) no start but potential stop codon present; (v) start but no potential stop codon present; and (vi) no start, RBS, or potential stop present.

Figure 2.3 The anonymous sequence problem. Metagenomic sequences are derived from many different donor genotypes from a wide range of taxonomic groups creating a parameter training problem for many ORF prediction algorithms.

Figure 2.4 A tidal wave of data: NCBI RefSeq Protein Sequence Record growth (2004–2014). The steady increase in the number of protein reference sequences presents computational challenges to functional annotation of environmental sequence information using seed-and-extend algorithms such as BLAST or LAST.

Figure 2.5 The small-subunit (SSU) rRNA gene is the “gold standard” for microbial diversity studies. (a) The E. coli SSU rRNA transcript and its folding structure contain four main domains (5′, C, 3′M, and 3′m) and nine hyper-variable regions (V1–V9) that correspond to areas with higher mutation rates

Figure 2.6 Taxonomic assignment and functional gene profiling modules: (a) Clusters of orthologous groups (COGs) are constructed using a “triangle homology” method. Similar to rRNA genes, many COGs can be used as phylogenetic anchors. (b) MLTreeMap leverages a subset of 40 universal COGs aligned and concatenated into a “supermatrix.” Metagenomic reads can be added to this alignment and placed on the tree of life using a maximum likelihood method. (c) MEGAN parses BLAST outputs and projects this information onto the NCBI taxonomic hierarchy using the lowest common ancestor (LCA) algorithm. MEGAN also supports KEGG and SEED subsystems mapping.
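The lowest common ancestor idea named in panel (c) of Figure 2.6 can be illustrated with a minimal Python sketch. The function name and the example lineages are hypothetical and written root-first for illustration; this is a simplification of the general LCA idea, not MEGAN's implementation.

```python
def lowest_common_ancestor(lineages):
    """Return the deepest taxon shared by all root-first lineages, or None."""
    if not lineages:
        return None
    lca = None
    # Walk all lineages in parallel, one rank at a time, from the root down.
    for ranks in zip(*lineages):
        if all(rank == ranks[0] for rank in ranks):
            lca = ranks[0]
        else:
            break
    return lca


# Hypothetical lineages of three database hits for one read.
hits = [
    ["Bacteria", "Proteobacteria", "Gammaproteobacteria", "Escherichia"],
    ["Bacteria", "Proteobacteria", "Gammaproteobacteria", "Salmonella"],
    ["Bacteria", "Proteobacteria", "Alphaproteobacteria", "Rhizobium"],
]
assigned_taxon = lowest_common_ancestor(hits)
```

In this sketch a read whose hits disagree below the phylum level is assigned to the phylum, which is the conservative behavior an LCA assignment is meant to provide.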

Figure 2.7 ePGDB navigation. An ePGDB can be interactively queried at multiple levels of biological organization from higher level cellular and pathway views down to individual coding sequences and enzymatic reactions (39).

Figure 2.8 A master–worker model. (a) Sequence homology searches present an embarrassingly parallel problem. (b) The MetaPathways Broker distributes a BLAST job into equal-sized subtasks. The broker establishes BLAST services (with all the required executables and databases) on worker grids available to accept incoming tasks. Jobs (squares) are submitted in a round-robin manner to each worker grid. (c) The broker intermittently harvests results (circles) from each grid as they become available, demultiplexing if there are multiple samples being run. (d) An adversary can cause nodes or entire grids to fail at random. The broker provides fault tolerance by migrating lost jobs (dashed lines) to alternative grids. (e) An adversary can also cause intermittent or failed Internet connections (the line with the cartoon of a demon). The broker uses exponential back-off to determine job migration to other grids if latency becomes excessive.
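Two mechanisms named in the caption of Figure 2.8, round-robin job submission and exponential back-off, can be sketched in a few lines of Python. The function names, grid names, and delay values are illustrative assumptions and do not come from the MetaPathways Broker itself.

```python
from itertools import cycle


def assign_round_robin(jobs, grids):
    """Assign each job to a grid in round-robin order."""
    assignment = {grid: [] for grid in grids}
    for job, grid in zip(jobs, cycle(grids)):
        assignment[grid].append(job)
    return assignment


def backoff_delays(base_seconds, max_retries):
    """Delay before each retry doubles: base, 2*base, 4*base, ..."""
    return [base_seconds * 2 ** attempt for attempt in range(max_retries)]


# Hypothetical job and grid names.
plan = assign_round_robin(["job0", "job1", "job2", "job3", "job4"], ["gridA", "gridB"])
delays = backoff_delays(1, 4)
```

Doubling the wait after each failed attempt keeps a slow or unreachable grid from being polled too aggressively before its jobs are migrated elsewhere.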

Figure 2.9 The “Knowledge Engine” data structure. This data structure considers sequence reads or predicted ORFs as data primitives that can be summarized by projection onto a series of classification schemes using pointer following. Pointer following is a computationally efficient operation so the identification and enumeration of data primitives are robust to large environmental data sets. The connection between primitives and classification schemes are called Knowledge objects. Through the exploration of data projected on classification schemes, new Knowledge objects can be created (dashed lines). Once created, Knowledge objects can be projected onto tables or other visualization modes that can in turn be used to create new Knowledge objects, enabling iterative and interactive data exploration.

Figure 2.10 Knowledge objects projections. A Large Table view enables efficient query, look-up, and sub-setting of reads, ORFs, translated ORFs into amino acid sequences, statistics, hierarchical annotations, while a Contig View enables genome context navigation based on ORF positions with functional and taxonomic annotations for the ORFs, on both strands, appearing as tool-tips pop-ups.

Figure 2.11 Comparative pathway analysis. Two or more ePGDBs can be compared using the Cellular Overview and Omics Data features (glyphs represent pathways predicted for one sample using three different sequencing methods) (39).

Figure 2.12 Tables of long and wide format. A key “tidy data” concept is the format of long and wide tables. Algorithms implementing hierarchical clustering and dimensionality-reduction techniques such as principal component analysis (PCA) and nonmetric multidimensional scaling (NMDS) often require the wide format (a), while many other plotting packages such as ggplot2 and lattice require the long format (b). It is important to understand how to use the melt() and dcast() functions in the reshape2 R package to convert between the two formats.
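The wide and long layouts described for Figure 2.12 can be illustrated without R. The plain-Python sketch below mimics the idea behind melt() and dcast(); it is an analogue written for illustration, not the reshape2 API, and the pathway names and counts are made-up placeholders.

```python
def melt(wide_rows, id_key):
    """Wide -> long: one output row per (id, variable, value) triple."""
    long_rows = []
    for row in wide_rows:
        for key, value in row.items():
            if key != id_key:
                long_rows.append({id_key: row[id_key], "variable": key, "value": value})
    return long_rows


def cast(long_rows, id_key):
    """Long -> wide: one output row per id, one column per variable."""
    wide = {}
    for row in long_rows:
        target = wide.setdefault(row[id_key], {id_key: row[id_key]})
        target[row["variable"]] = row["value"]
    return list(wide.values())


# Placeholder abundance table in wide format: one column per sample.
wide_table = [
    {"pathway": "glycolysis", "sample_1": 12, "sample_2": 7},
    {"pathway": "TCA cycle", "sample_1": 3, "sample_2": 9},
]
long_table = melt(wide_table, "pathway")
round_trip = cast(long_table, "pathway")
```

The long table has one row per pathway and sample pair, which is the shape plotting packages usually expect, while the round trip recovers the wide table that clustering and ordination methods usually expect.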

Figure 2.13 R data products. (a) Shared and unique pathways can be identified using a Venn diagram. (b) Hierarchical clustering or (c) dimensionality reduction methods such as PCA or NMDS can be used to determine the extent to which pathway profiles are shared between environmental samples. The grid visualizations framework ggplot2 can be used to create a wide variety of plots based on the Grammar of Graphics framework. In (b), pathways have been declined to the second level of the MetaCyc hierarchy. The areas of each circle indicate pathway abundance.

Chapter 3: Pooling Strategy for Massive Viral Sequencing

Figure 3.1 Combinatorial pooling strategy for viral samples sequencing.

Figure 3.2 Two pools for three samples; the samples have three, four, and two variants, respectively. All three samples can be reconstructed from these two pools by pool intersection and subtraction.
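Pool intersection and subtraction, as named in the caption of Figure 3.2, can be sketched with Python sets. The pool design below (one sample shared by both pools, the other two each in a single pool), the variant labels, and the assumption that samples share no variants are all illustrative; the caption does not state which sample belongs to which pool.

```python
# Hypothetical variant sets for three samples (three, four, and two variants).
sample_a = {"a1", "a2", "a3"}
sample_b = {"b1", "b2", "b3", "b4"}
sample_c = {"c1", "c2"}

# Assumed design: sample B is sequenced in both pools.
pool_1 = sample_a | sample_b
pool_2 = sample_b | sample_c


def deconvolve(first_pool, second_pool):
    """Recover three samples from two pools, assuming samples share no variants."""
    shared = first_pool & second_pool       # sample present in both pools
    only_first = first_pool - second_pool   # sample present only in the first pool
    only_second = second_pool - first_pool  # sample present only in the second pool
    return only_first, shared, only_second


recovered = deconvolve(pool_1, pool_2)
```

Under these assumptions two sequencing runs recover three samples, which is the saving that motivates pooling.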

Figure 3.3 Phylogenetic tree representing a union of two pools, one shown in light grey and the other in dark grey. The intersection of the two pools consists of a single sample (upper right cluster in the tree); however, the sequences sampled from that sample in the two pools are different.

Figure 3.4 Sequencing reduction coefficient for the pools generated by the VSPD algorithm for (a) random titer compatibility model graphs and (b) random graphs.

Figure 3.5 (a) Percentage of classified reads and (b) percentage of correctly classified reads. Bars represent a standard error.

Figure 3.6 (a) Percentage of samples without in silico contamination. (b) Total frequency of in silico contaminants within contaminated samples. Bars represent a standard error.

Figure 3.7 Root mean square error of the frequency estimations of haplotypes. Bars represent a standard error.

Figure 3.8 (a) Percentage of haplotypes from individually sequenced samples found in pooling experiment. (b) Total frequency of haplotypes from individually sequenced samples found in pooling experiment.

Figure 3.9 Phylogenetic trees of viral populations from several samples. Haplotypes obtained by individual sequencing of samples are shown in red, and haplotypes obtained from sequencing of pools are shown in blue.

Chapter 4: Applications of High-Fidelity Sequencing Protocol to RNA Viruses

Figure 4.1 (See Reference 5) Workflow. (a) DNA material from a viral population is cleaved into sequence fragments using any suitable restriction enzyme. (b) Individual barcode sequences are attached to the fragments. Each tagged fragment is amplified by the polymerase chain reaction (PCR). (c) Amplified fragments are then sequenced. (d) Reads are grouped according to the fragment of origin based on their individual barcode sequence. An error-correction protocol is applied for every read group, correcting the sequencing errors inside the group and producing corrected consensus reads. (e) Error-corrected reads are mapped to the population consensus. (f) SNVs are detected and assembled into individual viral genomes. The ordinary protocol lacks steps (b) and (d).

Figure 4.2 Overview of VGA (see Reference 5). (a) The algorithm takes as input paired-end reads that have been mapped to the population consensus. (b) The first step in the assembly is to determine pairs of conflicting reads that share different SNVs in the overlapping region. Pairs of conflicting reads are connected in the “conflict graph.” Each read has a node in the graph, and an edge is placed between each pair of conflicting reads. (c) The graph is colored with a minimal set of colors to distinguish between genome variants in the population. Colors of the graph correspond to independent sets of nonconflicting reads that are assembled into genome variants. In this example, the conflict graph can be minimally colored with four colors (red, green, violet, and turquoise), each representing individual viral genomes. (d) Reads of the same color are then assembled into individual viral genomes. Only fully covered viral genomes are reported. (e) Reads are assigned to assembled viral genomes. Reads may be shared across two or more viral genomes. VGA infers relative abundances of viral genomes using the expectation–maximization algorithm. (f) Long conserved regions are detected and phased based on expression profiles. In this example, red and green viral genomes share a long conserved region (colored in black). There is no direct evidence how the viral subgenomes across the conserved region should be connected. In this example, four possible phasings are valid. VGA uses the expression information of every subgenome to resolve ambiguous phasing.
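Steps (b) and (c) of Figure 4.2 can be sketched as follows. Reads are modeled here as hypothetical dictionaries mapping SNV positions to alleles, and the coloring uses a simple greedy heuristic, which is swapped in for illustration and does not guarantee the minimal coloring the caption describes; none of this is taken from the VGA code.

```python
from itertools import combinations


def conflicts(read_a, read_b):
    """True if two reads disagree at any position they both cover."""
    shared = read_a.keys() & read_b.keys()
    return any(read_a[pos] != read_b[pos] for pos in shared)


def build_conflict_graph(reads):
    """One node per read; an edge joins every pair of conflicting reads."""
    graph = {name: set() for name in reads}
    for a, b in combinations(reads, 2):
        if conflicts(reads[a], reads[b]):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def greedy_coloring(graph):
    """Greedy heuristic: high-degree nodes first, smallest free color each time."""
    colors = {}
    for node in sorted(graph, key=lambda n: (-len(graph[n]), n)):
        used = {colors[nb] for nb in graph[node] if nb in colors}
        color = 0
        while color in used:
            color += 1
        colors[node] = color
    return colors


# Hypothetical reads: SNV position -> observed allele.
reads = {
    "r1": {10: "A", 20: "C"},
    "r2": {20: "C", 30: "G"},
    "r3": {10: "T", 20: "G"},
    "r4": {20: "G", 30: "T"},
}
graph = build_conflict_graph(reads)
colors = greedy_coloring(graph)
```

Reads that receive the same color never conflict with each other, so each color class is a candidate set of reads for one genome variant.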

Figure 4.3 Genomic architecture of 44 real HCV viral genomes from a 1739-bp-long fragment of the E1E2 region (see Reference 5). The length of the longest common region shared between any two viral genomes is represented by color.

Figure 4.4 Assembly accuracy estimation (see Reference 5). Consensus error-corrected paired-end reads of various lengths were simulated from a mixture of 10 real viral clones from a 1.3-kb-long HIV-1 region. Assembly accuracy is measured by sensitivity and PPV when variant abundances follow uniform and power-law distributions. Results are for 50,000 reads; no improvement was observed when increasing the number of reads.

Figure 4.5 Accuracy of population size prediction (see Reference 5). Up to 200 viral genomes were generated from the Gag/Pol 3.4 kb HIV region. The population diversity is 5–10%. Variant abundances follow power-law (a) and uniform (b) distributions. Highly accurate 2 × 100 bp paired-end reads were simulated from the HIV population.

Figure 4.6 Assembly accuracy estimation (see Reference 5). Up to 200 viral genomes were generated from the Gag/Pol 3.4 kb HIV region. The population diversity is 3–20%. Variant abundances follow power-law (a) and uniform (b) distributions. Consensus error-corrected 2 × 100 bp paired-end reads were simulated from the HIV population.

Figure 4.7 Assembly accuracy estimation (see Reference 5). Up to 200 recombinant viral genomes were generated from the 1.3-kb-long HIV-1 region. Variant abundances follow power-law (a) and uniform (b) distributions. Consensus error-corrected 2 × 100 bp paired-end reads were simulated from the HIV population.

Chapter 5: Scaffolding Algorithms

Figure 5.1 SILP2 algorithm flow.

Figure 5.2 Four states A, B, C, and D.

Figure 5.3 Forbidding 3-cycles in SILP2.

Figure 5.4 Solving the maximum likelihood ILP via graph decomposition. (a) Graph decomposition into 2-connected components: the dark grey (1-cut) node splits the graph into two 2-connected components. The ILP is solved for each component separately. If the direction of the cut node in the ILP solution for one component is opposite to its direction in the solution for the other, then one of the two solutions is inverted. The ILP solutions for the two components are then collapsed into the parent solution. (b) Graph decomposition into two 3-connected components: the dark grey and light grey (2-cut) nodes split the graph into two 3-connected components. The ILP is solved for one component twice, once for the same and once for the opposite directions assigned to the two 2-cut nodes. These two solutions are then used in the objective for the ILP of the other component. Finally, the ILP solutions for the two components are collapsed into the parent solution.

Figure 5.5 (a) Pairwise ordering obtained from ILP output. (b) The corresponding bipartite graph. (c) The collection of simple paths and cycles in the bipartite matching. (d) The scaffold represented by a collection of paths obtained by deletion of the lightest edges from simple cycles.

Figure 5.6 Gap estimation is calculated from the fragment length, the lengths of the two contigs, the left mapping position of one read, and the right mapping position of the other read.

Figure 5.7 Three contigs with connecting bundles of read pairs and the corresponding scaffolding graph. Each contig is split into two nodes connected with a dummy edge. Each bundle of read pairs corresponds to an inter-contig edge connecting respective strands with the weight equal to the size of the bundle.

Figure 5.8 (a) A scaffold of four contigs: the connection of each pair of adjacent contigs is supported by bundles of read pairs. (b) The path in the scaffolding graph corresponding to the scaffold. (c) The matching of the scaffolding graph corresponding to the bundles of read pairs supporting adjacent contigs.

Figure 5.9 Insertion procedure: (a) The matching scaffold is obtained with the maximum weight matching. One unplaced contig is connected with edges to all four contigs of the matching, and a second unplaced contig is connected to two of them; each should be placed between a pair of adjacent contigs according to the consensus of its connecting edges. (b) Where there is sufficient distance between the two adjacent contigs, the unplaced contig is placed there, that is, the edge from the matching is replaced with the two edges to the inserted contig (the sum of the weights of the two new edges is less than the weight of the replaced edge); where there is not sufficient room, the connecting edges are removed. The resulting scaffold includes the inserted contig.

Chapter 6: Genomic Variants Detection and Genotyping

Figure 6.1 Types of genomic variation between the chromosomes of two samples. Structural variants include indels, copy number variants (CNVs), inversions, and translocations. Examples of (a) single-nucleotide polymorphism (SNP) and (b) tandem repeat are shown in the zoomed-in regions.

Figure 6.2 Differences between the gold-standard reference assembly of the rice cultivar Nipponbare and an Illumina whole-genome resequencing experiment of the same sample, classified by read position. Given the inbred nature of rice cultivars, most of the observed differences are due to sequencing errors.

Chapter 7: Discovering and Genotyping Twilight Zone Deletions

Figure 7.1 (A) Alignment whose interval length indicates a deletion, (B) alignment whose interval length indicates an insertion, (C) alignment where a split (in the left end) indicates a deletion, (D) alignment where a split (in the right end) indicates an insertion.

Figure 7.2 Internal segment size-based evidence for a deletion: the piece of sequence “GGTGGGGGAGG” is present in the reference but deleted in the donor genome. The length of the fragment that is sequenced (in green) is determined during library preparation. When mapped back onto the reference, the internal segment is longer than expected due to the deletion.

Figure 7.3 Internal segment size distribution for GoNL individual.

Figure 7.4 Split-read evidence for deletion.

Figure 7.5 MATE-CLEVER. First, the internal segment size-based tool CLEVER discovers deletions (red). The split-read aligner LASER then finds corresponding split-read alignments (blue) in the respective regions. The resulting prediction (red-blue) is that of LASER, as split-read aligners discover deletion breakpoints with higher accuracy.

Figure 7.6 Different types of evidence for a heterozygous variant. While the gray alignments provide evidence against a deletion, the alignments in red provide evidence for it. In case of internal segment evidence with alignments (counting alignments from above) 3–6, and gray alignments with alignments 1, 2, and 7 reflect the case in (7.9), whereas the gray alignments reflect the opposite case.

Figure 7.7 Gaussian distribution on interval size for alignments of normal reads (read) and reads indicating a deletion of length . Alignments whose intervals are of length provide no evidence, as both the existence and the nonexistence of the deletion are equally likely.

Chapter 8: Computational Approaches for Finding Long Insertions and Deletions with NGS Data

Figure 8.1 Donor genome with one long deletion and one long insertion. The figure shows the donor genome aligned to the reference genome.

Figure 8.2 A paired-end read. Unlike Figure 8.1, which uses one line to represent a genome, this figure uses two lines to represent a segment of a double-stranded genome. A paired-end read is generated from this segment of DNA, with one end from each strand. The length of the segment is called the insert size.

Figure 8.3 IGV view of an example of long deletion. The deletion is from 250,001 to 251,000. We can see that the read coverage in the region of deletion is lower, and there are many discordant paired-end reads with very large insert sizes.

Figure 8.4 IGV view of an example of long insertion. The insertion is between 750,000 and 750,001.

Figure 8.5 Insert size of an encompassing paired-end read is enlarged after mapping because of the deletion. When generated from the donor genome, the insert size of an encompassing paired-end read is normal (bottom). But because it encompasses a deletion, when mapped onto the reference genome, the insert size is enlarged by the size of the deletion (top).
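The arithmetic behind Figure 8.5 can be written out as a small Python sketch. The function names, the library insert size of 500 bp, its standard deviation, and the three-standard-deviation cutoff are illustrative assumptions; the 1,000 bp deletion matches the interval 250,001 to 251,000 shown in Figure 8.3.

```python
def mapped_insert_size(donor_insert_size, deletion_size):
    """Insert size seen on the reference for a pair that encompasses a deletion."""
    return donor_insert_size + deletion_size


def estimate_deletion_size(observed_insert_size, expected_insert_size):
    """Deletion size implied by an enlarged insert size (never negative)."""
    return max(0, observed_insert_size - expected_insert_size)


def is_discordant(observed_insert_size, mean, std_dev, k=3.0):
    """Flag pairs whose insert size is more than k standard deviations from the mean."""
    return abs(observed_insert_size - mean) > k * std_dev


# Assumed library: mean insert size 500 bp, standard deviation 50 bp.
observed = mapped_insert_size(500, 1000)
implied_deletion = estimate_deletion_size(observed, 500)
flagged = is_discordant(observed, 500, 50)
```

A single pair gives only a noisy size estimate because real insert sizes vary around the library mean, which is why callers usually look for several discordant pairs supporting the same event.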

Figure 8.6 Signature of paired-end split read for long deletion. One end of the read is generated spanning the breakpoint of a long deletion on the donor genome (bottom). The sequence of the end does not match a consecutive sequence in the reference genome. It has to be split in two separate segments, with each mapped onto a flanking region of the deletion on the reference genome.

Figure 8.7 Signature of paired-end reads for long insertion. The bottom shows where the sequences of two signature reads come from. The top shows the signatures they have when mapped onto the reference genome.

Figure 8.8 Type I pattern of deletion calling. Read 1 is a split read. Read 2 is an encompassing pair. Read 3 is a pair on the other haplotype without the deletion.

Figure 8.9 Type II pattern of deletion calling. Read 4 itself is an encompassing pair. The left end is split.

Chapter 9: Computational Approaches in Next-Generation Sequencing Data Analysis for Genome-Wide DNA Methylation Studies

Figure 9.1 Sequencing methods based on enrichment and bisulfite conversion for DNA methylation studies. A genome is fragmented by sonication or a restriction enzyme. After repairing the ends of fragments by linker ligation, fragments that are captured by MBD or MeDIP (left) or bisulfite treated (right) are amplified and sequenced.

Figure 9.2 The number of mapped reads at two regions is the same, but their methylation levels differ because of the different number of CpGs within the captured fragments; that is, one region shows twice as much methylation as the other.

Figure 9.3 Flow diagram of the MeQA analysis pipeline for MeDIP-seq.

Figure 9.4 Tag density plots around RefSeq genes (a) and CpG islands (b). Each line represents the normalized mean read coverage for a sample.

Figure 9.5 Results of saturation test by MEDIPS (a) and Michaelis–Menten kinetics (b).

Figure 9.6 Scatter plot of CpG coupling factor and RPKM (read depth), marked in magenta. The red and blue lines represent MeDIP-seq and input read density for coupling level, respectively. The green line represents the estimated linear fit of the proposed method.

Figure 9.7 Scatter plot of RPM (Reads Per Million reads) and RMS (Relative Methylation Score).

Figure 9.8 Several fitting models with performance measured by the goodness of fit.

Figure 9.9 Flow diagram of BALM.

Figure 9.10 Example showing how to compute methylation score for CpGs or windows.

Figure 9.11 Methylation entropy for four examples. While A and B have the same methylation score but different methylation entropy, C and D have the same entropy but different score.

Figure 9.12 Examples for QDMR entropy computation where the red and green points represent the original and processed methylation scores for samples, respectively, and , , and represent the raw, processed, and QDMR methylation entropy, respectively.

Figure 9.13 Examples with the same and but different .

Figure 9.14 Algorithm to identify DMRs, CMRs, and CUMRs among samples.

Figure 9.15 Example showing how to smooth methylation scores.

Chapter 10: Bisulfite-Conversion-Based Methods for DNA Methylation Sequencing Data Analysis

Figure 10.1 (a) DNA is treated with sodium bisulfite followed by PCR amplification, which converts unmethylated cytosines into thymines while methylated cytosines remain unchanged. (b) Directional bisulfite libraries generate reads from and strands corresponding to the original positive and negative genomic strands, respectively, and nondirectional bisulfite libraries also generate reads from and strands. (c) Mapping of sequenced BS-reads to the positive genomic strand is shown. The reads from and strands contribute to estimation of methylation states of the cytosines on the positive genomic strand, and the reads from and strands contribute to estimation of methylation states of the cytosines on the negative genomic strand.
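The conversion rule described in panel (a) can be illustrated with a minimal sketch. The function name `bisulfite_convert` and the use of a set of 0-based methylated positions are assumptions made for illustration only; they are not taken from the chapter or from any of the tools it compares.

```python
def bisulfite_convert(seq, methylated_positions):
    """Simulate bisulfite treatment plus PCR on one strand.

    Unmethylated cytosines are read as thymines; cytosines whose
    0-based index is in `methylated_positions` stay as cytosines.
    (Hypothetical helper for illustration only.)
    """
    return "".join(
        "T" if base == "C" and i not in methylated_positions else base
        for i, base in enumerate(seq)
    )


# The cytosine at index 1 is methylated and is therefore protected.
converted = bisulfite_convert("ACGCT", {1})
print(converted)
```

Comparing the converted read with the reference at each cytosine position is what lets an aligner infer methylation state, which is why the strand of origin matters in panels (b) and (c).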

Figure 10.2 Mapping efficiency and accuracy of alignment using simulated reads. The total number of uniquely mapped reads as a function of mapping accuracy for the three tools is shown. 120,000 simulated 80-bp long reads from human genome GRCh38 were generated according to the following conditions: (1) 2% of total bases of sequencing errors were introduced; (2) 1% of total bases of SNPs were randomly generated; (3) indels of random length with the maximum length of 10 bp were introduced to 1% of total reads; and finally (4) adapter sequences of random length with the maximum length of 10 bp were inserted at -end of reads to 10% of total reads. The numbers inside the symbols correspond to the number of option sets used with these three tools and shown in Table 10.1.

Figure 10.3 Performance in methylation call accuracy. Recall as a function of FDR (false discovery rate) is shown for the three tools and five different cutoff thresholds for determining the methylation status of cytosines for the data set of artificial 80-bp long reads generated from chromosome Y of human genome GRCh38. A total of 10 million reads were generated with 2% sequencing errors and 1% SNPs introduced randomly. Only cytosines covered by at least 10 reads were considered in this analysis. The total number of cytosines in chromosome Y was 9,262,721 on both strands of which BRAT-bw had 4,860,740 cytosines covered with at least 10 reads, Bismark had 4,900,224, and BS-Seeker2 had 4,656,122.

Figure 10.4 The distribution of a random sample of 325,000 absolute values of errors for each tool is shown. Absolute values of error are measured as absolute values of the difference between the true methylation level at each cytosine covered by at least 10 reads and the methylation level calculated by an aligner for the corresponding cytosines.

Chapter 11: Computational Methods for Transcript Assembly from RNA-SEQ Reads

Figure 11.1 Classes of transcript reconstruction methods. (a) De novo methods assemble reads based on sequence overlaps, represented as either an overlap graph or a de Bruijn graph. (b) Genome-based methods first map reads to a reference genome allowing for introns, then combine read alignments into a graph (overlap, connectivity, or splicing graph). Transcripts are enumerated from the graph and a final subset is selected using a variety of methods.

Figure 11.2 De Bruijn graph for two splice variants, , and , sampled by five reads. The expanded graph is shown at the top and the graph with compressed nonambiguous paths at the bottom. The alternative splicing event appears as a “bubble” with one of the two paths of length .

Figure 11.3 Inconsistencies occurring with paired-end reads.

Figure 11.4 Example of splicing graph and overlap graph. (a) Three different isoforms are sampled by the reads. Isoform uses a different splice site in the middle exon, whereas isoform skips the exon completely. (b) The overlap graph connects compatible reads and the three possible paths from the leftmost node (source ) to the rightmost node (sink ) resemble the three original isoforms. Pairs of reads are denoted by a gray dotted line, spliced alignments by a black dashed line. (c) The splice graph contains four nodes representing the four different exons and connects them as denoted by the directed edges.

Chapter 12: An Overview And Comparison of Tools for RNA-Seq Assembly

Figure 12.1 Paired-end reads. (a) A paired-end set of reads: the arrows represent the read length, and the position of the arrow indicates strand (top or bottom). The outer bracket represents the length of the fragment being sequenced while the inner bracket represents the mate pair inner distance. (b) Forward read (F, top strand); the header of a read is terminated with the read number followed by 1 or 2 after (/) denoting forward and reverse reads, respectively. (c) Reverse read (R, bottom strand); for example, the quality score at position 3 of the reverse read is encoded by “:”; this represents a phred score of 25 (ASCII_value(“:”) − 33).
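The quality encoding mentioned in this caption can be sketched as follows, assuming the common Phred+33 offset (an assumption; some older pipelines used an offset of 64). The function names are hypothetical and chosen only for this illustration.

```python
def phred_from_char(ch, offset=33):
    """Decode one FASTQ quality character into a phred score.

    The default offset of 33 is an assumption (Phred+33 encoding).
    """
    return ord(ch) - offset


def error_probability(q):
    """Convert a phred score Q into an error probability 10^(-Q/10)."""
    return 10 ** (-q / 10)


# ":" has ASCII value 58, so with offset 33 the score is 25.
print(phred_from_char(":"))
# A phred score of 30 corresponds to about 1 error in 1000 bases.
print(error_probability(30))
```

The second call shows why a phred score of 30 is usually read as an error rate of about 1 in 1000.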

Figure 12.2 Outline for comparison of RNA-seq tools in this study.

Figure 12.3 Output from FastQC. (a) The per-base sequence quality scores of the reads clearly indicate high-quality reads with the mean per-base error rate better than 1 in 1000 (phred score is better than 30). (b) The per-base sequence content. (c) The k-mer content in the reads.

Figure 12.4 The performance relationship in Table 12.2 is illustrated. The sensitivity (Sens) of both reference-based assemblers outperformed those of the de novo assemblers. Abbreviations: soap (SOAPdenovo-Trans), trin (Trinity).

Figure 12.5 The performance relationship in Table 12.3 is illustrated. The performance of all the assemblers increases slightly with increased coverage. Abbreviations: soap (SOAPdenovo-Trans), trin (Trinity).

Figure 12.6 Common transcripts in assemblers with twofold coverage. Tophat and SpliceMap share around 90% of the unique true transcripts. SOAPdenovo-Trans and Trinity share around 70% of the unique transcripts with fractional coverage better than 50% and 90%. Abbreviations: th (Tophat), sm (SpliceMap), soap (SOAPdenovo-Trans), trin (Trinity).

Figure 12.7 Common transcripts in assemblers with fivefold coverage. Tophat and SpliceMap share over 90% of the unique true transcripts. SOAPdenovo-Trans and Trinity share over 70% of the unique transcripts with fractional coverage better than 50% and 90%. Abbreviations: th (Tophat), sm (SpliceMap), soap (SOAPdenovo-Trans), trin (Trinity).

Chapter 13: Computational Approaches for Studying Alternative Splicing in Nonmodel Organisms from RNA-Seq Data

Figure 13.1 Illustration of RNA splicing and alternative splicing. In eukaryotes, genes are made up of exons and introns. Introns are cut out of an RNA, and exons are concatenated together to form an mRNA before translation into a protein. There is usually more than one way to define these exons and introns, leading to alternative splicing.

Figure 13.2 Illustration of transcriptome assembly strategy from RNA-Seq data in nonmodel organisms.

Figure 13.3 Example of the construction of a de Bruijn graph with from a given set of reads. Each read can be obtained from one of the paths in the de Bruijn graph by sliding a window of size along the read. The corresponding set of transcripts in FASTA format and splicing graph are also shown.
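The sliding-window construction described in this caption can be sketched as below. This is a minimal illustration under the assumption that nodes are k-mers and that an edge joins two k-mers that are consecutive in some read; the function name `de_bruijn_edges` and the toy reads are hypothetical and do not come from the chapter.

```python
from collections import defaultdict


def de_bruijn_edges(reads, k):
    """Build a de Bruijn graph as an adjacency map of k-mers.

    Each read is scanned with a window of size k; consecutive windows
    (which overlap by k - 1 characters) are joined by a directed edge.
    Reads with fewer than k + 1 characters contribute no edges.
    """
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k):
            graph[read[i:i + k]].add(read[i + 1:i + k + 1])
    return graph


# Two overlapping toy reads with k = 3.
graph = de_bruijn_edges(["ACGTAC", "CGTACG"], 3)
for node in sorted(graph):
    print(node, "->", sorted(graph[node]))
```

Each read then corresponds to a path in the graph, and walking that path while appending the last character of each successive k-mer spells the read back out.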

Figure 13.4 Sensitivity and specificity comparisons of Oases, Trans-ABySS, and Trinity with respect to mRNA BLAST results on three D. melanogaster libraries over different values of coverage cutoff with the k-mer length fixed to 25. Sensitivity is defined to be the percentage of nucleotide positions in the D. melanogaster transcriptome that are recovered through the top BLAST alignments from each predicted transcript in the assembly considering only D. melanogaster gene transcripts that are found in BLAST hits. Specificity is defined to be the percentage of predicted transcript positions in the assembly that are included in the top BLAST alignments considering only positions that have BLAST hits.

Figure 13.5 Sensitivity and specificity comparisons of Oases, Trans-ABySS, and Trinity with respect to alternative splicing junctions on three D. melanogaster libraries over different values of coverage cutoff with the k-mer length fixed to 25. Sensitivity is defined to be the percentage of junctions in the D. melanogaster gene transcripts that appear somewhere in the assembly. Specificity is defined to be the percentage of junctions in the assembly that appear somewhere in the D. melanogaster gene transcripts. Junctions in the D. melanogaster gene transcripts are defined by concatenating the two sequences of length that are immediately to the left and to the right of all alternatively spliced locations to obtain a sequence of length . Junctions in the assembly are defined by concatenating the two nonoverlapping k-mers at the beginning and ending nodes of an edge in the de Bruijn graph to obtain a sequence of length . Up to three mismatches are allowed when looking for occurrences of these sequences.

Chapter 14: Transcriptome Quantification and Differential Expression from NGS Data

Figure 14.1 Screenshot from Genome browser.

Figure 14.2 Paired reads and are simulated from the transcript . Each read is mapped to all other transcripts (). Mapping of the read into the transcript is not valid since the fragment length is 4 standard deviations away from the mean. Then each read is assigned to the corresponding read class—the read is placed in the read class and the read is placed in the read class .

Figure 14.3 Sensitivity, PPV, and F-score of IsoDE-Match ( bootstrap samples per condition) on the Illumina MAQC data, with varying bootstrap support threshold.

Figure 14.4 Screenshot from Genome browser of a gene with 21 subtranscripts.

Figure 14.5 Sensitivity, PPV, F-score, and accuracy of IsoDE-All (with 20 bootstrap runs per condition), edgeR, and GFOLD on the Illumina MCF-7 data with minimum fold change of 1 and varying number of replicates.

Figure 14.6 Sensitivity, PPV, and F-score of IsoDE-All (with 20 bootstrap runs per condition), edgeR, and GFOLD on the Illumina MCF-7 data, computed for quantiles of expressed genes after sorting in nondecreasing order of average FPKM for IsoDE and GFOLD and average count of uniquely aligned reads for edgeR. The first quantile of edgeR had 0 differentially expressed genes according to the ground truth (obtained by using all 7 replicates). *There are no genes present in this quantile. edgeR is not able to detect DE genes when their expression level is very low.

Figure 14.7 Sensitivity, PPV, F-score, and accuracy of IsoDE-All (with 20 bootstrap runs per condition), edgeR, and GFOLD on the Illumina MCF-7 data with varying number of replicates and minimum fold change 1.5.

Figure 14.8 Sensitivity, PPV, F-score, and accuracy of IsoDE-All (with 20 bootstrap runs per condition), edgeR, and GFOLD on the Illumina MCF-7 data with varying number of replicates and minimum fold change 2.

Chapter 15: Error Correction of NGS Reads from Viral Populations

Figure 15.1 Error profile of single-clone samples. Three types of errors are shown: nucleotide replacements, non-homopolymer indels, and indels in homopolymer.

Figure 15.2 Frequency of the true haplotype in single-clone samples (15). In each pair of bars, the left bar shows the percentage of all reads from true haplotypes and the right bar shows the frequency of the most common false haplotype. The average percentage of error-free reads (true sequence) in single-clone samples is . The most common false haplotype was found with an average frequency of but can be as frequent as (sample S4).

Figure 15.3 Minimum spanning tree of a distance graph of the NGS data set obtained from a single-clone sample. Each node represents a unique haplotype. The diameter of the node is proportional to its frequency. The true haplotype is shown in red, haplotypes with indel errors are shown in yellow, haplotypes with nucleotide substitutions are shown in blue, and haplotypes with both types of errors are shown in green. Here, is a complete weighted graph with the vertices corresponding to unique reads, and the weight of each edge is equal to the distance between the corresponding reads.

Figure 15.4 (a) -counts of -mers of a read; (b) distribution of -counts in a viral amplicon data set.

Figure 15.5 Distribution of lengths of error regions for one of the samples studied in Reference 15.

Chapter 16: Probabilistic Viral Quasispecies Assembly

Figure 16.1 A schematic representation of a quasispecies. Haplotypes are represented as dots and the corresponding frequency as size. The quasispecies emerges from a small set of master sequences, inner circle, by recombination, middle ring, and mutation, outer ring. Edges between dots represent the closest distance between haplotypes.

Figure 16.2 For different scenarios, schematic fitness landscapes are visualized. For each scenario, the fitness, reproductive capability, is shown as a function of the discrete sequence space. Under the assumption that only one haplotype has a high fitness (a), mutant variants reproduce with a negligible capability or not at all. In the survival of the flattest scenario, for example, with one (b) or two master sequences (c), mutant variants have only a slightly reduced fitness.

Figure 16.3 Polyprotein coding region of HIV-1 with reference positions of strain HXB2. The three functional regions gag, pol, and env are shown in their respective reading frames.

Figure 16.4 Two heterozygous virions of distinct subtypes infect the same host cell and, as a subsequent event, virions with recombinant genomes are produced.

Figure 16.5 Sequencing workflow from the purified RNA genome to the final reads. The workflow is visualized for Illumina paired-end and PacBio single-molecule sequencing.

Figure 16.6 Given three distinct haplotype sequences, with their SNVs in bold and true frequency annotated, a paired-end alignment of the reads to its consensus sequence with ambiguous bases is illustrated. Sequencing errors are indicated as pink characters. Reconstruction can be performed on three different spatial scales: position-wise SNV calling (solid box), local with a maximal length of a read (dotted box), and long-range (dashed box).

Figure 16.7 Schematic overview for a haplotype structure with a conserved region. (a) The underlying two true haplotypes, which are identical in the red colored region. (b) The multiple sequence alignment of reads sampled from the two haplotypes. (c) Paired-end reads with an insert size longer than the conserved region. (d) Very long reads that are longer than the conserved region.

Figure 16.8 (a) Two human haplotype DNA sequences of length , (b) only the variant sites, (c) its binary representation with 1 for the major allele, and (d) the genotype.

Figure 16.9 Schematic workflow of QuasiRecomb. The input data is an alignment of short erroneous reads from an ultradeep sequencing experiment. Sequencing errors are depicted as dots. Given the alignment, sequence profile generators and their frequencies are inferred. Mutations and recombination patterns between generators are illustrated as triangles and arrows, respectively. Given the estimated sequence profile generators, the haplotype distribution is sampled from the model.

Figure 16.10 Graphical representation of the HMM of QuasiRecomb. Only one observation is depicted; for the full model, the graph is replicated for indicated by the plate notation. Variable represents the master sequence, the nucleotide of the emitted haplotype, and the nucleotide of the observed read at position .

Figure 16.11 (a) Schematic illustration of a local reconstruction model with three profile generators , and . Given a read that covers the full length of interest, the Viterbi path is shown in black as a mosaic of the profile generators. (b) Transition state diagram in plate notation. Transition between plates indicates all possible transitions between profile sequence generators.

Figure 16.12 (a) Schematic illustration of a long-range reconstruction model with three profile generators , and and silent begin () and end () states. The read is shorter than the region of interest and can arbitrarily be located in the region of interest. (b) Transition state diagram in plate notation. Transition between plates indicates all possible transitions between profile sequence generators. Silent states to and to enable read placement.

Figure 16.13 (a) Schematic illustration of a paired-end long-range reconstruction model with two profile generators and , their respective silent states and , and silent begin () and end () states. The unsequenced fragment between reads is shown as a dashed line between read pairs. The additional silent insertion states allow recombination to occur between these two observed read pairs. (b) Transition state diagram in plate notation. Transition between plates indicates all possible transitions between profile and silent sequence generators. Silent states to and to enable read placement.

Chapter 17: Reconstruction of Infectious Bronchitis Virus Quasispecies from NGS Data

Figure 17.1 Evaluated reconstruction flows. The evaluated reconstruction flows consist of three steps: (i) read error correction, (ii) read alignment, and (iii) reconstruction of viral quasispecies.

Figure 17.2 Read coverage. Number of reads covering each position in the S1 gene.

Figure 17.3 Schematic representation of calibration and validation experiments based on Sanger clones.

Figure 17.4 The distribution of 454 IBV read categories (edit distance to the closest Sanger clone) for different correction methods.

Figure 17.5 Phylogenetic tree over collapsed Sanger clones and collapsed reconstructed variants inferred from the method with parameters 1_2_5 on KEC corrected reads using ViSpA.

Figure 17.6 Phylogenetic tree over collapsed Sanger clones and collapsed reconstructed variants inferred from one of the dominating methods with parameters 2_2_10 on KEC corrected reads using ViSpA.

Figure 17.7 Phylogenetic tree over collapsed Sanger clones and collapsed reconstructed variants inferred from one of the dominating methods with parameters 1_2_0 on SAET corrected reads using ViSpA.

Figure 17.8 Phylogenetic tree over collapsed Sanger clones and collapsed reconstructed variants inferred from one of the methods with default parameters on uncorrected reads using ShoRAH (close to the dominating methods).

Figure 17.9 Evaluation diagram for average prediction error (APE) and average distance to clones (ADC) values for different methods. Each point corresponds to a method and the dominant solutions correspond to red points.

Chapter 18: Microbiome Analysis: State of the Art and Future Trends

Figure 18.1 Plot of the standard measures of richness and diversity for a collection of samples. Each bar represents the richness (i.e., estimated number of OTUs present in the sample), and the line graph indicates the diversity of the sample. Bars are color-coded (grey shades) by clinical category and are arranged in descending order. However, no apparent pattern is discernible between the categories that are being compared, either in richness or in diversity (unpublished data).

Figure 18.2 Example of a bubble plot for visualizing association rules. Columns consist of overlapping antecedents (item sets) leading to the consequents (the rules) in the rows. Bubble size indicates support and color the strength of the interest measure.

Figure 18.3 Metabolic map of a module identified from a gene network. Nodes symbolize compounds, and lines connecting nodes are enzymes. All enzymes (lines) corresponding to a single KEGG map have the same color. All enzymes (lines) corresponding to a single module are highlighted and colored with module color.

Figure 18.4 Basic network diagram where nodes represent OTUs and edges represent co-occurrence in subjects. Edge color indicates positive (green) or negative (red) correlations. Node size is adjusted to reflect relative abundance using a log scale. A force-directed layout using the Fruchterman–Reingold algorithm is used. The position of each node is dependent on the strength of its interactions with all other nodes in the system. A heatscale has been used to assign a color to each node based on differential abundance between two groups of subjects. The greater the significance in the difference, the hotter (redder) the color of the node.

Figure 18.5 Once clubs have been identified, rival clubs are characterized by many negative edges between them.

Figure 18.6 The likelihood of event (“lawn is wet”) depends only on the probability of event (“it rained”). In contrast, the likelihood of event (“neighbor's lawn is wet”) depends on event and the probability of the event (“neighbor turned on the sprinkler”).

Figure 18.7 A complex PGM.

List of Tables

Chapter 1: Cloud Computing for Next-Generation Sequencing Data Analysis

Table 1.1 Cloud Resources for NGS Data Analysis

Chapter 3: Pooling Strategy for Massive Viral Sequencing

Table 3.1 Comparison of Frequency Distributions for Individually Sequenced and Pooled Samples

Chapter 5: Scaffolding Algorithms

Table 5.1 Scaffolding Data Sets (29, 30)

Table 5.2 Performance of Different Algorithms on the S. aureus Data Set

Table 5.3 Performance of Different Algorithms on the R. sphaeroides Data Set

Table 5.4 Performance of Different Algorithms on the P. falciparum Data Set (Short)

Table 5.5 Performance of Different Algorithms on the P. falciparum Data Set (Long)

Table 5.6 Performance of Different Algorithms on the Combined P. falciparum Data Set (Short + Long)

Table 5.7 Performance of Different Algorithms on the H. sapiens (chr 14) Data Set (Short)

Table 5.8 Performance of Different Algorithms on the H. sapiens (chr 14) Data Set (Long)

Table 5.9 Performance of Different Algorithms on the H. sapiens (chr 14) Data Set (Short + Long)

Chapter 7: Discovering and Genotyping Twilight Zone Deletions

Table 7.1 List of Used Software Tools

Table 7.2 Results for SV Prediction Tools on 30 HiSeq/MiSeq Data for Deletions from 10 to 69 bp

Table 7.3 Results for SV Prediction Tools on 30 HiSeq/MiSeq Data for Deletions from 70 to 199 bp

Table 7.4 Genotyping Performance for Homozygous Calls

Table 7.5 Genotyping Performance for Heterozygous Calls

Chapter 8: Computational Approaches for Finding Long Insertions and Deletions with NGS Data

Table 8.1 Simulation Precision and Sensitivity of SVseq1 and SVseq2 for Pooled Population Data Sets

Chapter 9: Computational Approaches in Next-Generation Sequencing Data Analysis for Genome-Wide DNA Methylation Studies

Table 9.1 Cytosines in Sequence Reads Based on Different Sequencing Protocols

Chapter 10: Bisulfite-Conversion-Based Methods for DNA Methylation Sequencing Data Analysis

Table 10.1 The Command-Line Options Used with the Three Tools from the Experiment for Mapping Accuracy Analysis on 120,000 Synthetic 80-bp Long Reads Generated from Human genome GRCh38

Chapter 11: Computational Methods for Transcript Assembly from RNA-SEQ Reads

Table 11.1 De Novo Transcriptome Assembly Methods and Their Properties

Table 11.2 Overview of Concepts Described in Sections 11.3.1-11.3.4

Chapter 12: An Overview And Comparison of Tools for RNA-Seq Assembly

Table 12.1 Simulated Human Paired-End Reads

Table 12.2 The Average Value of True Positive Rate of Assemblers with Twofold Transcript Coverage

Table 12.3 The Average Value of True Positive Rate of Assemblers with Fivefold Transcript Coverage

Table 12.4 The Percentage of the Transcripts that Are in Common between Assemblers with a Range of Different FPKM Thresholds

Table 12.5 The Percentage of the Transcripts that Are Common between Assemblers with a Range of Different FPKM Thresholds

Chapter 13: Computational Approaches for Studying Alternative Splicing in Nonmodel Organisms from RNA-Seq Data

Table 13.1 Examples of Publicly Available RNA-Seq Libraries in Nonmodel Organisms, with Organism Denoting the Nonmodel Organism, Closest Model Denoting the Closest Model Organism, Library Denoting the Number of Libraries, and Reference Denoting the Publication that Describes the Libraries

Table 13.2 Comparisons of Transcriptome Assemblies of Oases, Trans-ABySS, and Trinity on Three D. melanogaster Libraries over Different Values of Coverage Cutoff

Chapter 14: Transcriptome Quantification and Differential Expression from NGS Data

Table 14.1 Confusion Matrix for Differential Gene Expression

Table 14.2 Comparison Results between SimReg and RSEM

Table 14.3 Median Percent Error (MPE) and Together with 95% CI for Transcriptome Quantification on MAQC and NanoString Data Sets (1)

Table 14.4 Accuracy, Sensitivity, PPV, and F-Score in % for MAQC Illumina Data Set and Fold Change Threshold of 1, 1.5, and 2

Table 14.5 Accuracy, Sensitivity, PPV, and F-Score in % for Ion Torrent Data Set and Fold Change Threshold of 1, 1.5, and 2

Table 14.6 Accuracy, Sensitivity, PPV, and F-Score in % for the First 454 Data Set and Fold Change Threshold of 1, 1.5, and 2

Table 14.7 Accuracy, Sensitivity, PPV, and F-Score in % for Ion Torrent Pair HBRR: LUC-140_265 and UHRR: POZ-126_269, with Fold Change Threshold of 1, 1.5, and 2

Table 14.8 Accuracy, Sensitivity, PPV, and F-Score in % for Ion Torrent Pair HBRR: GOG-139_281 and UHRR: POZ-127_270, with Fold Change Threshold of 1, 1.5, and 2

Table 14.9 IsoDE Setup for Experiments with Replicates

Chapter 15: Error Correction of NGS Reads from Viral Populations

Table 15.1 Error Rates for Different NGS Platforms

Table 15.2 Frequencies of 10 Clones in 24 Samples from 454/Roche GS FLX Experiment

Table 15.3 Algorithms Comparison for Single-Clone (S) and Mixture (M) Samples

Table 15.4 Algorithms Performance on Average

Chapter 16: Probabilistic Viral Quasispecies Assembly

Table 16.1 Comparison of Next- and Third-Generation Sequencing Technologies (33, 34)

Chapter 17: Reconstruction of Infectious Bronchitis Virus Quasispecies from NGS Data

Table 17.1 Pairwise Edit Distance between the 10 Sanger Clones

Table 17.2 Edit Distance between Collapsed Sanger Clones and ViSpA Reconstructed Variants Using Parameters 1, 2, 5 (Number of Mismatches between Sub-Reads and Super-Reads, Number of Mismatches between Two Overlapped Reads and Mutation Rate, Respectively), Threshold0.005 on KEC Corrected Reads

Table 17.3 Edit Distance between Collapsed Sanger Clones and ViSpA Reconstructed Variants Using Parameters 2, 2, 10 (the Number of Mismatches between Sub-Reads and Super-Reads, the Number of Mismatches between Two Overlapped Reads, and Mutation Rate, Respectively), Threshold0.005 on KEC Corrected Reads, where 85% of Reconstructed Variants Have Perfect Match with 65% of the Sanger Clones

Table 17.5 Edit Distance between Collapsed Sanger Clones and ShoRAH Reconstructed Variants Using Default Parameters, Threshold0.005 on Uncorrected Reads

Table 17.6 Average Distance to Clones (ADC) for the Reconstructed Variants Using Different Methods

Table 17.7 Average Prediction Error (APE) for the Reconstructed Variants Using Different Methods

Wiley Series on

Bioinformatics: Computational Techniques and Engineering

A complete list of the titles in this series appears at the end of this volume.

Computational Methods for Next Generation Sequencing Data Analysis

Edited by

Ion I. Măndoiu

Alexander Zelikovsky

Copyright © 2016 by John Wiley & Sons, Inc. All rights reserved

Published by John Wiley & Sons, Inc., Hoboken, New Jersey

Published simultaneously in Canada

No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 750-4470, or on the web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permissions.

Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Neither the publisher nor author shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.

For general information on our other products and services or for technical support, please contact our Customer Care Department within the United States at (800) 762-2974, outside the United States at (317) 572-3993 or fax (317) 572-4002.

Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic formats. For more information about Wiley products, visit our web site at www.wiley.com.

Library of Congress Cataloging-in-Publication Data:

Names: Măndoiu, I. Ion, editor of compilation. | Zelikovsky, Alexander, editor of compilation.

Title: Computational methods for next generation sequencing data analysis / edited by Ion I. Măndoiu, Alexander Zelikovsky.