Fig. 1.

Overall framework of the Rascaf algorithm. Step 1: Prepare the raw assembly by splitting the scaffold-level assembly at runs of Ns. Paired-end RNA sequencing (RNA-seq) reads (red) connect four contigs (blue boxes) in the raw genome assembly. Step 2: Build the exon blocks by clustering read alignments along the genome. Step 3: Build the gene blocks by connecting exon blocks through introns extracted from spliced reads. Step 4: Build the gene block graph. Each gene block is represented by two nodes connected by a block edge (thick lines); ends of contig nodes linked by paired-end reads are then connected by mate edges (thin lines). Continuous lines represent the selected block scaffolds along the heaviest path in the gene block graph, whereas dotted lines mark unselected edges in the graph. Step 5: Given a block scaffold determined above, find a set of candidate connections between contigs underlying the gene blocks. Step 6: Build a contig graph by aggregating connections derived from multiple RNA-seq data sets. Each contig is represented by a pair of nodes connected by a contig edge (thick lines). Additionally, contigs that are adjacent in a scaffold in the raw assembly, or that were part of a contig connection detected in Step 5, are linked by a scaffold edge (thin lines). Step 7: Determine a set of cycle-free paths in the contig graph, using topological sorting, and use them to guide the construction of the new scaffolds.
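The contig graph representation described in this caption (two end nodes per contig, joined by a contig edge, with scaffold edges linking ends of different contigs) can be illustrated with a minimal Python sketch. It is not Rascaf's implementation. The function names, the `"L"`/`"R"` end labels, and the input format are assumptions made for illustration. The sketch also assumes that each contig end carries at most one scaffold edge, and it simply leaves contigs that sit on a cycle unplaced rather than resolving cycles by topological sorting as the caption describes.

```python
def other_end(node):
    """Return the opposite end of the same contig (follows the contig edge)."""
    contig, end = node
    return (contig, "R" if end == "L" else "L")


def extract_paths(contigs, scaffold_edges):
    """Walk alternating contig and scaffold edges to list linear paths.

    contigs: iterable of contig names (hypothetical identifiers).
    scaffold_edges: list of (node_a, node_b), node = (contig, "L" or "R").
    Returns a list of paths; each path is a list of (contig, strand) with
    strand "+" if the contig is entered at its left end, "-" otherwise.
    """
    partner = {}
    for a, b in scaffold_edges:
        if a in partner or b in partner:
            # Simplifying assumption of this sketch, not of the source.
            raise ValueError("contig end used by more than one scaffold edge")
        partner[a] = b
        partner[b] = a

    visited = set()
    paths = []
    for contig in contigs:
        for start in ((contig, "L"), (contig, "R")):
            # Only start a walk at a free end of an unvisited contig.
            if contig in visited or start in partner:
                continue
            path = []
            node = start
            while True:
                name, end = node
                visited.add(name)
                path.append((name, "+" if end == "L" else "-"))
                exit_node = other_end(node)      # cross the contig edge
                if exit_node not in partner:     # free end: path finished
                    break
                node = partner[exit_node]        # cross the scaffold edge
            paths.append(path)
    return paths


example_paths = extract_paths(
    ["c1", "c2", "c3", "c4"],
    [(("c1", "R"), ("c2", "L")), (("c2", "R"), ("c3", "R"))],
)
```

In this toy input, `c3` is joined through its right end, so it appears in reverse orientation in the resulting path, and `c4`, which has no scaffold edge, forms a path of its own.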


Fig. 2.

Performance evaluation of programs on simulated data. Sensitivity (Sn) and precision (Pr) are defined in the text.


Fig. 3.

Examples of in silico validation of contig connections detected in the Pyrus communis genome. (A) Positive (validated) connection: alignments with the database homolog cover all gene blocks (marked by red tick marks along the horizontal axis) and are consistent in order and orientation. (B) Uncertain connection: alignments with the homolog do not cover the 256 bp in the second gene block. (C) Negative (incorrect) connection: alignments with the database homolog cover all gene blocks but are inconsistent in order and orientation. Here, however, the translocation is due to a misassembly within a contig of the original assembly. (D) Chimeric connection: alignments from three database homologs collectively cover all gene blocks. The chimeric construct here is likely due to the repetitive nature of the gene.


Fig. 4.

Gene content evaluation of the improved Arabidopsis thaliana assemblies using 0, 1, …, 11 RNA sequencing data sets. Transcript coverage plots show the number of transcripts with a fraction x or more of their bases contained in the primary alignment of that transcript on the corresponding A. thaliana assembly, for coverage levels x = 0.05, 0.1, …, 1.0.
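The cumulative counts plotted in such a coverage curve can be computed as in the following minimal Python sketch. It assumes that a per-transcript coverage fraction (bases in the primary alignment divided by transcript length) has already been obtained. The function name `coverage_curve` and the example fractions are hypothetical and are not taken from the study.

```python
def coverage_curve(fractions, steps=20):
    """Count transcripts covered at or above each threshold x.

    fractions: per-transcript coverage fractions in [0, 1].
    steps: number of thresholds; 20 gives x = 0.05, 0.1, ..., 1.0.
    Returns a list of (x, count) pairs, one per threshold.
    """
    thresholds = [i / steps for i in range(1, steps + 1)]
    return [(x, sum(1 for f in fractions if f >= x)) for x in thresholds]


# Hypothetical coverage fractions for four transcripts.
curve = coverage_curve([1.0, 0.5, 0.07, 0.0])
```

Because each point counts transcripts at or above the threshold, the curve never increases as x grows, and an improved assembly shows up as a curve lying above that of the original assembly.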


Fig. 5.

Gene content evaluation of the Fragaria iinumae and F. nipponica assemblies before and after improvement with RNA-seq data. Two data sets, SRR1930097 (F. × ananassa) and ERR430941 (F. vesca), were used as input to Rascaf. Coverage plots show the number of transcripts with a fraction x or more of their bases contained in the primary alignment.