RNA-Seq de novo analysis

Demo Report


Figure 1.Experiments pipeline

Poly-A containing mRNA molecules are purified using poly-T oligo-attached magnetic beads from eukaryote (prokaryocyte can be treated with kit to remove rRNA before next step). Fragmentation mix is added to fragments the mRNA. The cDNA is synthesized by using random Hexamer priming. The second-strand is generated to create double-strand cDNA . Purify the cDNA templates by the use of qiagen kit followed by end repair, poly A tailing and adaptor connection. At last, the library could be sequenced using IlluminaHiSeq™ 2500.



Figure 2. Bioinformatics pipeline

Table 1. Sequencing summary
Biological Sample
Total Illumina Reads
Read Length
Total Base
Total Reads after QT
Cleaned Read Length
Total Clean Base

To check the reads quality after sequencing, nucleotides with low quality scores (< 13) are firstly trimmed from the sequence reads. First step in the trim process is to convert the quality score (Q) to error probability. Next, for every base a new value is calculated:

0.05-Error probability

This value will be negative for low quality bases, where the error probability is high. For every base, we calculate the running sum of this value. If the sum drops below zero, it is set to zero. The part of the sequence to be retained after trimming is the region between the first positive value of the running sum and the highest value of the running sum. Everything before and after this region will be trimmed off. A read will be completely removed if the score never makes it above zero. In addition, if the read length is shorter than 35bp, the read will be discarded.


To perform de novo assembly, we pooled clean reads, which were quality trimmed and removed control sequences from samples and adopt Trinity [Grabherr MG et al., 2011]. Trinity, developed at the Broad Institute and the Hebrew University of Jerusalem, represents a novel method for the efficient and robust de novo reconstruction of transcriptomes from RNA-seq data. It combines three independent software modules: Inchworm, Chrysalis, and Butterfly, applied sequentially to process large volumes of RNA-seq reads. Trinity partitions the sequence data into many individual de Bruijn graphs, each representing the transcriptional complexity at at a given gene or locus, and then processes each graph independently to extract full-length splicing isoforms and to tease apart transcripts derived from paralogous genes. It also performs best with strand-specific data, in which case sense and antisense transcripts can be resolved.

All contig sequences generated by Trinity are in FASTA format (transcripts.fa) and the description line of the FASTA format is like “>comp1_c0_seq1 len=2364”. "comp1_c0" can be regarded as genetic region 1. If this genetic region have three alternative forms, the other two transcripts are described to ">comp1_c0_seq2" and ">comp1_c0_seq3",respectively. Here, we pick the contig sequence with the longest one at the same genetic region as the unigene (e.g. unigenes.fa). The statistic of do novo assembly shows in Table 3 and 4 and the length distribution of unigenes shows in Figure 3.

Figure 3. Length distribution of unigenes


Table 2. The De novo statistic of all contigs sequences

GC content 43.80%
Max contig length 6,532
Min contig length 203
Avg contig length 693.01
Total contig length 48,935,469
N90 303
N80 438
N70 638
N60 813
N50 1,033
NG50 1,033
Number of contig 70,846
Number of contig >= 300b 50,946
Total bases in contigs > 300b 44,113,883
The percentage of contig length >= 300b 90.10%
Number of contig >= 1kb 16,039
Total bases in contigs > 1kb 23,579,459
The percentage of contig length >= 1kb 53.20%




Table 3. The De novo statistic of unigenes

GC content 44.93%
Max contig length 6,532
Min contig length 203
Avg contig length 604.31
Total contig length 28,723,633
N90 233
N80 353
N70 483
N60 673
N50 883
NG50 883
Number of contig 47,535
Number of contig >= 300b 30,423
Total bases in contigs > 300b 24,569,132
The percentage of contig length >= 300b 85.30%
Number of contig >= 1kb 8,303
Total bases in contigs > 1kb 12,899,433
The percentage of contig length >= 1kb 43.90%






We map trimmed reads of the sample to the unigene sequences by bowtie2 and using gapped alignment mode [Langmead, et al., 2012]. This method is essential for variant discovery because the sequence reads may contain insertion-deletion polymorphism (INDEL). Without this alignment approach, a read may still be mapped onto the correct position but with consecutive mismatches at INDEL locations.

After alignment, we quantify gene expression by eXpress and show the results in EXCEL file (unigene_RPKM.xlsx) [Roberts, et al., 2012]. The special term in the EXCEL file is RPKM. RPKM, expected fragments per kilobase of transcript per million fragments sequenced, is defined [Trapnell, et al., 2010]:

The total exon reads is the number of reads that have been mapped to a region in which an exon is annotated for the gene or across the boundaries of two exons or an intron and an exon for an annotated transcript of the gene. The mapped reads (million) is the total number of reads that after mapping have been mapped to the region of the gene. For paired-end reads, when two reads map as a pair, the pair is counted as one. If the pair is broken, none of the reads are counted. The exon length (KB) is calculated as the sum of the lengths of all exons annotated for the gene, divided by 1000.

Table 4. Alignment Summary

Total cleaned reads
Reads which mapped to Bo's unigene

Table4. Summary of gene annotation

Number of genes with hit
Predicted genes with hit in NCBI nr database

We annotated the predicted protein sequences by doing a BLAST (blastp) search against the NCBI nr protein database (released February 2013) with e-value cutoff of 0.00001. The results were showed in the Excel file (BLAST_GO_KEGG.xlsx)
The nr protein database which was maintained by NCBI is a collection of sequence from SwissProt, PIR, PDB, PRF. Entries with absolutely identical sequences have been merged.

The special term in the Excel file was Evalue, Hsp_score and Percentage_of_identity. Evalue, Expect value indicated the validity of the match: the smaller the Evalue, the more likely the match was "good" and represents real similarity rather than a chance match. While the lower the Evalue was closer to "0", the more "significant" the match was. Hsp_score, high scoring segment pair, was the score of local optimal alignment. Percentage_of_identity meant the percentage of identity between query sequence and reference sequence.

To perform the functional annotation of predicted genes. Hits’ accession numbers were mapped to GO terms by querying GO database. These candidate terms were filtered with a given rule. A GO term which was most specific and has no conflicts in its parents in the GO tree would be chosen as ORF’s term provided by its hits. GO term distributions at level 2 for each GO type were calculated and saved as tables in a Excel file. Each row in the table showed the number of sequences at the nodes of level 2. Detailed annotations for individual ORF sequences could be also found in the same Excel file (BLAST_GO_KEGG.xlsx).




Fig4. The pi chart of GO term distributions at level 2 in biological process.


Table5. The GO term distributions at level 2 in biological process.

biological process GO:0008152
metabolic process 686
biological process GO:0009987
cellular process 635
biological process GO:0071840
cellular component organization or biogenesis 95
biological process GO:0065007
biological regulation 92
biological process GO:0051179
localization 86
biological process GO:0050896
response to stimulus 78
biological process GO:0023052
signaling 27
biological process GO:0032502
developmental process 12
biological process GO:0015976
carbon utilization 6
biological process GO:0040011
locomotion 3
biological process GO:0051704
multi-organism process 3




Fig5. The pi chart of GO term distributions at level 2 in cellular component.


Table6. The GO term distributions at level 2 in cellular component.

cellular component GO:0005623
cell 291
cellular component GO:0032991
macromolecular complex 139
cellular component GO:0016020
membrane 105