Figure 1.Experiments pipeline
Poly-A containing mRNA molecules are purified using poly-T oligo-attached magnetic beads from eukaryote (prokaryocyte can be treated with kit to remove rRNA before next step). Fragmentation mix is added to fragments the mRNA. The cDNA is synthesized by using random Hexamer priming. The second-strand is generated to create double-strand cDNA . Purify the cDNA templates by the use of qiagen kit followed by end repair, poly A tailing and adaptor connection. At last, the library could be sequenced using IlluminaHiSeq™ 2500.
Figure 2. Bioinformatics pipeline
|Total Illumina Reads||
|Total Reads after QT||
|Cleaned Read Length||
|Total Clean Base||
To check the reads quality after sequencing, nucleotides with low quality scores (< 13) are firstly trimmed from the sequence reads. First step in the trim process is to convert the quality score (Q) to error probability. Next, for every base a new value is calculated:
This value will be negative for low quality bases, where the error probability is high. For every base, we calculate the running sum of this value. If the sum drops below zero, it is set to zero. The part of the sequence to be retained after trimming is the region between the first positive value of the running sum and the highest value of the running sum. Everything before and after this region will be trimmed off. A read will be completely removed if the score never makes it above zero. In addition, if the read length is shorter than 35bp, the read will be discarded.
To perform de novo assembly, we pooled clean reads, which were quality trimmed and removed control sequences from samples and adopt Trinity [Grabherr MG et al., 2011]. Trinity, developed at the Broad Institute and the Hebrew University of Jerusalem, represents a novel method for the efficient and robust de novo reconstruction of transcriptomes from RNA-seq data. It combines three independent software modules: Inchworm, Chrysalis, and Butterfly, applied sequentially to process large volumes of RNA-seq reads. Trinity partitions the sequence data into many individual de Bruijn graphs, each representing the transcriptional complexity at at a given gene or locus, and then processes each graph independently to extract full-length splicing isoforms and to tease apart transcripts derived from paralogous genes. It also performs best with strand-specific data, in which case sense and antisense transcripts can be resolved.
All contig sequences generated by Trinity are in FASTA format (transcripts.fa) and the description line of the FASTA format is like “>comp1_c0_seq1 len=2364”. "comp1_c0" can be regarded as genetic region 1. If this genetic region have three alternative forms, the other two transcripts are described to ">comp1_c0_seq2" and ">comp1_c0_seq3",respectively. Here, we pick the contig sequence with the longest one at the same genetic region as the unigene (e.g. unigenes.fa). The statistic of do novo assembly shows in Table 3 and 4 and the length distribution of unigenes shows in Figure 3.
Figure 3. Length distribution of unigenes
Table 2. The De novo statistic of all contigs sequences
|Max contig length||6,532|
|Min contig length||203|
|Avg contig length||693.01|
|Total contig length||48,935,469|
|Number of contig||70,846|
|Number of contig >= 300b||50,946|
|Total bases in contigs > 300b||44,113,883|
|The percentage of contig length >= 300b||90.10%|
|Number of contig >= 1kb||16,039|
|Total bases in contigs > 1kb||23,579,459|
|The percentage of contig length >= 1kb||53.20%|
Table 3. The De novo statistic of unigenes
|Max contig length||6,532|
|Min contig length||203|
|Avg contig length||604.31|
|Total contig length||28,723,633|
|Number of contig||47,535|
|Number of contig >= 300b||30,423|
|Total bases in contigs > 300b||24,569,132|
|The percentage of contig length >= 300b||85.30%|
|Number of contig >= 1kb||8,303|
|Total bases in contigs > 1kb||12,899,433|
|The percentage of contig length >= 1kb||43.90%|
We map trimmed reads of the sample to the unigene sequences by bowtie2 and using gapped alignment mode [Langmead, et al., 2012]. This method is essential for variant discovery because the sequence reads may contain insertion-deletion polymorphism (INDEL). Without this alignment approach, a read may still be mapped onto the correct position but with consecutive mismatches at INDEL locations.
After alignment, we quantify gene expression by eXpress and show the results in EXCEL file (unigene_RPKM.xlsx) [Roberts, et al., 2012]. The special term in the EXCEL file is RPKM. RPKM, expected fragments per kilobase of transcript per million fragments sequenced, is defined [Trapnell, et al., 2010]:
The total exon reads is the number of reads that have been mapped to a region in which an exon is annotated for the gene or across the boundaries of two exons or an intron and an exon for an annotated transcript of the gene. The mapped reads (million) is the total number of reads that after mapping have been mapped to the region of the gene. For paired-end reads, when two reads map as a pair, the pair is counted as one. If the pair is broken, none of the reads are counted. The exon length (KB) is calculated as the sum of the lengths of all exons annotated for the gene, divided by 1000.
Table 4. Alignment Summary
Total cleaned reads
Reads which mapped to Bo's unigene
Table4. Summary of gene annotation
Number of genes with hit
|Predicted genes with hit in NCBI nr database||
We annotated the predicted protein sequences by doing a BLAST (blastp) search against the NCBI nr protein database (released February 2013) with e-value cutoff of 0.00001. The results were showed in the Excel file (BLAST_GO_KEGG.xlsx).
The nr protein database which was maintained by NCBI is a collection of sequence from SwissProt, PIR, PDB, PRF. Entries with absolutely identical sequences have been merged.
The special term in the Excel file was Evalue, Hsp_score and Percentage_of_identity. Evalue, Expect value indicated the validity of the match: the smaller the Evalue, the more likely the match was "good" and represents real similarity rather than a chance match. While the lower the Evalue was closer to "0", the more "significant" the match was. Hsp_score, high scoring segment pair, was the score of local optimal alignment. Percentage_of_identity meant the percentage of identity between query sequence and reference sequence.
To perform the functional annotation of predicted genes. Hits’ accession numbers were mapped to GO terms by querying GO database. These candidate terms were filtered with a given rule. A GO term which was most specific and has no conflicts in its parents in the GO tree would be chosen as ORF’s term provided by its hits. GO term distributions at level 2 for each GO type were calculated and saved as tables in a Excel file. Each row in the table showed the number of sequences at the nodes of level 2. Detailed annotations for individual ORF sequences could be also found in the same Excel file (BLAST_GO_KEGG.xlsx).
Fig4. The pi chart of GO term distributions at level 2 in biological process.
Table5. The GO term distributions at level 2 in biological process.
Seqs in GO:0008152 (metabolic process)
Seqs in GO:0009987 (cellular process)
||cellular component organization or biogenesis||
Seqs in GO:0071840 (cellular component organization or biogenesis)
Seqs in GO:0065007 (biological regulation)
Seqs in GO:0051179 (localization)
||response to stimulus||
Seqs in GO:0050896 (response to stimulus)
Seqs in GO:0023052 (signaling)
Seqs in GO:0032502 (developmental process)
Seqs in GO:0015976 (carbon utilization)
Seqs in GO:0040011 (locomotion)
Seqs in GO:0051704 (multi-organism process)
Fig5. The pi chart of GO term distributions at level 2 in cellular component.
Table6. The GO term distributions at level 2 in cellular component.
Seqs in GO:0005623 (cell)
Seqs in GO:0032991 (macromolecular complex)
Seqs in GO:0016020 (membrane)