RNA-Seq advanecd analysis


Demo Report

exp_pipe

Figure 1. Experiment pipeline of RNA-Seq

mRNA molecules with poly-A tails were purified using poly-T oligo-attached magnetic beads from eukaryote (prokaryocyte could be treated with kit to remove rRNA before next step). The cDNA was synthesized by using these mRNA fragments as reverse transcription templates. The cDNA was then purified by the use of Qiagen Purification Kit and followed by performing end repair, A-tailing, and PCR amplification. After these procedures, the library could be sequenced using IlluminaHiSeq™ 2500.

bioinfo_pipe

Fig2. Bioinformatics analysis pipeline

Table1. Sequencing Summary

Run
Biological Sample-1
Biological Sample-2
Data Required (reads)
10,000,000
10,000,000
Total Reads
11,151,815
11,997,258
Read Length
51
51
Total Base
568,742,565
611,860,158
Number of Reads After Trim (paired)
11,018,703
11,861,189
Avg. Length After Trim
50.8
50.8
Total Base
559,750,112
602,548,401
Run
Biological Sample-3
Biological Sample-4
Data Required (reads)
10,000,000
10,000,000
Total Reads
11,372,864
12,843,546
Read Length
51
51
Total Base
580,016,064
655,020,846
Number of Reads After Trim (paired)
11,246,642
12,695,766
Avg. Length After Trim
50.8
50.8
Total Base
571,329,414
644,944,913

First step in the trim process was converting the quality score (Q) to error probability. Next, for every base a new value was calculated:

This value would be negative for low quality bases, where the error probability was high. For every base, we calculated the running sum of this value. If the sum dropped below zero, it was set to zero. The part of the sequence between the first positive value of the running sum and the highest value of the running sum was retained. Everything before and after this region was trimmed off. In addition, if the read length was shorter than 35bp, the read would also be discarded.

Table2. RNA reads mapping statstics

Run
Biological Sample-1
Biological Sample-2
Biological Sample-3
Biological Sample-4
Mapped Reads
9,579,330
10,673,474
10,225,553
11,207,637
Unmapped reads
1,439,373
1,187,715
1,021,089
1,488,129

We mapped trimmed reads of each sample to the reference genome by using gapped alignment and soft clip strategy to calculate gene expression. A read is mapped onto the reference if it has at least 80% similarity in matched region, and the matched region is at least 90% of the read length fraction.

Table3. Number of SNPs/INDELs in each sample

Run
Biological Sample-1
Biological Sample-2
Biological Sample-3
Biological Sample-4
No. of SNPs
155
174
146
163
No. of INDELs
0
0
0
0

The CLC Genomics Workbench Reference Probabilistic variant detection function was used to analysis SNP and INDEL. The Probabilistic Variant Caller is performed based on the algorithm combines a Bayesian model with a Maximum Likelihood approach to calculate prior and error probabilities for the Bayesian model. After prior and error probabilities have been calculated, the probability for each combination of alleles (e.g. A/G) after observing a certain combination of nucleotides from the reads at every position in the genome is determined. This probability is then used to determine which of the allele combinations is the most likely for each position. If the base or bases identified do not match the base at that position in the reference genome, a variation is called. The types of variants include Single Nucleotide Variants (SNVs), insertions, deletions and MNVs (Multiple Nucleotide Variants). A variant is called if a variant probability is larger than 90% with a minimum coverage of 10 and required presence in both forward and reverse reads. The detailed SNP and INDEL information is included in the EXCEL files in the analysis\SNP_INDEL\ folder.

Reporting of amino acid changes in SNP and INDEL detection now follows nomenclature for the description of sequence variants. Please link to the website for detailed explanation (http://www.hgvs.org/mutnomen/recs-prot.html). Briefly, “p.” is used to indicate description at protein level; “?” used for unknown (rather than non-standard “Unknown”); “=” used to denote an allele which agrees with the reference sequence (rather than missing entries or entries like Ala45Ala); “[…]” used around, -seperated lists of changes, each change coming from a different CDS annotation; “[…];[…]” scheme used to separate multiple alleles at same site.

The original gene expression values often need to be transformed and/or normalized in order to ensure that samples are comparable and assumptions on the data for analysis are met [Allison et al., 2006]. These are essential requirements for carrying out a meaningful analysis. Finally, we summarize those values about mRNA expression into EXCEL files in the analysis\RPKM\ folder. The special term in the EXCEL file is RPKM. RPKM, Reads Per Kilobase of exon model per Million mapped reads, is defined [Mortazavi et al., 2008]:
RPKM

The total exon reads is the number of reads that have been mapped to a region in which an exon is annotated for the gene or across the boundaries of two exons or an intron and an exon for an annotated transcript of the gene. The mapped reads (million) is the total number of reads that after mapping have been mapped to the region of the gene. The exon length (KB) is calculated as the sum of the lengths of all exons annotated for the gene, divided by 1000. Each exon is included only once in this sum, even if it is present in more annotated transcripts for the gene.

Figure. Heatmap Analysis



Table4. Number of Differential expression mRNA (p-value < 0.05)

Run

Number of Differential expression mRNA (p-value < 0.05)

Biological Sample-1 & Biological Sample-3

457

Biological Sample-2 & Biological Sample-4

1,310

Biological Sample-2 & Biological Sample-1

958

Biological Sample-4 & Biological Sample-3

1,458

We utilize the R package DESeq to analyze the statistic test for differential expression by using the negative binomial distribution to model the null distribution of the read count data [Anders et al., 2010]. The negative binomial distribution is a generalization of the Poisson model that allows to model biological variance correctly. The detailed Differemtial expression information is included in the EXCEL files in the analysis\Differential_expression\ folder.

The Gene Ontology project provides ontology of defined GO terms representing gene product properties. The ontology covers three domains: cellular component, the parts of a cell or its extracellular environment; molecular function, the elemental activities of a gene product at the molecular level; and biological process, operations or sets of molecular events with a defined beginning and end.

For analysis the gene ontoloogy of the 4 sets of differential expressed gene, we first performed the a BLAST (blastp) search against the NCBI nr protein database of Viridiplantae (released February 2013) using reference genome (Arabidopsis thaliana) protein amino acid sequences as query with e-value cutoff of 10^-5. The results were showed in the Excel file (analysis\blastp.xlsx). The NR protein database which was maintained by NCBI is a collection of sequence from SwissProt, PIR, PDB, PRF. Entries with absolutely identical sequences have been merged.

After BLAST, the results of 4 gene sets were picked out and the hits’ accession numbers were then mapped to GO terms by querying GO database. These candidate terms were filtered with a given rule. A GO term which is most specific and has no conflicts in its parents in the GO tree will be chosen as ORF’s term provided by its hits. GO term distributions at level 2 for each GO type were calculated and saved as tables in an excel file located in GO_KEGG folder of each gene set (eg. analysis\GO_KEGG\Diff_Biological Sample-1_Biological Sample-3_gene_set\GO.xlsx). Each row in the table shows the number of sequences at the nodes of level 2. GO term distributions at all level for each GO type were calculated and saved as tables in the same excel file.

 

bp_lv2

Fig3. The pi chart of GO term distributions at level 2 in biological process

 

Table5. The GO term distributions at level 2 in biological process.

Category GO-id GO-term Count
biological process GO:0009987
cellular process 318
biological process GO:0008152
metabolic process 295
biological process GO:0050896
response to stimulus 232
biological process GO:0065007
biological regulation 180
biological process GO:0051179
localization 128
biological process GO:0032502
developmental process 102
biological process GO:0071840
cellular component organization or biogenesis 101
biological process GO:0032501
multicellular organismal process 100
biological process GO:0051704
multi-organism process 81
biological process GO:0023052
signaling 69
biological process GO:0002376
immune system process 51
biological process GO:0000003
reproduction 46
biological process GO:0016265
death 30
biological process GO:0040007
growth 28
biological process GO:0048511
rhythmic process 10
biological process GO:0008283
cell proliferation 9
biological process GO:0043473
pigmentation 8
biological process GO:0001906
cell killing 5
biological process GO:0015976
carbon utilization 4
biological process GO:0022610
biological adhesion 3
biological process GO:0040011
locomotion 1

 

cc_lv2

Fig4. The pi chart of GO term distributions at level 2 in cellular component.

 

Table6. The GO term distributions at level 2 in cellular component.

Category GO-id GO-term Count
cellular component GO:0005623
cell 283
cellular component GO:0043226
organelle 206
cellular component GO:0016020
membrane 136
cellular component GO:0032991
macromolecular complex 94
cellular component GO:0031974
membrane-enclosed lumen 54
cellular component GO:0005576
extracellular region 48
cellular component GO:0030054
cell junction 31
cellular component GO:0055044
symplast 31
cellular component GO:0031012
extracellular matrix 4

 

mf_lv2

Fig5. The pi chart of GO term distributions at level 2 in molecular function.

 

Table7. The GO term distributions at level 2 in molecular function.

Category GO-id GO-term Count
molecular function GO:0005488
binding 208
molecular function GO:0003824
catalytic activity 171
molecular function GO:0001071
nucleic acid binding transcription factor activity 31
molecular function GO:0005215
transporter activity 30
molecular function GO:0009055
electron carrier activity 14
molecular function GO:0030234
enzyme regulator activity 9
molecular function GO:0005198
structural molecule activity 6
molecular function GO:0004872
receptor activity 5
molecular function GO:0016209
antioxidant activity 5
molecular function GO:0060089
molecular transducer activity 4
molecular function GO:0045735
nutrient reservoir activity 4
molecular function GO:0016247
channel regulator activity 2
molecular function GO:0000988
protein binding transcription factor activity 1

 

bp_lv2

Fig6. The pi chart of GO term distributions at level 2 in biological process.

 

Table8. The GO term distributions at level 2 in biological process.

Category GO-id GO-term Count
biological process GO:0009987
cellular process 927
biological process GO:0008152
metabolic process 867
biological process GO:0050896
response to stimulus 726
biological process GO:0065007
biological regulation 570
biological process GO:0051179
localization 368
biological process GO:0032502
developmental process 331
biological process GO:0071840
cellular component organization or biogenesis 316
biological process GO:0032501
multicellular organismal process 312
biological process GO:0023052
signaling 271
biological process GO:0051704
multi-organism process 261
biological process GO:0002376
immune system process 163
biological process GO:0000003
reproduction 155
biological process GO:0040007
growth 118
biological process GO:0016265
death 114
biological process GO:0043473
pigmentation 37
biological process GO:0048511
rhythmic process 30
biological process GO:0008283
cell proliferation 13
biological process GO:0022610
biological adhesion 13
biological process GO:0015976
carbon utilization 8
biological process GO:0001906
cell killing 4

 

cc_lv2

Fig7. The pi chart of GO term distributions at level 2 in cellular component.

 

Table9. The GO term distributions at level 2 in cellular component.

Category GO-id GO-term Count
cellular component GO:0005623
cell 835
cellular component GO:0043226
organelle 568
cellular component GO:0016020
membrane 429
cellular component GO:0032991
macromolecular complex 208
cellular component GO:0031974
membrane-enclosed lumen 137
cellular component GO:0030054
cell junction 123
cellular component GO:0055044
symplast 122
cellular component GO:0005576
extracellular region 106
cellular component GO:0031012
extracellular matrix 7

 

mf_lv2

Fig8. The pi chart of GO term distributions at level 2 in molecular function.

 

Table10. The GO term distributions at level 2 in molecular function.

Category GO-id GO-term Count
molecular function GO:0005488
binding 603
molecular function GO:0003824
catalytic activity 576