LncRNA-Seq analysis


Demo Report


Figure 1. Experimental pipeline

Ribo-Zero kits remove ribosomal RNA (rRNA) using a hybridization/bead capture procedure in which biotinylated capture probes selectively bind rRNA species. The probe:rRNA hybrids are then captured by magnetic beads and removed with a magnet, leaving the desired rRNA-depleted RNA in solution. After rRNA removal, fragmentation mix is added to fragment the RNA. First-strand cDNA is synthesized using these RNA fragments as reverse transcription (RT) templates. Second-strand cDNA is synthesized using dNTPs (with dUTP replacing dTTP), buffer, RNase H and DNA polymerase I. The cDNA templates are purified with a Qiagen kit, followed by end repair, A-tailing and adaptor ligation. The samples are then treated with USER™ (Uracil-Specific Excision Reagent) Enzyme to digest the dUTP-containing second strand, followed by PCR amplification [Wang et al., 2011]. Finally, the library is sequenced on an Illumina HiSeq™ 2500.

 


Figure 2. Bioinformatics pipeline

Table 1. Sequencing summary
Metric                      Biological Sample-1  Biological Sample-2  Biological Sample-3  Biological Sample-4  Biological Sample-5
Total Illumina Reads                 49,704,530           49,704,530           54,456,610           54,664,062           45,046,426
Read Length (bp)                            136                  136                  136                  136                  136
Total Bases                       6,759,816,080        6,759,816,080        7,406,098,960        7,434,312,432        6,126,313,936
Cleaned Reads                        45,456,987           46,522,154           50,614,925           51,975,839           42,676,684
Cleaned Read Length (bp)                  118.7                120.6                120.8                123.4                122.5
Cleaned Bases                     5,395,744,356        5,610,571,772        6,114,282,940        6,413,818,532        5,227,893,790

To check read quality after sequencing, nucleotides with low quality scores (< 13) are first trimmed from the sequence reads. The first step in the trimming process is to convert each quality score (Q) to an error probability, which on the Phred scale is 10^(-Q/10). Next, a new value is calculated for every base:


0.05 - (error probability)

This value is negative for low-quality bases, where the error probability is high. For every base, we calculate the running sum of this value. If the sum drops below zero, it is reset to zero. The part of the sequence retained after trimming is the region between the first positive value of the running sum and the highest value of the running sum. Everything before and after this region is trimmed off. A read is removed completely if the running sum never rises above zero. In addition, a read is discarded if its length after trimming is shorter than 35 bp.
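The trimming rule above can be sketched in a few lines of Python. This is a minimal illustration that follows the description literally, not the code used to produce this report: the function names (phred_to_error, trim_read) and the example quality scores are hypothetical, and quality scores are assumed to be on the Phred scale.

```python
def phred_to_error(q):
    """Convert a Phred quality score Q to an error probability."""
    return 10 ** (-q / 10)


def trim_read(quals, limit=0.05, min_length=35):
    """Return the retained region as a half-open (start, end) index pair,
    or None if the read is discarded.

    Literal reading of the description: start is the first base at which
    the running sum is positive, end is the base at which the running sum
    reaches its highest value.
    """
    running = 0.0
    start = None   # index of the first positive running sum
    best = 0.0     # highest running sum seen so far
    end = None     # index at which the highest running sum was reached
    for i, q in enumerate(quals):
        # Per-base value is (limit - error probability); sums below zero reset to zero.
        running = max(0.0, running + (limit - phred_to_error(q)))
        if running > 0.0:
            if start is None:
                start = i
            if running > best:
                best, end = running, i
    if start is None:
        return None                      # running sum never rose above zero
    if end - start + 1 < min_length:
        return None                      # retained region shorter than 35 bp
    return start, end + 1


# Made-up example: 5 poor bases, 40 good bases, 5 poor bases.
quals = [2] * 5 + [30] * 40 + [2] * 5
region = trim_read(quals)
print(region)
```

In this made-up read, the five low-quality bases at each end are trimmed and the 40 high-quality bases in the middle are kept, so the read passes the 35 bp length filter.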

 

After trimming, the cleaned sequence reads were mapped to the human genome and quantified using the publicly available packages STAR (release 2.3.0) (https://code.google.com/p/rna-star/) and Cufflinks (release 2.1.1) (http://cufflinks.cbcb.umd.edu). STAR aligns reads across known and novel splice junctions, and Cufflinks uses annotation files to determine which aligned sequences map to known lncRNAs while taking transcript isoform diversity (alternative splicing) into account. The lncRNA annotation file was downloaded from the GENCODE Project (http://www.gencodegenes.org/). With the lncRNA annotation file, Cufflinks calculates overall lncRNA expression in FPKM (fragments per kilobase of exon per million mapped reads). After quantifying lncRNA gene expression, Cufflinks tests for significant changes in lncRNA expression between samples.

lncRNA gene expression is reported in the Excel file gene_expression.xlsx, and lncRNA isoform expression in isoform_expression.xlsx. Differential expression results are reported in differential_expression(gene).xlsx for lncRNA genes and in differential_expression(isoform).xlsx for lncRNA isoforms. Genes with a significant change in expression (q-value < 0.05) are listed in significant-differentail-expression-gene.xlsx.
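As an illustration of the q-value < 0.05 filter, the sketch below selects significant genes from a small tab-separated table. The column names (gene_id, sample_1, sample_2, value_1, value_2, p_value, q_value) are borrowed from Cuffdiff's gene_exp.diff output and are an assumption here; the Excel files delivered with this report may use different headers. The gene names and numbers are made up, and significant_genes is a hypothetical helper.

```python
import csv
import io

# Made-up rows using a subset of the gene_exp.diff column names (assumed layout).
EXAMPLE_TSV = """gene_id\tsample_1\tsample_2\tvalue_1\tvalue_2\tp_value\tq_value
LNC_A\tS1\tS2\t1.2\t9.8\t0.0001\t0.004
LNC_B\tS1\tS2\t3.0\t3.1\t0.8000\t0.950
LNC_C\tS1\tS2\t7.5\t0.9\t0.0020\t0.049
LNC_D\tS1\tS2\t2.2\t4.0\t0.0300\t0.050
"""


def significant_genes(handle, q_cutoff=0.05):
    """Return gene IDs whose q-value is strictly below the cutoff."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [row["gene_id"] for row in reader if float(row["q_value"]) < q_cutoff]


significant = significant_genes(io.StringIO(EXAMPLE_TSV))
print(significant)
```

The comparison is strict, so a gene with a q-value of exactly 0.05 (LNC_D in this made-up table) is not reported as significant.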

FPKM = total fragments / (mapped reads (million) × exon length (KB))

 

Total fragments is the number of paired-end reads mapped to a region in which an exon is annotated for the gene, or across the boundary between two exons or between an intron and an exon of an annotated transcript of the gene. Mapped reads (million) is the total number of mapped reads, in millions. Exon length (KB) is the sum of the lengths of all exons annotated for the gene, divided by 1000. Each exon is included only once in this sum, even if it is present in more than one annotated transcript of the gene. When counting mapped reads to generate expression values, we also need to decide how to handle paired reads. The standard behavior is as follows: if two reads map as a pair, the pair is counted as one fragment; if the pair is broken, neither read is counted. The reasoning is that a broken pair indicates a problem: the transcripts may not be represented correctly in the reference, or there may be errors in the data. In general, more confidence is placed in an intact pair.
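The calculation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the Cufflinks implementation: the helper names (count_fragments, exon_length_kb, fpkm) are hypothetical, exon coordinates are assumed to be 1-based and inclusive, only identical exons shared between transcripts are merged, and all numbers are made up.

```python
def count_fragments(pairs):
    """Count intact pairs; each pair is (mate1_mapped, mate2_mapped).

    An intact pair counts as one fragment, a broken pair counts as zero.
    """
    return sum(1 for mate1, mate2 in pairs if mate1 and mate2)


def exon_length_kb(transcripts):
    """Sum the lengths of all annotated exons, counting each exon once.

    transcripts maps a transcript ID to a list of (start, end) exons.
    """
    unique_exons = {exon for exons in transcripts.values() for exon in exons}
    return sum(end - start + 1 for start, end in unique_exons) / 1000


def fpkm(total_fragments, mapped_reads, exon_kb):
    """FPKM = total fragments / (mapped reads in millions * exon length in KB)."""
    return total_fragments / ((mapped_reads / 1e6) * exon_kb)


# Made-up gene with two transcripts that share their first exon.
transcripts = {
    "T1": [(100, 1099), (2000, 2499)],
    "T2": [(100, 1099), (3000, 3499)],
}
# 500 intact pairs plus 20 broken pairs that are ignored.
pairs = [(True, True)] * 500 + [(True, False)] * 20

fragments = count_fragments(pairs)
length_kb = exon_length_kb(transcripts)
value = fpkm(fragments, mapped_reads=50_000_000, exon_kb=length_kb)
print(fragments, length_kb, value)
```

The shared exon contributes its 1,000 bp only once, giving an exon length of 2 KB, so 500 fragments among 50 million mapped reads correspond to an FPKM of 5.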

 

Long non-coding RNA genes annotated in the GENCODE Project: 13,220

Category                    Biological Sample-1  Biological Sample-2  Biological Sample-3  Biological Sample-4  Biological Sample-5
lncRNA with expression                    5,711                6,171                6,229                5,552                5,163
lncRNA without expression                 7,509                7,049                6,991                7,668                8,057

 

Figure 3. Quantification of lncRNA.

 

Figure 4. Heatmap of lncRNA expression.

 

 

  1. Dobin A, et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics 2012. doi: 10.1093/bioinformatics/bts635
  2. Trapnell C, et al. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nature Biotechnology 2010. doi: 10.1038/nbt.1621
  3. Harrow J, et al. GENCODE: the reference human genome annotation for The ENCODE Project. Genome Research 2012;22(9):1760-74. PMID: 22955987; PMCID: PMC3431492; doi: 10.1101/gr.135350.111
  • analysis/gene_expression.xlsx
  • analysis/isoform_expression.xlsx
  • analysis/differential_expression(gene).xlsx
  • analysis/differential_expression(isoform).xlsx
  • analysis/significant-differentail-expression-gene.xlsx