DNA-Seq advanced analysis

Demo Report


Fig1. Experiment pipeline of DNA sequencing

Genomic DNA was fragmented by sonication, and fragments of the suitable size were excised. After purification with a Qiagen kit, the fragmented DNA was end-repaired and A-tailed using the polymerase activity of the Klenow fragment. Indexed adapters were then ligated to the DNA fragments by DNA ligase, followed by 10 to 18 cycles of PCR to enrich the adapter-modified fragments. After the libraries were validated by qPCR, Experion and Qubit, they were sequenced on an Illumina HiSeq™ 2500.


Fig2. Bioinformatics analysis pipeline

Table1. Sequencing Summary

Biological Sample
Data required (bp)
Total reads
Read length (bp)
Total bases (bp)
Number of reads after trimming
Average read length after trimming (bp)
Total bases after trimming (bp)
Insert size (bp)
The first step in the trimming process was converting the quality score (Q) to an error probability, p_error = 10^(−Q/10). Next, for every base a new value was calculated:

    limit − p_error

where limit is the quality cutoff. This value was negative for low-quality bases, where the error probability was high. For every base, we calculated the running sum of this value. If the sum dropped below zero, it was reset to zero. The part of the sequence between the first positive value of the running sum and the highest value of the running sum was retained; everything before and after this region was trimmed off. In addition, reads shorter than 35 bp after trimming were discarded.
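The running-sum trim described above can be sketched in Python as follows. The `limit` cutoff (0.05 here) and the function name are illustrative assumptions; the report does not state the cutoff value used.

```python
def quality_trim(quals, limit=0.05, min_len=35):
    """Trim a read by the running-sum method described above.

    quals: list of Phred quality scores (Q), one per base.
    limit: error-probability cutoff (assumed default; not stated
           in this report).
    Returns the (start, end) slice of the retained region, or None
    if the trimmed read would be shorter than min_len.
    """
    running = 0.0
    best_sum = 0.0
    start = end = None
    region_start = 0
    for i, q in enumerate(quals):
        p_error = 10 ** (-q / 10)      # convert Q to error probability
        running += limit - p_error     # negative for low-quality bases
        if running <= 0:
            running = 0.0              # sum dropped below zero: reset
            region_start = i + 1       # a new candidate region starts here
        elif running > best_sum:       # new maximum of the running sum
            best_sum = running
            start, end = region_start, i + 1
    if start is None or end - start < min_len:
        return None                    # discard reads shorter than 35 bp
    return start, end
```

For example, a read with 5 low-quality bases on each side of 50 high-quality bases keeps only the central region.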


Table2. de novo assembly statistics

Biological Sample
Max contig length
Min contig length
Average contig length
Total contig length
Number of contigs
Number of contigs >= 300 bp
Total bases in contigs >= 300 bp
Percentage of contig length in contigs >= 300 bp
Number of contigs >= 1 kb
Total bases in contigs >= 1 kb
Percentage of contig length in contigs >= 1 kb
Number of contigs >= 10 kb
Total bases in contigs >= 10 kb
Percentage of contig length in contigs >= 10 kb
Number of contigs >= 100 kb
Total bases in contigs >= 100 kb
Percentage of contig length in contigs >= 100 kb
Number of 'N's
Number of 'N's in contigs >= 300 bp
'N' percentage in contigs >= 300 bp
Number of 'N's in contigs >= 1 kb
'N' percentage in contigs >= 1 kb
Number of 'N's in contigs >= 10 kb
'N' percentage in contigs >= 10 kb
Number of 'N's in contigs >= 100 kb
'N' percentage in contigs >= 100 kb
GC content

To complete the de novo assembly, we used Velvet [Zerbino et al., 2008], which uses de Bruijn graphs as its assembly algorithm. The basic idea of the algorithm is to build a table of all subsequences of a certain length (called words, or k-mers) found in the reads. Given a word in the table, all of its potential neighboring words can be looked up; typically, only one backward neighbor and one forward neighbor are present in the table. A graph can then be made in which each node is a word present in the table and edges connect nodes that are neighbors. This is called a de Bruijn graph. In this way, contig sequences were created using all the information contained in the read sequences. All contig sequences are in FASTA format (Biological Sample_assembly.fa). The description line of the FASTA format looks like “>NODE_1_length_$number1_cov_$number2”. Note, however, that node lengths (length_$number1) are given in k-mers; to obtain the length in nucleotides of a node, add k − 1, where k is the word length used in velveth. Here, we set k = 97. The coverage (cov_$number2) is likewise given in k-mer coverage.
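The header arithmetic can be illustrated with a short sketch (the function name is ours; the header layout is the one quoted above):

```python
def node_length_nt(header, k=97):
    """Convert a Velvet node length (in k-mers) to nucleotides.

    Velvet reports length_$number1 in k-mers; the nucleotide length
    is that value plus k - 1, where k is the word length passed to
    velveth (k = 97 in this report).
    """
    # header looks like ">NODE_1_length_245_cov_12.3"
    fields = header.lstrip(">").split("_")
    kmer_len = int(fields[3])          # the length_$number1 field
    return kmer_len + k - 1
```

For example, a node of length 245 k-mers at k = 97 corresponds to 341 nucleotides.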


Fig3. Insert size distribution of paired-end reads

Table3. Summary of gene prediction

Number of genes
Gene prediction with GeneMark.hmm heuristic models

We then used the assembled contigs for gene finding with GeneMark.hmm using the heuristic approach [Besemer J., 1999]. The heuristic algorithm builds an inhomogeneous Markov model of protein-coding regions for an unknown sequence based on the observed relationships between the positional nucleotide frequencies and the global nucleotide frequencies in 17 complete bacterial genomes. This heuristic approach can be used for sequences as short as 400 bp, making it especially useful for analyzing small genomes, such as viruses and phages, which do not contain enough genes to build accurate models of higher order. After creating a model from the sequence data, GeneMark.hmm uses that model to analyze the sequence. In the resulting file of ORF sequences (Biological Sample_gene_prediction.txt), every description line shows (1) the contig name, (2) the serial number of the ORF along the contig, (3) the ORF length, (4) the strand of the ORF and (5) the beginning and ending positions of the ORF, in that order. The result was then parsed into two FASTA files (Biological Sample_gene_predict_nucleotide.fa & Biological Sample_gene_predict_protein.fa) containing the nucleotide and amino acid sequences, respectively.
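As an illustration, the five fields of a description line might be parsed as below. The whitespace-separated layout is an assumption, so check the actual delimiter used in Biological Sample_gene_prediction.txt.

```python
from typing import NamedTuple

class Orf(NamedTuple):
    contig: str   # (1) contig name
    serial: int   # (2) serial number of the ORF along the contig
    length: int   # (3) ORF length
    strand: str   # (4) strand of the ORF
    start: int    # (5) beginning position
    end: int      # (5) ending position

def parse_orf_line(line):
    """Split one ORF description line into the five fields listed above.

    Assumes whitespace-separated fields in the stated order; the real
    file may use a different delimiter.
    """
    contig, serial, length, strand, start, end = line.lstrip(">").split()
    return Orf(contig, int(serial), int(length), strand, int(start), int(end))
```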

Table4. Summary of gene annotation

Number of genes with hit
Predicted genes with hit in NCBI nr database

We annotated the predicted protein sequences with a BLAST (blastp) search against the NCBI nr protein database (released February 2013) with an E-value cutoff of 0.00001. The results are shown in the Excel file (BLAST_GO_KEGG.xlsx).
The nr protein database, maintained by NCBI, is a collection of sequences from SwissProt, PIR, PDB and PRF; entries with absolutely identical sequences have been merged.

Three fields in the Excel file deserve explanation: Evalue, Hsp_score and Percentage_of_identity. Evalue (expect value) indicates the validity of the match: the smaller the E-value, i.e. the closer it is to 0, the more likely the match is "good" and represents real similarity rather than a chance match. Hsp_score is the score of the high-scoring segment pair (HSP), the locally optimal alignment. Percentage_of_identity is the percentage of identity between the query sequence and the reference sequence.
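As a small worked example of applying these fields, the sketch below keeps hits passing the E-value threshold. The column layout assumes blastp's tabular output (-outfmt 6), and the identity cutoff is illustrative; only the 1e-5 E-value cutoff comes from this report.

```python
def filter_hits(lines, max_evalue=1e-5, min_identity=30.0):
    """Keep BLAST hits passing the E-value and identity cutoffs.

    Assumes blastp tabular output (-outfmt 6), where column 3 is
    percent identity and column 11 is the E-value. The identity
    cutoff is illustrative; the 1e-5 E-value cutoff is the one
    used in this report.
    """
    kept = []
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        pident = float(cols[2])    # Percentage_of_identity
        evalue = float(cols[10])   # Evalue
        if evalue <= max_evalue and pident >= min_identity:
            kept.append(cols[0])   # query (ORF) identifier
    return kept
```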

To perform the functional annotation of the predicted genes, the hits’ accession numbers were mapped to GO terms by querying the GO database. These candidate terms were filtered with a fixed rule: the GO term that was most specific and had no conflicts among its parents in the GO tree was chosen as the ORF’s term. GO term distributions at level 2 for each GO type were calculated and saved as tables in an Excel file. Each row in a table shows the number of sequences at one node of level 2. Detailed annotations for individual ORF sequences can also be found in the same Excel file (BLAST_GO_KEGG.xlsx).
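The level-2 tabulation can be sketched as follows; the (go_type, go_id, term) tuples are a simplified stand-in for the real GO mapping, which requires the GO tree itself.

```python
from collections import Counter

def level2_distribution(seq_terms):
    """Tabulate per-term sequence counts at GO level 2.

    seq_terms: iterable of (go_type, go_id, term_name) tuples, one
    per annotated sequence (a simplified stand-in for the real GO
    mapping). Returns rows of (go_type, go_id, term_name, count)
    sorted by descending count, matching the layout of Tables 5-7.
    """
    counts = Counter(seq_terms)
    return sorted(
        ((t, gid, name, n) for (t, gid, name), n in counts.items()),
        key=lambda row: -row[3],
    )
```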




Fig4. The pie chart of GO term distributions at level 2 in biological process.


Table5. The GO term distributions at level 2 in biological process.

biological process  GO:0008152  metabolic process  686
biological process  GO:0009987  cellular process  635
biological process  GO:0071840  cellular component organization or biogenesis  95
biological process  GO:0065007  biological regulation  92
biological process  GO:0051179  localization  86
biological process  GO:0050896  response to stimulus  78
biological process  GO:0023052  signaling  27
biological process  GO:0032502  developmental process  12
biological process  GO:0015976  carbon utilization  6
biological process  GO:0040011  locomotion  3
biological process  GO:0051704  multi-organism process  3




Fig5. The pie chart of GO term distributions at level 2 in cellular component.


Table6. The GO term distributions at level 2 in cellular component.

cellular component  GO:0005623  cell  291
cellular component  GO:0032991  macromolecular complex  139
cellular component  GO:0016020  membrane  105
cellular component  GO:0043226  organelle  97
cellular component  GO:0031974  membrane-enclosed lumen  21




Fig6. The pie chart of GO term distributions at level 2 in molecular function.


Table7. The GO term distributions at level 2 in molecular function.

molecular function  GO:0003824  catalytic activity  694