First Chapter

1.1 Sequence Alignment - Blast

Blast Instruction

I. Basics


1. Understand common file formats of sequence

FASTA format.fasta or .fa )

>gi|47115317|emb|CAG28618.1| VIM [Homo sapiens] MSTRSVSSSSYRRMFGGPGTASRPSSSRSYVTTSTRTYSLGSALRPSTSRSLYASSPGGVYATRSSAVRL

The word following the ">" symbol is the identifier of the sequence, and the rest of the line is the description (optional). Normally, identifiers are simply protein accession, name or Entrez gi's (e.g., Q5I7T1, AG10B_HUMAN, 129295), but a bar-separated NCBI sequence identifier (e.g., gi|129295) will also be accepted. Any arbitrary user-specified sequence identifier can also be used (e.g., CLONE00073452).

2.Install blast and subroutines: blastn, blastpblastx, ...

Note: May have been installed, try to enter blastn to check if there were already installed blast

Method 1- Automatic installation:

Linux automatically installed: sudo apt-get install software-name

sudo apt-get install ncbi-blast+ #password: cs

Method 2 - Manual installation:

[Download link](ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/" \t "_blank)

down the following software:

a. 32-bit computer: ncbi-blast-2.2.28+-ia32-linux.tar.gz

b. 64-bit computer: ncbi-blast-2.2.28+-x64-linux.tar.gz

(You can check whether the machine type is 32 or 64-bit by command ‘uname –a’)

3. Read the usage of blast

blastn –h or blastn --help

4. Go to working directory

cd # go to the default home directory

cd directory # ‘tab’ can be automatically completed the directory or file name, recommend frequently used

The general procedures are followed by the input file location and name, inform the program input and output:

II. Align two sequences


Protein sequence alignment

present working directory: /home/cs/Bioinfo_Lab/1.Sequence/1.seq_align/blast/protein/

blastp -query ./VIM.fasta -subject ./NMD.fasta -out output.blastp

VIM.fasta and NMD.fasta are two subspecies enzymes of the metal beta enzyme family respectively.

DNA sequence alignment (blastn)

present working directory: /home/cs/Bioinfo_Lab/1.Sequence/1.seq_align/blast/dna

blastn -query ./H1N1-HA.fasta -subject ./H7N9-HA.fasta -out output.blastn

H1N1-HA.fasta and H7N9-HA.fasta

III. Align a sequence with remote database

Protein sequence alignment to PDB

present working directory: /home/cs/Bioinfo_Lab/1.Sequence/1.seq_align/blast/protein/

blastp -query ./VIM.fasta -db pdb -remote -out protein_remote.blastp

DNA sequence alignment to nr

present working directory: /home/cs/Bioinfo_Lab/1.Sequence/1.seq_align/blast/dna

blastn -query ./H1N1-HA.fasta -db nr -remote -out dna_remote.blastn

IV. Align a sequence with self-defined database

Search Yeast.fasta sequence in yeast genome

present working directory: /home/cs/Bioinfo_Lab/1.Sequence/1.seq_align/blast/

Step 1: Database construction

Clear the file remaining:

rm ./database/*

Construct database:

makeblastdb -dbtype nucl -in ./dna/YeastGenome.fa -out ./database/YeastGenome

  • -dbtype: the type of database(nucl, prot)
  • -in: sequence file
  • -out: name prefix of database

Step 2: Alignment

blastn -query ./dna/Yeast.fasta -db ./database/YeastGenome -out ./dna/Yeast.blastn

2.1 Sequence alignment - Bowtie

Blast Instruction

I. Basic


1. File format of raw reads

FASTQ Format (.fastq or .fq)

FASTQ format stores sequences and Phred qualities in a single file. It is concise and compact. FASTQ is first widely used in the Sanger Institute and therefore we usually take the Sanger specification and the standard FASTQ format, or simply FASTQ format. Although Solexa/Illumina read file looks pretty much like FASTQ, they are different in that the qualities are scaled differently. In the quality string, if you can see a character with its ASCII code higher than 90, probably your file is in the Solexa/Illumina format.

Example

@EAS54_6_R1_2_1_413_324

CCCTTCTTGTCTTCAGCGTTTCTCC

+

;;3;;;;;;;;;;;;7;;;;;;;88

2. Download the indexed genome files

Download link

3. More genome file format

See more at [Reference link](http://genome.ucsc.edu/FAQ/FAQformat.html" \t "_blank):[http://genome.ucsc.edu/FAQ/FAQformat.html](http://genome.ucsc.edu/FAQ/FAQformat.html" \t "_blank)

4. Install bowtie

Note: May have been installed, try to enter bowtie to check if there were already installed blast

If haven’t installed, click [Download link](http://sourceforge.net/projects/bowtie-bio/files/bowtie/1.0.0/" \t "_blank) to install.

x86_64: bowtie-1.0.0-linux-x86_64.zip
i386 or i686 (32): bowtie-1.0.0-linux-i386.zip

II. Mapping using Bowtie

yeast and e.coli genome:

present working directory: /home/cs/Bioinfo_Lab/1.Sequence/1.seq_align/bowtie/

commod:

bowtie -v 2 -m 10 --best --strata ./BowtieIndex/YeastGenome -f THA1.fa -S THA1.sam bowtie -v 1 -m 10 --best --strata ./bowtie-src/indexes/e_coli -q e_coli_1000_1.fq -S e_coli_1000_1.sam

-v report end-to-end hits with <=v mismatches; ignore qualities

-m suppress all alignments if > exist (def: no limit)

-M like -m, but reports 1 random hit (MAPQ=0); requires --best

--best hits guaranteed best stratum; ties broken by quality

--strata hits in sub-optimal strata aren't reported (requires --best)

-f raw reads file (.fasta)

-q raw reads file(.fastaq)

-S output file(.sam)

III. Format conversion

Transfer .sam to .bed file to facilitate other software processing(like IGB)

command:

perl sam2bed.pl THA1.sam > THA1.bed

when upload a .bed file to Genome Browser, if the file is too large or the MT chromosome can’t be recognized, you can use the following method:

grep -v chrmt THA1.bed > THA1_new.bed #output a file without chromosome MT

or

grep chrI THA1.bed > THA1_chrI.bed #output a file only consist chromosome I

bed format (separated by tab)

1.

chrom - The name of the chromosome (e.g. chr3, chrY, chr2_random) or scaffold (e.g. scaffold10671).

2.

chromStart - The starting position of the feature in the chromosome or scaffold. The first base in a chromosome is numbered 0.

3.

chromEnd - The ending position of the feature in the chromosome or scaffold. The chromEnd base is not included in the display of the feature. For example, the first 100 bases of a chromosome are defined as chromStart=0, chromEnd=100, and span the bases numbered 0-99. The additional optional BED fields are:

4.

name - Defines the name of the BED line. This label is displayed to the left of the BED line in the Genome Browser window when the track is open to full display mode or directly to the left of the item in pack mode.

5.

score - A score between 0 and 1000.

2.3 Task & Assignment

I. Task


1.Learn sequence file format (fasta & fastq)

FASTA format:http://en.wikipedia.org/wiki/FASTA_format

FASTQ format:http://en.wikipedia.org/wiki/FASTQ_format

2.Run blast and bowtie at the Terminal

3.Write a script file (run.sh) to auyomatically and continuously run a blast task and a bowtie task.

4.Convert the .sam file to .bed file format and browse on the UCSC Genome Browser.

Reference

samtools:[http://samtools.sourceforge.net/](http://samtools.sourceforge.net/" \t "_blank)

[sequence file format](http://www.genome.ucsc.edu/FAQ/FAQformat.html" \t "_blank):http://www.genome.ucsc.edu/FAQ/FAQformat.html

[UCSC Genome Browser](https://genome.ucsc.edu/cgi-bin/hgGateway" \t "_blank):[https://genome.ucsc.edu/cgi-bin/hgGateway](https://genome.ucsc.edu/cgi-bin/hgGateway" \t "_blank)

II. Assignment


  1. Explain and compare the fasta and fastq format, point out the differences.
  2. Explain the blast output results.
  3. Learn to browse your uploaded bed file on the UCSC Genome Browser and hand in the screenshot.

Note: when upload a .bed file to Genome Browser, if the file is too large or the MT chromosome can’t be recognized, you can use the following method:

grep -v chrmt THA1.bed > THA1_new.bed #output a file without chromosome MT

or

grep chrI THA1.bed > THA1_chrI.bed #output a file only consist chromosome I

results matching ""

    No results matching ""