6.2 Alternative Splicing
我们使用 rMATS, 其最大的优点是使用方便。用户只需要提供 mapping 好的 .bam
文件和基因组注释 .gtf
文件即可。不像某些软件,还需要提供特制的 index(比如 MISO,笔者鼓捣了好几天都没弄好。)
1) Pipeline
2) Data Structure
2a) getting software & data
install software (already available in Docker)
data 1. 我们使用 PRJNA130865 中的两个样本:
2b) input
Format | Description | Notes |
---|---|---|
.bam |
将样本中的 Reads 比对到参考基因组 | - |
.gtf |
参考基因组注释文件 | - |
output
Format | Description | Notes |
---|---|---|
many TSV | all possible alternative splicing (AS) events derived from GTF and RNA-seq | - |
详细说明请参见 http://rnaseq-mats.sourceforge.net/rmats4.0.2/user_guide.htm#output
3) Running Steps
首先进入到容器(在自己电脑的 Terminal 中运行,详情请参见 这里):
docker exec -it bioinfo_tsinghua bash
以下步骤均在 /home/test/alter-spl/
下进行:
cd /home/test/alter-spl/
3a) 检查 read length
samtools view input/SRR065544_chrX.bam | cut -f 10 | \
perl -ne 'chomp;print length($_) . "\n"' | sort | uniq -c
1448805 35
samtools view input/SRR065545_chrX.bam | cut -f 10 | \
perl -ne 'chomp;print length($_) . "\n"' | sort | uniq -c
1964089 35
也就是说 read length 均为 35
3b) 运行程序
echo "input/SRR065544_chrX.bam" > input/b1.txt
echo "input/SRR065545_chrX.bam" > input/b2.txt
python2 /usr/local/rMATS-turbo-Linux-UCS4/rmats.py \
--b1 input/b1.txt --b2 input/b2.txt --gtf input/Mus_musculus_chrX.gtf --od output \
-t paired --readLength 35
第二行指定输入和输出文件(夹)。
第三行是一些必需参数:
- 这里我们的数据是 paired-end, 所以选择
-t paired
- 根据第一步,我们指定
--readLength 35
3c) 检查输出
输出文件位于 output/
中。
最重要的是以下两类文件:
AS_Event.MATS.JC.txt
: evaluates splicing with only reads that span splicing junctionsAS_Event.MATS.JCEC.txt
: evaluates splicing with reads that span splicing junctions and reads on target (striped regions on MATS home page figure)
其中,AS_Event
包含以下几种:
A5SS
: alternative 5' splice siteA3SS
: alternative 3' splice siteSE
: skipped exonMXE
: mutually exclusive exonsRI
: retained intron
For example, A5SS.MATS.JC.txt
includes alternative 5' splice site (A5SS) using only reads that span splicing junctions:
ID GeneID geneSymbol chr strand longExonStart_0base longExonEnd shortES shortEE flankingES flankingEE ID IJC_SAMPLE_1 SJC_SAMPLE_1 IJC_SAMPLE_2 SJC_SAMPLE_2 IncFormLen SkipFormLen PValue FDR IncLevel1 IncLevel2 IncLevelDifference
2 "ENSMUSG00000004221" "Ikbkg" chrX + 74427761 74427902 74427761 74427874 74432804 74433006 2 1 0 0 17 61 34 9.36092704494e-05 0.000468046352247 1.0 0.0 1.0
20 "ENSMUSG00000025332" "Kdm5c" chrX + 152271859 152271938 152271859 152271929 152272074 152272274 20 0 2 4 9 42 34 0.00591200967211 0.00985334945352 0.0 0.265 -0.265
63 "ENSMUSG00000031167" "Rbm3" chrX - 8143246 8143332 8143250 8143332 8142367 8142955 63 26 140 51 145 37 34 0.0299275020479 0.0374093775599 0.146 0.244 -0.098
84 "ENSMUSG00000037369" "Kdm6a" chrX + 18277375 18277563 18277375 18277546 18278625 18279936 84 4 5 0 5 50 34 0.00235425083108 0.0058856270777 0.352 0.0 0.352
124 "ENSMUSG00000025283" "Sat1" chrX - 155215119 155215684 155215600 155215684 155214032 155214134 124 2 23 4 32 68 34 0.549749061659 0.549749061659 0.042 0.059 -0.017
其中最重要的列意义如下:
IncFormLen | length of inclusion form, used for normalization |
SkipFormLen | length of skipping form, used for normalization |
P-Value | Significance of splicing difference between two sample groups. (Only available if statistical model is on) |
FDR | False Discovery Rate calculated from p-value. (Only available if statistical model is on) |
IncLevel1 | inclusion level for SAMPLE_1 replicates (comma separated) calculated from normalized counts |
IncLevel2 | inclusion level for SAMPLE_2 replicates (comma separated) calculated from normalized counts |
IncLevelDifference | average(IncLevel1) - average(IncLevel2) |
4) Tips/Utilities
4a) read more
rMATS is introduced in "rMATS: Robust and Flexible Detection of Differential Alternative Splicing from Replicate RNA-Seq Data" in PNAS
读者可参考以下文献,探索其他的 alternative splicing 分析软件
- A survey of software for genome-wide discovery of differential splicing in RNA-Seq data
- A survey of computational methods in transcriptome-wide alternative splicing analysis
或者使用 SAJR introduced in this paper in Nature Protocol
4b) generate .bam
file
install hisat, bamtools, samtools
hisat 下载后解压到当前目录下,另外两个软件在 Docker 中已经装好
get genome
- get raw data
now your working directory looks like this
. ├── chromFa.tar.gz ├── hisat2-2.1.0 ├── Mus_musculus.GRCm38.93.gtf.gz ├── SRR065544_1.fastq.gz ├── SRR065544_2.fastq.gz ├── SRR065545_1.fastq.gz ├── SRR065545_2.fastq.gz
make hisat index
# extract X chromosome sequence tar -xz -f chromFa.tar.gz chrX.fa mv chrX.fa Mus_musculus_chrX.fa # use only X chromosome zcat Mus_musculus.GRCm38.93.gtf.gz | grep -P '(#!)|(X\t)' > Mus_musculus_chrX.gtf # make hisat index hisat2_extract_splice_sites.py Mus_musculus_chrX.gtf > Mus_musculus_chrX.ss hisat2_extract_exons.py Mus_musculus_chrX.gtf > Mus_musculus_chrX.exon mkdir hisat2_indexes hisat2-2.1.0/hisat2-build -p 4 \ --ss Mus_musculus_chrX.ss --exon Mus_musculus_chrX.exon \ Mus_musculus_chrX.fa hisat2_indexes/Mus_musculus_chrX
mapping
# mapping hisat2-2.1.0/hisat2 -p 4 --dta \ -S SRR065544_chrX.sam -x hisat2_indexes/Mus_musculus_chrX \ -1 SRR065544_1.fastq.gz -2 SRR065544_2.fastq.gz hisat2-2.1.0/hisat2 -p 4 --dta \ -S SRR065545_chrX.sam -x hisat2_indexes/Mus_musculus_chrX \ -1 SRR065545_1.fastq.gz -2 SRR065545_2.fastq.gz # covert to .bam samtools sort -@ 4 -o SRR065544_chrX_raw.bam SRR065544_chrX.sam samtools sort -@ 4 -o SRR065545_chrX_raw.bam SRR065545_chrX.sam # filter only mapped reads bamtools index -in SRR065544_chrX_raw.bam bamtools index -in SRR065545_chrX_raw.bam bamtools filter -isMapped true -in SRR065544_chrX_raw.bam \ -out SRR065544_chrX.bam bamtools filter -isMapped true -in SRR065545_chrX_raw.bam \ -out SRR065545_chrX.bam
5) Homework and more
- 为了鉴定 CUGBP1 对 mRNA isoform 的调控,科学家在 C2C12 小鼠成肌细胞(myoblast)中分别表达空载体(SRR065546)和含有干扰 CUGBP1 的 shRNA 的载体(SRR065547)。请同学们至 这里 下载
.bam
输入文件(只含有 map 到 X 染色体的 reads),探索在 X 染色体上存在 differential alternative splicing 的基因。(需要上交代码和输出结果中所有以.MATS.JCEC.txt
结尾的文件) - (optional) 阅读 4a) read more 中给出的 rMATS 的文献(https://doi.org/10.1073/pnas.1419161111),简要阐述 rMATS 找到 differential alternative splicing 的原理(只需解释 Unpaired Replicates 的情形即可)。