6.2 Alternative Splicing

我们使用 rMATS, 其最大的优点是使用方便。用户只需要提供 mapping 好的 .bam 文件和基因组注释 .gtf 文件即可。不像某些软件,还需要提供特制的 index(比如 MISO,笔者鼓捣了好几天都没弄好。)

1) Pipeline

2) Data Structure

2a) getting software & data

  1. install software (already available in Docker)

    MATS

  2. data 1. 我们使用 PRJNA130865 中的两个样本:

    • SRR065545: C2C12 with control shRNA vector
    • SRR065544: C2C12 with shRNA against CUGBP1

      我们已经准备好 .bam 文件(仅包含 mapping 到 X 染色体上的部分),位于 Docker 中的 /home/test/alter-spl/input。读者也可以点击相应链接下载原始的 FASTQ 文件) 1. 我们从 Ensembl 下载了 Mus musculus 的基因组注释(GRCm38/mm10),位于 Docker 中的 /home/test/alter-spl

2b) input

Format Description Notes
.bam 将样本中的 Reads 比对到参考基因组 -
.gtf 参考基因组注释文件 -

output

Format Description Notes
many TSV all possible alternative splicing (AS) events derived from GTF and RNA-seq -

详细说明请参见 http://rnaseq-mats.sourceforge.net/rmats4.0.2/user_guide.htm#output

3) Running Steps

首先进入到容器(在自己电脑的 Terminal 中运行,详情请参见 这里):

docker exec -it bioinfo_tsinghua bash

以下步骤均在 /home/test/alter-spl/ 下进行:

cd /home/test/alter-spl/

3a) 检查 read length

samtools view input/SRR065544_chrX.bam | cut -f 10 | \
    perl -ne 'chomp;print length($_) . "\n"' | sort | uniq -c
1448805 35
samtools view input/SRR065545_chrX.bam | cut -f 10 | \
    perl -ne 'chomp;print length($_) . "\n"' | sort | uniq -c
1964089 35

也就是说 read length 均为 35

3b) 运行程序

echo "input/SRR065544_chrX.bam" > input/b1.txt
echo "input/SRR065545_chrX.bam" > input/b2.txt
python2 /usr/local/rMATS-turbo-Linux-UCS4/rmats.py \
    --b1 input/b1.txt --b2 input/b2.txt --gtf input/Mus_musculus_chrX.gtf --od output \
    -t paired --readLength 35

第二行指定输入和输出文件(夹)。
第三行是一些必需参数:

  • 这里我们的数据是 paired-end, 所以选择 -t paired
  • 根据第一步,我们指定 --readLength 35

3c) 检查输出

输出文件位于 output/ 中。

最重要的是以下两类文件:

  • AS_Event.MATS.JC.txt: evaluates splicing with only reads that span splicing junctions
  • AS_Event.MATS.JCEC.txt: evaluates splicing with reads that span splicing junctions and reads on target (striped regions on MATS home page figure)

其中,AS_Event 包含以下几种:

  1. A5SS: alternative 5' splice site
  2. A3SS: alternative 3' splice site
  3. SE: skipped exon
  4. MXE: mutually exclusive exons
  5. RI: retained intron

For example, A5SS.MATS.JC.txt includes alternative 5' splice site (A5SS) using only reads that span splicing junctions:

ID    GeneID    geneSymbol    chr    strand    longExonStart_0base    longExonEnd    shortES    shortEE    flankingES    flankingEE    ID    IJC_SAMPLE_1    SJC_SAMPLE_1    IJC_SAMPLE_2    SJC_SAMPLE_2    IncFormLen    SkipFormLen    PValue    FDR    IncLevel1    IncLevel2    IncLevelDifference
2    "ENSMUSG00000004221"    "Ikbkg"    chrX    +    74427761    74427902    74427761    74427874    74432804    74433006    2    1    0    0    17    61    34    9.36092704494e-05    0.000468046352247    1.0    0.0    1.0
20    "ENSMUSG00000025332"    "Kdm5c"    chrX    +    152271859    152271938    152271859    152271929    152272074    152272274    20    0    2    4    9    42    34    0.00591200967211    0.00985334945352    0.0    0.265    -0.265
63    "ENSMUSG00000031167"    "Rbm3"    chrX    -    8143246    8143332    8143250    8143332    8142367    8142955    63    26    140    51    145    37    34    0.0299275020479    0.0374093775599    0.146    0.244    -0.098
84    "ENSMUSG00000037369"    "Kdm6a"    chrX    +    18277375    18277563    18277375    18277546    18278625    18279936    84    4    5    0    5    50    34    0.00235425083108    0.0058856270777    0.352    0.0    0.352
124    "ENSMUSG00000025283"    "Sat1"    chrX    -    155215119    155215684    155215600    155215684    155214032    155214134    124    2    23    4    32    68    34    0.549749061659    0.549749061659    0.042    0.059    -0.017

其中最重要的列意义如下:

IncFormLen length of inclusion form, used for normalization
SkipFormLen length of skipping form, used for normalization
P-Value Significance of splicing difference between two sample groups. (Only available if statistical model is on)
FDR False Discovery Rate calculated from p-value. (Only available if statistical model is on)
IncLevel1 inclusion level for SAMPLE_1 replicates (comma separated) calculated from normalized counts
IncLevel2 inclusion level for SAMPLE_2 replicates (comma separated) calculated from normalized counts
IncLevelDifference average(IncLevel1) - average(IncLevel2)

4) Tips/Utilities

4a) read more

rMATS is introduced in "rMATS: Robust and Flexible Detection of Differential Alternative Splicing from Replicate RNA-Seq Data" in PNAS

读者可参考以下文献,探索其他的 alternative splicing 分析软件

或者使用 SAJR introduced in this paper in Nature Protocol

4b) generate .bam file

  1. install hisat, bamtools, samtools

    hisat 下载后解压到当前目录下,另外两个软件在 Docker 中已经装好

  2. get genome

  3. get raw data
  4. now your working directory looks like this

    .
    ├── chromFa.tar.gz
    ├── hisat2-2.1.0
    ├── Mus_musculus.GRCm38.93.gtf.gz
    ├── SRR065544_1.fastq.gz
    ├── SRR065544_2.fastq.gz
    ├── SRR065545_1.fastq.gz
    ├── SRR065545_2.fastq.gz
    
  5. make hisat index

    # extract X chromosome sequence
    tar -xz -f chromFa.tar.gz chrX.fa
    mv chrX.fa Mus_musculus_chrX.fa
    
    # use only X chromosome
    zcat Mus_musculus.GRCm38.93.gtf.gz | grep -P '(#!)|(X\t)' > Mus_musculus_chrX.gtf
    
    # make hisat index
    hisat2_extract_splice_sites.py Mus_musculus_chrX.gtf > Mus_musculus_chrX.ss
    hisat2_extract_exons.py        Mus_musculus_chrX.gtf > Mus_musculus_chrX.exon
    
    mkdir hisat2_indexes
    hisat2-2.1.0/hisat2-build -p 4 \
        --ss Mus_musculus_chrX.ss --exon Mus_musculus_chrX.exon \
        Mus_musculus_chrX.fa hisat2_indexes/Mus_musculus_chrX
    
  6. mapping

    # mapping
    hisat2-2.1.0/hisat2 -p 4 --dta \
        -S SRR065544_chrX.sam -x hisat2_indexes/Mus_musculus_chrX \
        -1 SRR065544_1.fastq.gz -2 SRR065544_2.fastq.gz
    hisat2-2.1.0/hisat2 -p 4 --dta \
        -S SRR065545_chrX.sam -x hisat2_indexes/Mus_musculus_chrX \
        -1 SRR065545_1.fastq.gz -2 SRR065545_2.fastq.gz
    
    # covert to .bam
    samtools sort [email protected] 4 -o SRR065544_chrX_raw.bam SRR065544_chrX.sam
    samtools sort [email protected] 4 -o SRR065545_chrX_raw.bam SRR065545_chrX.sam
    
    # filter only mapped reads
    bamtools index -in SRR065544_chrX_raw.bam
    bamtools index -in SRR065545_chrX_raw.bam
    
    bamtools filter -isMapped true -in SRR065544_chrX_raw.bam \
        -out SRR065544_chrX.bam
    bamtools filter -isMapped true -in SRR065545_chrX_raw.bam \
        -out SRR065545_chrX.bam
    

5) Homework and more

  1. 为了鉴定 CUGBP1 对 mRNA isoform 的调控,科学家在 C2C12 小鼠成肌细胞(myoblast)中分别表达空载体(SRR065546)和含有干扰 CUGBP1 的 shRNA 的载体(SRR065547)。请同学们至 这里 下载 .bam 输入文件(只含有 map 到 X 染色体的 reads),探索在 X 染色体上存在 differential alternative splicing 的基因。(需要上交代码和输出结果中所有以 .MATS.JCEC.txt 结尾的文件)
  2. (optional) 阅读 4a) read more 中给出的 rMATS 的文献(https://doi.org/10.1073/pnas.1419161111),简要阐述 rMATS 找到 differential alternative splicing 的原理(只需解释 Unpaired Replicates 的情形即可)。

results matching ""

    No results matching ""