4 Pipeline 4-RNA-seq
A slurm based schema for RNA-seq analysis to execute on linux clusters.
The purpose of this project to develop a list of scripts whose input and output will be folders containing either fastq or bam files. This schema contains easily customizable bash scripts and directory structure.
This pipeline and workflow is based on Taito.csc server batch scripts. The objective of this documentation is to make execution faster and reproducible as much as possible. The project folder will contain these folders before starting
- scripts : contains all scripts to run in taito server
- OUT : contains output files defined by
#SBATCH -o
in all scripts
- ERROR : contains error files defined in
#SBATCH -e
in all scripts
- commands : contains actual history of commands
- rawReads : should contain sequencing reads generated by the sequencing machine. This folder name could be anything.
4.0.1 Dependency
4.0.2 Installation
4-RNA-seq is required to download for every new project.
4.0.2.1 Download
For each experiment 4-RNA-seq pipeline needs to be downloaded separately. Let us download it to a directory named “day10” with following commands
cd $WRKDIR
cd DONOTREMOVE
git clone https://github.com/vondoRishi/4-RNA-seq day10
### Check what is in day10
tree -C day10
From now on day10 is the project space
4.0.2.2 Prepare the workspace
Make a directory rawreads
inside day10
and copy old fastq(.gz) files from day7
exercises to there.
mkdir day10/rawReads
cp day7/RNAseq_exercise/rawReads/*fastq.gz day10/rawReads
4.1 How to run
4.1.1 Download public resources
4.1.1.1 rRNA fasta files
Download few fasta files from sortmerna website and combine them (this is just for demonstration).
## move to project directory
cd day10
mkdir rRNAFasta
wget -P rRNAFasta https://github.com/biocore/sortmerna/raw/master/rRNA_databases/silva-arc-16s-id95.fasta
wget -P rRNAFasta https://github.com/biocore/sortmerna/raw/master/rRNA_databases/silva-arc-23s-id98.fasta
## Combine all rRNA fasta files in a single file
cat rRNAFasta/silva-arc-* > rRNAFasta/combined_rRNA.fasta
### Copy the path of combined_rRNA.fasta
find $PWD/rRNAFasta/
### reuse the mouse genome and gtf files downloaded earlier. Their paths are available at star_index.sh
Before execution please define the project variables in the 4-rna-seq.config
file. These values will be used by different scripts of this pipeline. Edit this file with nano
## There should not be any space around "="
project_name="test_name"
sequence_type="single" ## single ## for paired-end need to change the scripts manually
### parameter used by HTSeq; is data strand specific ?
stranded="yes" ### yes|reverse|no
### sortmerna parameter
sortMeRNA_ref="$WRKDIR/DONOTREMOVE/Mouse_genome/rRNA_operon/Mouse_ribosomal_operon.fasta"
### STAR aligner parameter
maxReadLength=10 ## this parameter should be set after finishing the QC
### parameter used by STAR and HTSeq
genome_file="$WRKDIR/DONOTREMOVE/Mouse_genome/Mus_musculus_GRCm38.fa"
gene_annotation="$WRKDIR/DONOTREMOVE/Mouse_genome/Mus_musculus.GRCm38.79.gtf"
### parameter used by TopHat / Cufflinks / Cuffdiff;
library_type="fr-secondstrand" ## fr-secondstrand|fr-firststrand|fr-unstranded
4.1.2 Step 1-3. QC and Filtering
After running each ‘sbatch’ command always check the sizes of output files, zero file size is absolute mark of something went wrong.
4.1.2.1 Step 1. QA of raw reads
Start QC ( quality checking) with Fastqc and Multiqc. In earlier class we have executed fastqc and multiqc separately. However the scripts/fastqc.sh executes both Fastqc and Multiqc internally.
sbatch --mail-user ur_email_at_domain scripts/fastqc.sh rawReads ## Don't use "rawReads/"
Input : any directory with fastq or fastq.gz files, here rawReads
Output : Output generated in input directory, here rawReads/rawReads.html and other fastqc.html files
4.1.2.2 Step 2. Filtering rRNA
sbatch --mail-user ur_email_at_domain scripts/sortmerna.sh rawReads sortMeRna
Input: directory with fastq files
Output: sortMeRna, the folder contains many different types of file. Fastq/fq files starting with non_Rna will be used in downstream analysis. Files with .log will be used by multiqc to summarize. The “rRna” fastq/fq and “.sam” files should be (re)moved from sortMeRna before next step.
- After (re)moving “rRna*.fq" files, the rest of the .fq files could be compressed.
sbatch --mail-user ur_email_at_domain scripts/compress_fastq.sh sortMeRna
Input: directory with fastq files
Output: same as input directory
- Now summarize the presence of rRNA.
sbatch --mail-user ur_email_at_domain scripts/fastqc.sh sortMeRna
4.1.2.3 Step 3. Quality trimming and adapter removal
Needs to edit scripts/trimmo.sh
for additional paramenters of Trimmomatic program
sbatch --mail-user ur_email_at_domain scripts/trimmo.sh sortMeRna trimmomatic
Input : directory with fastq or fastq.gz files
Output : directory with trimmed reads in fastq format
sbatch --mail-user ur_email_at_domain scripts/fastqc.sh trimmomatic
4.1.3 Step 4. Alignment of reads
This script will
1) index the genome in a folder star-genome_ann_Indices
,
2) align the reads from individual samples
3) run multiqc.
Confirm the parameters in file 4-rna-seq.config
maxReadLength
to maximum read length
genome_file
to path to reference genome
gene_annotation
path to gtf file
sbatch --mail-user ur_email_at_domain scripts/star.sh trimmomatic starAlign
Input: folder which contains the filtered reads; ex. good or sortMeRna
Output: Directory with bam files and multiqc report.
4.1.4 Step 5. Read count
Confirm the parameters in file 4-rna-seq.config
stranded
depending upon the library type
gene_annotation
path to gtf file
sbatch --mail-user ur_email_at_domain scripts/star_htseq-count.sh starAlign star_count
Input: Directory with bam files
Output: Directory with count values, star_count/htseq_*txt and quality report at star_count.html
4.1.5 Final report
Till now we have generated multiqc reports for every command or folder. Now to summarize all in one place execute. Edit multiqc multiqc_config.yaml
file if require
sbatch --mail-user ur_email_at_domain scripts/multiqc_slurm.sh
4.2 EXTRA
4.2.1 Alignment read viewer
Need to sort (uncomment for tophat output bams) and index bam files.
sbatch --mail-user ur_email_at_domain scripts/samtools_index.sh bam_directory
4.2.2 Compressing fastq files
sbatch --mail-user ur_email_at_domain scripts/compress_fastq.sh fastq_Directory