4 Pipeline 4-RNA-seq

A slurm based schema for RNA-seq analysis to execute on linux clusters.

The purpose of this project to develop a list of scripts whose input and output will be folders containing either fastq or bam files. This schema contains easily customizable bash scripts and directory structure.

This pipeline and workflow is based on Taito.csc server batch scripts. The objective of this documentation is to make execution faster and reproducible as much as possible. The project folder will contain these folders before starting

  • scripts : contains all scripts to run in taito server
  • OUT : contains output files defined by #SBATCH -o in all scripts
  • ERROR : contains error files defined in #SBATCH -e in all scripts
  • commands : contains actual history of commands
  • rawReads : should contain sequencing reads generated by the sequencing machine. This folder name could be anything.

4.0.1 Dependency

  • Multiqc ( run almost after all the commands) { installation guide}. This has been already done in previous class.

4.0.2 Installation

4-RNA-seq is required to download for every new project.

4.0.2.1 Download

For each experiment 4-RNA-seq pipeline needs to be downloaded separately. Let us download it to a directory named “day10” with following commands

cd $WRKDIR
cd DONOTREMOVE

git clone https://github.com/vondoRishi/4-RNA-seq day10

### Check what is in day10
tree -C day10

From now on day10 is the project space

4.0.2.2 Prepare the workspace

Make a directory rawreads inside day10 and copy old fastq(.gz) files from day7 exercises to there.

mkdir day10/rawReads
cp day7/RNAseq_exercise/rawReads/*fastq.gz day10/rawReads

4.1 How to run

4.1.1 Download public resources

4.1.1.1 rRNA fasta files

Download few fasta files from sortmerna website and combine them (this is just for demonstration).

## move to project directory
cd day10

mkdir rRNAFasta

wget -P rRNAFasta https://github.com/biocore/sortmerna/raw/master/rRNA_databases/silva-arc-16s-id95.fasta
wget -P rRNAFasta https://github.com/biocore/sortmerna/raw/master/rRNA_databases/silva-arc-23s-id98.fasta

## Combine all rRNA fasta files in a single file
cat rRNAFasta/silva-arc-* > rRNAFasta/combined_rRNA.fasta
### Copy the path of combined_rRNA.fasta 
find $PWD/rRNAFasta/

### reuse the mouse genome and gtf files downloaded earlier. Their paths are available at star_index.sh 

Before execution please define the project variables in the 4-rna-seq.config file. These values will be used by different scripts of this pipeline. Edit this file with nano

## There should not be any space around "="
project_name="test_name"
sequence_type="single" ## single ## for paired-end need to change the scripts manually

### parameter used by HTSeq; is data strand specific ? 
stranded="yes" ### yes|reverse|no

### sortmerna parameter
sortMeRNA_ref="$WRKDIR/DONOTREMOVE/Mouse_genome/rRNA_operon/Mouse_ribosomal_operon.fasta"

### STAR aligner parameter
maxReadLength=10 ## this parameter should be set after finishing the QC

### parameter used by STAR and HTSeq
genome_file="$WRKDIR/DONOTREMOVE/Mouse_genome/Mus_musculus_GRCm38.fa"
gene_annotation="$WRKDIR/DONOTREMOVE/Mouse_genome/Mus_musculus.GRCm38.79.gtf"

### parameter used by TopHat / Cufflinks / Cuffdiff; 
library_type="fr-secondstrand" ## fr-secondstrand|fr-firststrand|fr-unstranded 

4.1.2 Step 1-3. QC and Filtering

After running each ‘sbatch’ command always check the sizes of output files, zero file size is absolute mark of something went wrong.

4.1.2.1 Step 1. QA of raw reads

Start QC ( quality checking) with Fastqc and Multiqc. In earlier class we have executed fastqc and multiqc separately. However the scripts/fastqc.sh executes both Fastqc and Multiqc internally.

sbatch --mail-user ur_email_at_domain scripts/fastqc.sh rawReads  ## Don't use "rawReads/" 

Input : any directory with fastq or fastq.gz files, here rawReads
Output : Output generated in input directory, here rawReads/rawReads.html and other fastqc.html files

4.1.2.2 Step 2. Filtering rRNA

sbatch --mail-user ur_email_at_domain scripts/sortmerna.sh rawReads sortMeRna   

Input: directory with fastq files
Output: sortMeRna, the folder contains many different types of file. Fastq/fq files starting with non_Rna will be used in downstream analysis. Files with .log will be used by multiqc to summarize. The “rRna” fastq/fq and “.sam” files should be (re)moved from sortMeRna before next step.

  • After (re)moving “rRna*.fq" files, the rest of the .fq files could be compressed.
sbatch --mail-user ur_email_at_domain scripts/compress_fastq.sh sortMeRna  

Input: directory with fastq files
Output: same as input directory

  • Now summarize the presence of rRNA.
sbatch  --mail-user ur_email_at_domain scripts/fastqc.sh sortMeRna  

4.1.2.3 Step 3. Quality trimming and adapter removal

Needs to edit scripts/trimmo.sh for additional paramenters of Trimmomatic program

sbatch  --mail-user ur_email_at_domain scripts/trimmo.sh sortMeRna trimmomatic  

Input : directory with fastq or fastq.gz files
Output : directory with trimmed reads in fastq format

sbatch  --mail-user ur_email_at_domain scripts/fastqc.sh trimmomatic  

4.1.3 Step 4. Alignment of reads

This script will
1) index the genome in a folder star-genome_ann_Indices,
2) align the reads from individual samples
3) run multiqc.

Confirm the parameters in file 4-rna-seq.config

  • maxReadLength to maximum read length
  • genome_file to path to reference genome
  • gene_annotation path to gtf file
sbatch  --mail-user ur_email_at_domain scripts/star.sh trimmomatic starAlign   

Input: folder which contains the filtered reads; ex. good or sortMeRna
Output: Directory with bam files and multiqc report.

4.1.4 Step 5. Read count

Confirm the parameters in file 4-rna-seq.config

  • stranded depending upon the library type
  • gene_annotation path to gtf file
sbatch  --mail-user ur_email_at_domain scripts/star_htseq-count.sh starAlign star_count   

Input: Directory with bam files
Output: Directory with count values, star_count/htseq_*txt and quality report at star_count.html

4.1.5 Final report

Till now we have generated multiqc reports for every command or folder. Now to summarize all in one place execute. Edit multiqc multiqc_config.yaml file if require

sbatch  --mail-user ur_email_at_domain scripts/multiqc_slurm.sh

4.2 EXTRA

4.2.1 Alignment read viewer

Need to sort (uncomment for tophat output bams) and index bam files.

sbatch  --mail-user ur_email_at_domain scripts/samtools_index.sh bam_directory

4.2.2 Compressing fastq files

sbatch  --mail-user ur_email_at_domain scripts/compress_fastq.sh fastq_Directory