Complete RNAseq alignment guide – from fastq to count table

Even with the advent of single-cell sequencing, bulk RNAseq remains a powerful (and relatively inexpensive) tool for comparing the transcriptional profiles of tissues, cultures, and sorted cells. Here I will show you how to analyze the data starting with fastq. Some familiarity with BASH/Linux command line is required.

System requirements:
Unix-based operating system
Star is memory hungry so you should have at a minimum 32 gb of ram on whatever machine you are using, but more is better
Technically only 1 CPU thread is needed, but more threads will reduce the time to run considerably

Step 1: making a STAR genome index

STAR is a powerful aligner used in many RNA alignment pipelines. STAR requires only two things to run: 1) a genome index and 2) your fastq files. To generate the index we need a genome fasta file and a genome annotation file.

Step 1.a Installing STAR

There are multiple ways to install STAR, but by far the easiest way to install it is through Conda. Conda is relatively straightforward so I will not go in-depth here, but I do have a video that goes over the basic installation:

If you have a Conda environment, STAR instillation is very simple:

conda install -c bioconda star

Step 1.b. Download genome fasta and annotation file from Ensembl

Ensembl is a great resource for genome information on many different species. I will show you how to retrieve the files from their site. However, you can use a different source for your fasta and GTF files, such as UCSF or NCBI.

Ensembl uses their own gene identifiers–we will convert from Ensembl ids to gene symbols during the differential expression step.

Human genome information can be found at: https://ensembl.org/Homo_sapiens/Info/Index. I will be using human, but select the organism for which the cells or culture of your experiment was derived.

Follow the above links to the Ensemble FTP site where you will see a list of different hyperlinks.

Fasta: there are many choices. You want the soft masked primary assembly. It should be named Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa.gz

You can download to your computer directly by clicking the file name, or you can right click the link and copy link address to use a wget on your remote server. If your browser computer is different from the computer you will run STAR on, I recommend the latter to save time. For example:

wget http://ftp.ensembl.org/pub/release-105/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa.gz

For the GTF file you want Homo_sapiens.GRCh38.105.gtf.gz

After downloading the files to your working directory, unzip the files with gunzip

#if they are the only .gz files in the directory
gunzip *.gz 

#or
gunzip Homo_sapiens.GRCh38.105.gtf.gz
gunzip Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa.gz

Optional GTF filtering step

I will not go over this during this video because it is not required, but depending on your project you can filter the annotations in the GTF to better suit your goals. For example, you can get rid of the majority of the annotations that do not produce RNA. Or, you could filter out everything except coding genes if you don’t care about non-coding RNAs.

Step 1.c. generating index with STAR

Now that you have the required files, you can use STAR geneomeGenerate to create the index

mkdir HS_STAR_dir #make a directory to put the index in

STAR --runMode genomeGenerate \
--genomeDir HS_STAR_dir/ \
--genomeFastaFiles Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa \
--sjdbGTFfile Homo_sapiens.GRCh38.105.gtf \
--runThreadN <number of threads to use>

Depending on how many threads you use, the above step can take anywhere from ~30 minutes to several hours.

Step 2: Aligning reads with STAR

Now that you have the genome index, aligning your reads is pretty straight forward:

Running STAR against one unpaired fastq would look like this:

STAR --runMode alignReads \
--genomeDir HS_STAR_dir/ \
--outSAMtype BAM SortedByCoordinate \
--readFilesIn <path to fastq> \
--runThreadN <number of threads to use>
--outFileNamePrefix <path to output file>

If your files are gziped (end in .gz) you need to add the flag:

--readFilesCommand zcat

If you have paired end reads or multiple fastq files, list all the file paths after –readFilesIn with only a space between.

If you have multiple samples (which is likely), I recommend making a .sh file where you edit each line to point to the right input and output paths, or you can use a one line loop (shown in the video above).

Step 3: Generating the count table

Step 3.a. Installing featureCounts

We will use the featureCounts package from subread to generate the count table. Again, installing this through conda is by far the easiest option:

conda install -c bioconda subread

Step 3.b. using featurecounts on the output STAR bams

This step is very straight forward. The output from the STAR alignment will be bam files. Make sure the bam files have names you can use to differentiate between the samples. We can run featurecounts with a wildcard to run it on all the samples at once.

featureCounts \
-a <path/to/gtf/file.gtf> \
-T <number of threads>
<path/to/sample/bam/dir>/*.bam

What is a count table?

A count table contains the number of reads that mapped to each gene that was denoted in the annotation file. Each column is a sample and each row is a gene.

Step 3.c. (optional) clean up count table

Thats it!
You now have a count table. You can clean up the count table because some of the columns are not needed. You can give the sample columns more meaningful names other than the path to the file, which will be their default name.

Continued analysis… differential expression, etc..

This will be covered in future posts; however, my youtube channel has many of these already.