Detect module#

The circRNA detection module of circtools is based on DCC, built to detect and quantify circRNAs with high specificity. DCC works with the STAR (Dobin et al., 2013) Chimeric.out.junction files which contains chimerically aligned reads including circRNA junction spanning reads.

The module depends on pysam, pandas, numpy, and HTSeq. The installation process of circtools will automatically check for the dependencies and install or update missing Python packages.

General usage#

The detection of circRNAs from RNA-seq data through the detection module can be summarized in a few steps:

  • Mapping of RNA-seq data from quality checked Fastq files. For paired-end (PE) data it is recommended to map the reads pairs jointly as well as separately because STAR does not output reads or read pairs that contain more than one chimeric junction.

  • Preparation of input files required by the detection module. In summary, only a samplesheet, which specifies the locations for the chimeric.out.junction files is required (one relative or absolute path per line). [Command line flag: @samplesheet]

  • A GTF formatted annotation of repetitive regions which is used to filter out circRNA candidates from repetitive regions is recommended. [Command line flag: -R repeat_file.gtf]

  • For paired-end sequencing two files, e.g. mate1 and mate2 which contain the paths to the chimeric.out.junction files originating from the separate mate mapping step are required. [Command line flags: -m1 mate1file and -m2 mate2file]

  • The locations of the BAM files can be specified via the -B flag. Otherwise the detection module tries to guess their location based on the supplied chimeric.out.junction paths. [Command line flag: -B @bam_file_list]

  • The detection module requires the SJ.out.tab files generated by STAR. Circtools assumes that these are located in the same folder as the BAM files and also retain their SJ.out.tab name.

  • Circtools can be used to process circular RNA detection and host gene expression detection either in a one-pass strategy or only one part of the analysis:

  • Detection of circRNAs and host gene expression: -D and -G option have to be supplied

  • Detection of circRNAs only: -D

  • Detection of host gene expression only: -G

Step by step tutorial#

In this tutorial, we use the data set from Jakobi et al. 2016 as an example. The data are paired-end, stranded RiboMinus RNAseq data from Mus musculus, consisting of samples of four ages (2, 3, 6, and 12 month) collected from the whole hearts. Data can be downloaded from the NCBI SRA (accession number SRP071584). While the circtools suite does not offer specific module for the initial data processing, this short tutorial should give the user an idea on how to get the sequencing data in shape for the main circtools pipeline.

Throughout this tutorial, we will employ Bash wrapper scripts that automate the analysis for multiple samples. While these scripts have been designed to be used with the SLURM workload manager, it is also possible to use them in conjunction with GNU parallel without SLURM.

Download of raw data files from the NCBI SRA#

The raw data of the Jakobi et al. 2016 study was uploaded to the NCBI short read archive (SRA) and converted to the NCBI SRA format. Before we can start processing the data, the sra-toolkit needs to be installed. In order to simplify the download process, the wonderdump script is used.

# create main folder and raw reads folder
mkdir -p workflow/reads

cd workflow/reads

# place your copy of wonderdump.sh in this directory
# also make it executable
wget https://links.jakobilab.org/wonderdump.sh
chmod 755 wonderdump.sh

# get list of accession numbers (sequencing runs) to download
# also get a mapping file from SRA accession to original file name
wget https://links.jakobilab.org/jakobi2016_sra_list.txt
wget https://links.jakobilab.org/file_name_mapping.txt

# downloading and rewriting the files as gzipped .fastq files will take some time
# in the end, the process will generate a set of 16 files (8 samples x 2 pairs)
# this step requires an installed and configured NCBI sratoolkit

# start wonderdump with the accession list and download data (~29GB)
for i in $( jakobi2016_sra_list.txt )
do
    ./wonderdump.sh --split-files --gzip "$i" &
    sleep 60
done

# rename files from SRA accessions to file names used throughout this tutorial

# mate 1
ln -s SRR7881338_1.fastq.gz ALL_1654_N__R1.fastq.gz
ln -s SRR7881276_1.fastq.gz ALL_1654_M__R1.fastq.gz
ln -s SRR7881275_1.fastq.gz ALL_1654_P__R1.fastq.gz
ln -s SRR7881335_1.fastq.gz ALL_1654_O__R1.fastq.gz
ln -s SRR7881337_1.fastq.gz ALL_1654_Q__R1.fastq.gz
ln -s SRR7881336_1.fastq.gz ALL_1654_R__R1.fastq.gz
ln -s SRR7881333_1.fastq.gz ALL_1654_S__R1.fastq.gz
ln -s SRR7881334_1.fastq.gz ALL_1654_T__R1.fastq.gz

# mate 2
ln -s SRR7881338_2.fastq.gz ALL_1654_N__R2.fastq.gz
ln -s SRR7881276_2.fastq.gz ALL_1654_M__R2.fastq.gz
ln -s SRR7881275_2.fastq.gz ALL_1654_P__R2.fastq.gz
ln -s SRR7881335_2.fastq.gz ALL_1654_O__R2.fastq.gz
ln -s SRR7881337_2.fastq.gz ALL_1654_Q__R2.fastq.gz
ln -s SRR7881336_2.fastq.gz ALL_1654_R__R2.fastq.gz
ln -s SRR7881333_2.fastq.gz ALL_1654_S__R2.fastq.gz
ln -s SRR7881334_2.fastq.gz ALL_1654_T__R2.fastq.gz

Data structure#

Once the data is downloaded, the following data structure is assumed:

cd workflow/reads

ls -la

ALL_1654_M__R1.fastq.gz
ALL_1654_M__R2.fastq.gz
ALL_1654_N__R1.fastq.gz
ALL_1654_N__R2.fastq.gz
ALL_1654_O__R1.fastq.gz
ALL_1654_O__R2.fastq.gz
ALL_1654_P__R1.fastq.gz
ALL_1654_P__R2.fastq.gz
ALL_1654_Q__R1.fastq.gz
ALL_1654_Q__R2.fastq.gz
ALL_1654_R__R1.fastq.gz
ALL_1654_R__R2.fastq.gz
ALL_1654_S__R1.fastq.gz
ALL_1654_S__R2.fastq.gz
ALL_1654_T__R1.fastq.gz
ALL_1654_T__R2.fastq.gz

Adapter removal and quality trimming#

In a first step, remaining adapter sequences originating from the sequencing-process need to be removed together with potential low-quality bases. There are plenty of tools available to perform this task, in this tutorial we go with flexbar. The next steps assume that flexbar has been installed and in visible in the $PATH variable.

# download the wrapper scrips for flexbar
wget https://raw.githubusercontent.com/jakobilab/bioinfo-scripts/master/slurm_flexbar_paired.sh
# add execute permission
chmod 755 slurm_flexbar_paired.sh
mkdir flexbar
cd reads
# we now execute flexbar on all of the sample while keeping all paired end sample correctly mapped
parallel -j1 --xapply ../slurm_flexbar_paired.sh {1} {2} ../flexbar/ _R1 ::: *_R1.fastq.gz ::: *_R2.fastq.gz

After flexbar finished processing, the folder flexbar/ contains the trimmed, quality-filtered read sets.

Removal of rRNA with Bowtie2#

Next, rRNA reads are removed. Normally, we are not interested in these reads, especially when performing circRNA analysis. Removing those reads will also slightly speed up subsequent steps due to the reduced computational load. In this tutorial, Bowtie2 is used in order to discard reads that map against rRNA loci. This tutorial assumes bowtie2 has already been installed an is callable with $PATH. A precompiled bowtie2 index of mouse rRNA loci can has been uploaded for this purpose. In brief, bowtie2 maps the reads against a “reference genome” of rRNA loci and only keeps reads, that do not align and therefore are rRNA-free.

# download wrapper for bowtie2
wget https://raw.githubusercontent.com/jakobilab/bioinfo-scripts/master/slurm_bowtie2_rRNA_filter_paired.sh
chmod 755 slurm_bowtie2_rRNA_filter_paired.sh
mkdir rrna/
cd flexbar/
parallel -j1 --xapply ../slurm_bowtie2_rRNA_filter_paired.sh /path/to/extracted/bowtie2/index/mus-musculus.rRNA {1} {2} ../rrna/ .1 ::: *_1.fastq.gz ::: *_2.fastq.gz

After this step the rrna/ folder contains adapter-free, rRNA-free reads ready for final mapping.

Mapping against the reference genome#

In order to map the preprocessed reads against the reference genome and, ultimately detect possible cirRNAs the STAR read mapper is employed. STAR has shown to exhibit a good performance, is highly customizable and, most importantly is able to directly export chimeric reads that are the basis for the circRNA detection process. Again, we are using a wrapper script that simplifies the process of calling STAR for all samples. Like bowtie2, STAR requires an index in order to align reads. Since building this index requires a huge amount of RAM, a precomputed STAR index for the mouse ENSEMBL 90 build has been prepared for direct download (~22GB).

Essentially, the wrapper script for STAR performs the following tasks:

  • Map both reads of each pair against the reference genome

  • Map the unmapped reads of read 1 or read 2 again against the reference genome without the corresponding paired partner

  • Several conversion and cleanup steps of the STAR output

Please note that the supplied examples are a little aged and will not work with the latest STAR version. Version 2.6.1d is known to work while version 2.7.0f and newer does not.

# download wrapper for STAR
wget https://raw.githubusercontent.com/jakobilab/bioinfo-scripts/master/slurm_circtools_detect_paired_mapping.sh
chmod 755 slurm_circtools_detect_paired_mapping.sh

To allow this script to work the following changes are needed:

  • Remove the ‘–chimMultimapNmax 20’ options.

  • Change the ‘–chimOutType’ options to ‘–chimOutType Junctions SeparateSAMold’.

Manual mapping process#

Following a few notes on the manual mapping process:

Note

--alignSJoverhangMin and --chimJunctionOverhangMin should use the same values to make the circRNA expression and linear gene expression level comparable.

In a first step the paired-end data is mapped by using both mates. If the data is paired-end, an additional separate mate mapping is recommended (while not mandatory, this step will increase the sensitivity due to the the processing of small circular RNAs with only one chimeric junction point at each read mate). If the data is single-end, only this mapping step is required. In case of the Jakobi et al. 2016 data however, paired-end sequencing data is available.

Warning

Starting with version 2.6.0 of STAR, the chimeric output format changed. In order to be compliant with the circtools work flow the old output mode has to be selected via --chimOutType Junctions SeparateSAMold

$ mkdir Sample1
$ cd Sample1
$ STAR    --runThreadN 10\
          --genomeDir [genome]\
          --genomeLoad NoSharedMemory\
          --readFilesIn Sample1_1.fastq.gz  Sample1_2.fastq.gz\
          --readFilesCommand zcat\
          --outFileNamePrefix [sample prefix]\
          --outReadsUnmapped Fastx\
          --outSAMattributes NH   HI   AS   nM   NM   MD   jM   jI   XS\
          --outSJfilterOverhangMin 15   15   15   15\
          --outFilterMultimapNmax 20\
          --outFilterScoreMin 1\
          --outFilterMatchNminOverLread 0.7\
          --outFilterMismatchNmax 999\
          --outFilterMismatchNoverLmax 0.05\
          --alignIntronMin 20\
          --alignIntronMax 1000000\
          --alignMatesGapMax 1000000\
          --alignSJoverhangMin 15\
          --alignSJDBoverhangMin 10\
          --alignSoftClipAtReferenceEnds No\
          --chimSegmentMin 15\
          --chimScoreMin 15\
          --chimScoreSeparation 10\
          --chimJunctionOverhangMin 15\
          --sjdbGTFfile [GTF annotation]\
          --quantMode GeneCounts\
          --twopassMode Basic\
          --chimOutType Junctions SeparateSAMold
  • This step may be skipped when single-end data is used. Separate per-mate mapping. The naming of mate1 and mate2 has to be consistent with the order of the reads from the joint mapping performed above. In this case, SamplePairedRead_1.fastq.gz is the first mate since it was referenced at the first position in the joint mapping.

# Create a directory for mate1
$ mkdir mate1
$ cd mate1
$ STAR    --runThreadN 10\
          --genomeDir [genome]\
          --genomeLoad NoSharedMemory\
          --readFilesIn Sample1_1.fastq.gz\
          --readFilesCommand zcat\
          --outFileNamePrefix [sample prefix]\
          --outReadsUnmapped Fastx\
          --outSAMattributes NH   HI   AS   nM   NM   MD   jM   jI   XS\
          --outSJfilterOverhangMin 15   15   15   15\
          --outFilterMultimapNmax 20\
          --outFilterScoreMin 1\
          --outFilterMatchNminOverLread 0.7\
          --outFilterMismatchNmax 999\
          --outFilterMismatchNoverLmax 0.05\
          --alignIntronMin 20\
          --alignIntronMax 1000000\
          --alignMatesGapMax 1000000\
          --alignSJoverhangMin 15\
          --alignSJDBoverhangMin 10\
          --alignSoftClipAtReferenceEnds No\
          --chimSegmentMin 15\
          --chimScoreMin 15\
          --chimScoreSeparation 10\
          --chimJunctionOverhangMin 15\
          --sjdbGTFfile [GTF annotation]\
          --quantMode GeneCounts\
          --twopassMode Basic\
          --chimOutType Junctions SeparateSAMold
  • The process is repeated for the second mate:

# Create a directory for mate2
$ mkdir mate2
$ cd mate2
$ STAR    --runThreadN 10\
          --genomeDir [genome]\
          --genomeLoad NoSharedMemory\
          --readFilesIn Sample1_2.fastq.gz\
          --readFilesCommand zcat\
          --outFileNamePrefix [sample prefix]\
          --outReadsUnmapped Fastx\
          --outSAMattributes NH   HI   AS   nM   NM   MD   jM   jI   XS\
          --outSJfilterOverhangMin 15   15   15   15\
          --outFilterMultimapNmax 20\
          --outFilterScoreMin 1\
          --outFilterMatchNminOverLread 0.7\
          --outFilterMismatchNmax 999\
          --outFilterMismatchNoverLmax 0.05\
          --alignIntronMin 20\
          --alignIntronMax 1000000\
          --alignMatesGapMax 1000000\
          --alignSJoverhangMin 15\
          --alignSJDBoverhangMin 10\
          --alignSoftClipAtReferenceEnds No\
          --chimSegmentMin 15\
          --chimScoreMin 15\
          --chimScoreSeparation 10\
          --chimJunctionOverhangMin 15\
          --sjdbGTFfile [GTF annotation]\
          --quantMode GeneCounts\
          --twopassMode Basic\
          --chimOutType Junctions SeparateSAMold

Detection of circular RNAs from chimeric.out.junction files with circtools#

Acquiring suitable GTF files for repeat masking#

  • It is strongly recommended to specify a repetitive region file in GTF format for filtering.

  • A suitable file can for example be obtained through the UCSC table browser . After choosing the genome, a group like Repeats or Variation and Repeats has to be selected. For the track, we recommend to choose RepeatMasker together with Simple Repeats and combine the results afterwards.

  • Note: the output file needs to comply with the GTF format specification. Additionally it may be the case that the names of chromosomes from different databases differ, e.g. 1 for chromosome 1 from ENSEMBL compared to chr1 for chromosome 1 from UCSC. Since the chromosome names are important for the correct functionality of circtools a sample command for converting the identifiers may be:

  • A sample repeat file for the mouse mm10 genome can also be downloaded

# Example to convert UCSC identifiers to to ENSEMBL standard

$ sed -i 's/^chr//g' your_repeat_file.gtf

Preparation of input files for circRNA detection step#

We first create a new folder for the circtools detection step and populate that folder with links to the required files produced by STAR.

# create new folder for circRNA detection
mkdir -p circtools_detect/

# the following script automatically prepares all
# required input files for circtools detect
wget https://links.jakobilab.org/prep_circtools.sh
chmod 755 prep_circtools.sh

# two parameters required:
# 1) STAR directory
# 2) circtools directory
./prep_circtools.sh star/ circtools_detect/

Additionally to the newly created files, a reference genome in Fasta format as well as an appropriate annotation containing repetitive regions should be provided:

# step one: obtain reference genome
wget ftp://ftp.ensembl.org/pub/release-90/fasta/mus_musculus/dna/Mus_musculus.GRCm38.dna.primary_assembly.fa.gz
gzip -d Mus_musculus.GRCm38.dna.primary_assembly.fa.gz

# step two: repeat masker file for the genome build:
wget https://links.jakobilab.org/mouse_repeats.gtf.bz2
bunzip2 mouse_repeats.gtf.bz2

Running circtools circRNA detection#

After performing all preparation steps the detection module can now be started:

# Run circtools detection to detect circRNAs, using the`Jakobi et al. 2016 <https://www.sciencedirect.com/science/article/pii/S167202291630033X>`_ data set

$ circtools detect @samplesheet \
      -mt1 @mate1 \
      -mt2 @mate2 \
      -B @bam_files.txt \
      -D \
      -R mouse_repeats.gtf \
      -an ../../Mus_musculus.GRCm38.90.gtf \
      -Pi \
      -F \
      -M \
      -Nr 5 6 \
      -fg \
      -G \
      -A Mus_musculus.GRCm38.dna.primary_assembly.fa

Note

By default, circtools assumes that the data is stranded. For non-stranded data the -N flag should be used

Note

Although not mandatory, we strongly recommend to the -F filtering step

Output files#

The output of circtools detect consists of the following four files: CircRNACount, CircCoordinates, LinearCount and CircSkipJunctions.

  • CircRNACount: a table containing read counts for circRNAs detected. First three columns are chr, circRNA start, circRNA end. From fourth column on are the circRNA read counts, one sample per column, shown in the order given in your samplesheet.

  • CircCoordinates: circular RNA annotations in BED format. The columns are chr, start, end, genename, junctiontype (based on STAR; 0: non-canonical; 1: GT/AG, 2: CT/AC, 3: GC/AG, 4: CT/GC, 5: AT/AC, 6: GT/AT), strand, circRNA region (startregion-endregion), overall regions (the genomic features circRNA coordinates interval covers).

  • LinearCount: host gene expression count table, same setup with CircRNACount file.

  • CircSkipJunctions: circSkip junctions. The first three columns are the same as in LinearCount/CircRNACount, the following columns represent the circSkip junctions found for each sample. circSkip junctions are given as chr:start-end:count, e.g. chr1:1787-6949:10. It is possible that for one circRNA multiple circSkip junctions are found due to the fact the the circular RNA may arise from different isoforms. In this case, multiple circSkip junctions are delimited with semicolon. A 0 implies that no circSkip junctions have been found for this circRNA.

Feedback#

  • In case of any problems or feature requests please do not hesitate to open an issue on GitHub: Create an issue

  • Please make sure to run circtools detect with the -k flag when reporting an error to keep temporary files important for debugging purposes

  • Please also always paste or include the log file