Detect module

The circRNA detection module f circtools is based on DCC, a Python 2.7-based tool built to detect and quantify circRNAs with high specificity. DCC works with the STAR (Dobin et al., 2013) chimeric.out.junction files which contains chimerically aligned reads including circRNA junction spanning reads.

DCC depends on pysam, pandas, numpy, and HTSeq. The installation process of circtools will normally automatically check for the dependencies, install or update missing Python packages and install the latest stable version of DCC.

Manual installation instructions

Option 1

Download of the latest stable DCC release:

$ tar -xvf DCC-<version>.tar.gz
$ cd DCC-<version>
$ python setup.py install --user

Option 2

Cloning the source repository:

$ git clone https://github.com/dieterich-lab/DCC.git
$ cd DCC
$ python setup.py install --user

Verifying the installation:

$ DCC --version

If the Python installation binary path [e.g. /$HOME/.local/bin for Debian] is not included the $PATH, it is also possible run DCC directly:

$ python <DCC directory>/scripts/DCC <Options>
# or even
$ python <DCC directory>/DCC/main.py <Options>

Usage

The detection of circRNAs from RNAseq data through the detection module can be summarized in a few steps:

  • Mapping of RNAseq data from quality checked fastq files. For paired-end (PE) data it is recommended to map the reads pairs jointly as well as separately because STAR does not output reads or read pairs that contain more than one chimeric junction.
  • Preparation of input files required by the detection module. In summary, only a samplesheet, which specifies the locations for the chimeric.out.junction files is required (one relative or absolute path per line). [Command line flag: @samplesheet]
  • A GTF formatted annotation of repetitive regions which is used to filter out circRNA candidates from repetitive regions is recommended. [Command line flag: -R repeat_file.gtf]
  • For paired-end sequencing two files, e.g. mate1 and mate2 which contain the paths to the chimeric.out.junction files originating from the separate mate mapping step are required. [Command line flags: -m1 mate1file and -m2 mate2file]
  • The locations of the BAM files can be specified via the -B flag. Otherwise the detection module tries to guess their location based on the supplied chimeric.out.junction paths. [Command line flag: -B @bam_file_list]
  • The detection module requires the SJ.out.tab files generated by STAR. Circtools assumes that these are located in the same folder as the BAM files and also retain their SJ.out.tab name.
  • Circtools can be used to process circular RNA detection and host gene expression detection either in a one-pass strategy or only one part of the analysis:
  • Detection of circRNAs and host gene expression: -D and -G option have to be supplied
  • Detection of circRNAs only: -D
  • Detection of host gene expression only: -G

Step by step tutorial

In this tutorial, we use the data set from Westholm et al. 2014 as an example. The data are paired-end, stranded RiboMinus RNAseq data from Drosophila melanogaster, consisting of samples of three developmental stages (1 day, 4 days, and 20 days) collected from the heads. Data can be downloaded from the NCBI SRA (accession number SRP001696).

Mapping of the fastq files with STAR

Note

--alignSJoverhangMin and --chimJunctionOverhangMin should use the same values to make the circRNA expression and linear gene expression level comparable.

  • In a first step the paired-end data is mapped by using both mates. If the data is paired-end, an additional separate mate mapping is recommended (while not mandatory, this step will increase the sensitivity due to the the processing of small circular RNAs with only one chimeric junction point at each read mate). If the data is single-end, only this mapping step is required. In case of the Westholm data however, paired-end sequencing data is available.
$ mkdir Sample1
$ cd Sample1
$ STAR --runThreadN 10 \
       --genomeDir [genome] \
       --outSAMtype BAM SortedByCoordinate \
       --readFilesIn Sample1_1.fastq.gz  Sample1_2.fastq.gz \
       --readFilesCommand zcat \
       --outFileNamePrefix [sample prefix] \
       --outReadsUnmapped Fastx \
       --outSJfilterOverhangMin 15 15 15 15 \
       --alignSJoverhangMin 15 \
       --alignSJDBoverhangMin 15 \
       --outFilterMultimapNmax 20 \
       --outFilterScoreMin 1 \
       --outFilterMatchNmin 1 \
       --outFilterMismatchNmax 2 \
       --chimSegmentMin 15 \
       --chimScoreMin 15 \
       --chimScoreSeparation 10 \
       --chimJunctionOverhangMin 15 \

Warning

Starting with version 2.6.0 of STAR, the chimeric output format changed. In order to be compliant with the circtools work flow the old output mode has to be selected via --chimOutType Junctions SeparateSAMold

  • This step may be skipped when single-end data is used. Separate per-mate mapping. The naming of mate1 and mate2 has to be consistent with the order of the reads from the joint mapping performed above. In this case, SamplePairedRead_1.fastq.gz is the first mate since it was referenced at the first position in the joint mapping.
# Create a directory for mate1
$ mkdir mate1
$ cd mate1
$ STAR --runThreadN 10 \
       --genomeDir [genome] \
       --outSAMtype None \
       --readFilesIn Sample1_1.fastq.gz \
       --readFilesCommand zcat \
       --outFileNamePrefix [sample prefix] \
       --outReadsUnmapped Fastx \
       --outSJfilterOverhangMin 15 15 15 15 \
       --alignSJoverhangMin 15 \
       --alignSJDBoverhangMin 15 \
       --seedSearchStartLmax 30 \
       --outFilterMultimapNmax 20 \
       --outFilterScoreMin 1 \
       --outFilterMatchNmin 1 \
       --outFilterMismatchNmax 2 \
       --chimSegmentMin 15 \
       --chimScoreMin 15 \
       --chimScoreSeparation 10 \
       --chimJunctionOverhangMin 15 \

Warning

Starting with version 2.6.0 of STAR, the chimeric output format changed. In order to be compliant with the circtools work flow the old output mode has to be selected via --chimOutType Junctions SeparateSAMold

  • The process is repeated for the second mate:
# Create a directory for mate2
$ mkdir mate2
$ cd mate2
$ STAR --runThreadN 10 \
       --genomeDir [genome] \
       --outSAMtype None \
       --readFilesIn Sample1_2.fastq.gz \
       --readFilesCommand zcat \
       --outFileNamePrefix [sample prefix] \
       --outReadsUnmapped Fastx \
       --outSJfilterOverhangMin 15 15 15 15 \
       --alignSJoverhangMin 15 \
       --alignSJDBoverhangMin 15 \
       --seedSearchStartLmax 30 \
       --outFilterMultimapNmax 20 \
       --outFilterScoreMin 1 \
       --outFilterMatchNmin 1 \
       --outFilterMismatchNmax 2 \
       --chimSegmentMin 15 \
       --chimScoreMin 15 \
       --chimScoreSeparation 10 \
       --chimJunctionOverhangMin 15 \

Warning

Starting with version 2.6.0 of STAR, the chimeric output format changed. In order to be compliant with the circtools work flow the old output mode has to be selected via --chimOutType Junctions SeparateSAMold

Detection of circular RNAs from chimeric.out.junction files with circtools

Acquiring suitable GTF files for repeat masking

  • It is strongly recommended to specify a repetitive region file in GTF format for filtering.
  • A suitable file can for example be obtained through the UCSC table browser . After choosing the genome, a group like Repeats or Variation and Repeats has to be selected. For the track, we recommend to choose RepeatMasker together with Simple Repeats and combine the results afterwards.
  • Note: the output file needs to comply with the GTF format specification. Additionally it may be the case that the names of chromosomes from different databases differ, e.g. 1 for chromosome 1 from ENSEMBL compared to chr1 for chromosome 1 from UCSC. Since the chromosome names are important for the correct functionality of circtools a sample command for converting the identifiers may be:
# Example to convert UCSC identifiers to to ENSEMBL standard

$ sed -i 's/^chr//g' your_repeat_file.gtf

Preparation of files containing the paths to required chimeric.out.junction files

  • samplesheet file, containing the paths to the jointly mapped chimeric.out.junction files
$ cat samplesheet
/path/to/STAR/sample_1/joint_mapping/chimeric.out.junction
/path/to/STAR/sample_2/joint_mapping/chimeric.out.junction
/path/to/STAR/sample_3/joint_mapping/chimeric.out.junction
  • mate1 file, containing the paths to chimeric.out.junction files of the separately mapped first read of paired-end data
$ cat mate2
/path/to/STAR/sample_1_mate1/joint_mapping/chimeric.out.junction
/path/to/STAR/sample_2_mate1/joint_mapping/chimeric.out.junction
/path/to/STAR/sample_3_mate1/joint_mapping/chimeric.out.junction
  • mate2 file, containing the paths to chimeric.out.junction files of the separately mapped first read of paired-end data
$ cat mate2
/path/to/STAR/sample_1_mate2/joint_mapping/chimeric.out.junction
/path/to/STAR/sample_2_mate2/joint_mapping/chimeric.out.junction
/path/to/STAR/sample_3_mate2/joint_mapping/chimeric.out.junction

Pre-mapped chimeric.out.junction files from Westholm et al. data set are part of the DCC distribution

$ <DCC directory>/DCC/data/samplesheet # jointly mapped chimeric.junction.out files
$ <DCC directory>/DCC/data/mate1 # mate1 independently mapped chimeric.junction.out files
$ <DCC directory>/DCC/data/mate1 # mate2 independently mapped chimeric.junction.out files

Runnning circtools circRNA detection

After performing all preparation steps the detection module can now be started:

# Run circtools detection to detect circRNAs, using Westholm data as example

$ circtools detect @samplesheet \ # @ is generally used to specify a file name
      -mt1 @mate1 \ # mate1 file containing the mate1 independently mapped chimeric.junction.out files
      -mt2 @mate2 \ # mate2 file containing the mate1 independently mapped chimeric.junction.out files
      -D \ # run in circular RNA detection mode
      -R [Repeats].gtf \ # regions in this GTF file are masked from circular RNA detection
      -an [Annotation].gtf \ # annotation is used to assign gene names to known transcripts
      -Pi \ # run in paired independent mode, i.e. use -mt1 and -mt2
      -F \ # filter the circular RNA candidate regions
      -M \ # filter out candidates from mitochondrial chromosomes
      -Nr 5 6 \ minimum count in one replicate [1] and number of replicates the candidate has to be detected in [2]
      -fg \ # candidates are not allowed to span more than one gene
      -G \ # also run host gene expression
      -A [Reference].fa \ # name of the fasta genome reference file; must be indexed, i.e. a .fai file must be present

# For single end, non-stranded data:
$ circtools detect @samplesheet -D -R [Repeats].gtf -an [Annotation].gtf -F -M -Nr 5 6 -fg -G -A [Reference].fa

$ circtools detect @samplesheet -mt1 @mate1 -mt2 @mate2 -D -S -R [Repeats].gtf -an [Annotation].gtf -Pi -F -M -Nr 5 6 -fg

# For details on the parameters please refer to the help page of the detection module:
$ circtools detect -h

Note

By default, circtools assumes that the data is stranded. For non-stranded data the -N flag should be used

Note

Although not mandatory, we strongly recommend to the -F filtering step

Output files

The output of circtools detect consists of the following four files: CircRNACount, CircCoordinates, LinearCount and CircSkipJunctions.

  • CircRNACount: a table containing read counts for circRNAs detected. First three columns are chr, circRNA start, circRNA end. From fourth column on are the circRNA read counts, one sample per column, shown in the order given in your samplesheet.
  • CircCoordinates: circular RNA annotations in BED format. The columns are chr, start, end, genename, junctiontype (based on STAR; 0: non-canonical; 1: GT/AG, 2: CT/AC, 3: GC/AG, 4: CT/GC, 5: AT/AC, 6: GT/AT), strand, circRNA region (startregion-endregion), overall regions (the genomic features circRNA coordinates interval covers).
  • LinearCount: host gene expression count table, same setup with CircRNACount file.
  • CircSkipJunctions: circSkip junctions. The first three columns are the same as in LinearCount/CircRNACount, the following columns represent the circSkip junctions found for each sample. circSkip junctions are given as chr:start-end:count, e.g. chr1:1787-6949:10. It is possible that for one circRNA multiple circSkip junctions are found due to the fact the the circular RNA may arise from different isoforms. In this case, multiple circSkip junctions are delimited with semicolon. A 0 implies that no circSkip junctions have been found for this circRNA.

Feedback

  • In case of any problems or feature requests please do not hesitate to open an issue on GitHub: Create an issue
  • Please make sure to run circtools detect with the -k flag when reporting an error to keep temporary files important for debugging purposes
  • Please also always paste or include the log file