Nanopore module [New in 2.0]

The circtools nanopore module is designed to process Oxford Nanopore sequencing data for circular RNA detection.

circtools nanopore requires sequencing reads that have been produced using the protocol outlined in Rahimi et al. (2021). Specifically, Oxford Nanopore sequencing reads from this protocol cover circular RNAs over their full length once or even multiple times. The circtools nanopore module is based on code published as part of the Rahimi et al. (2021) study (GitHub repository).

The original code has been thoroughly updated. The circtools nanopore module now provides support for hg38 and mm10 through a completely new reference data retrieval system that automatically downloads and post-processes reference data from public databases such as the Genome Browser and ENSEMBL, without the need for short-term links to private online drives. This new system also makes it easy to add other model species, as long as they are supported by ENSEMBL and the Genome Browser. Moreover, Perl scripts were replaced by new implementations in Python 3 to integrate seamlessly into the circtools 2.0 software framework and reduce the overall software maintenance burden.

Required tools and packages

nanopore depends on several external tools, namely bedtools, NanoFilt, pblat, and samtools.

General usage

A call to circtools nanopore --help shows all available command line flags:

usage: circtools.py [-h] (-r | -c | -d) [-s SAMPLE] [-R REFERENCE_PATH] [-O OUTPUT_PATH] [-C {hg19,hg38,mm9,mm10}] [-t THREADS] [-D] [-k]

circular RNA detection in Oxford Nanopore data

options:
  -h, --help            show this help message and exit
  -r, --run             Run the analysis
  -c, --check           Check the installation for required software.
  -d, --download        Download third-party data, such as genomes required for the analysis.

Options:
  -s SAMPLE, --sample SAMPLE
                        Provide a sample input .fq.gz file that should be processed.
  -R REFERENCE_PATH, --reference-path REFERENCE_PATH
                        Provide a path for where the reference data is located. Default is './data'.
  -O OUTPUT_PATH, --output OUTPUT_PATH
                        Provide a path for where the output data is stored.
  -C {hg19,hg38,mm9,mm10}, --config {hg19,hg38,mm9,mm10}
                        Required. Select which genome build the sample that is from, and specify which genome reference files should be used.
  -t THREADS, --threads THREADS
                        Number of threads for parallel steps. Default: 4.
  -D, --dry-run         Perform all of the input checks without starting the detection scripts.
  -k, --keep-temp       Keep all of the temporary files.

Setup: Check if external software is available

Note

If the Docker image is used, all required software is already installed within the image.

To verify that the external software has been installed correctly and can be used by the nanopore module, run the built-in check:

circtools nanopore -c

This should produce the following output, indicating that all required software is accessible:

Checking for bedtools
Checking for NanoFilt
Checking for pblat
Checking for samtools

All of the expected software requirements are present!

If a required tool is not installed, e.g. pblat, an error message is shown:

Checking for bedtools
Checking for NanoFilt
Checking for pblat
        Unable to find pblat!
Checking for samtools

ERROR: Some of the required software is missing!
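For scripted setups, a similar pre-flight check can be approximated outside of circtools. The following is a minimal sketch, assuming that a tool counts as available when its executable is found on the PATH; the function name find_missing_tools is hypothetical and this is not how circtools itself implements the -c check.

```python
import shutil

# Tool names taken from the check output above. How circtools performs
# its own check is not shown in this documentation, so this is only an
# approximation based on a PATH lookup.
REQUIRED_TOOLS = ("bedtools", "NanoFilt", "pblat", "samtools")


def find_missing_tools(tools=REQUIRED_TOOLS):
    """Return the names of all tools that cannot be found on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("All required tools were found on the PATH.")
```

An empty list means every executable was found; the authoritative check remains circtools nanopore -c.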

Step 1: Download required data

circtools nanopore -d -R reference/ -C hg38

Here, the reference data is downloaded into the folder reference/, and all required files for the human genome, build hg38, are retrieved. The folder is created automatically if it does not exist. For each reference genome, a suitable sub-folder is created, e.g. hg38, which contains all required and post-processed files. All downloads link to public sources, such as the Genome Browser; the links are stored in YAML files available in the GitHub repository.

We welcome pull requests for additional genome builds!

The download progress is visible in the command line together with automatic post-processing:

Storing reference data in reference/
Downloading genome.fa.gz: 100%|█████████████████████████████████████████████| 984M/984M [01:00<00:00, 16.3MB/s]
Unpacking.
Done.
Downloading genome.chrom.sizes: 100%|██████████████████████████████████████| 11.7k/11.7k [00:00<00:00, 602kB/s]
Downloading refFlat.csv.gz: 3.92MB [00:01, 3.22MB/s]
Creating refFlat-based exon files
Downloading gencode.csv.gz: 100%|█████████████████████████████████████████| 59.0M/59.0M [00:34<00:00, 1.70MB/s]
Unpacking.
Done.
Creating GENCODE-based exon files
Start parsing GTF file
Downloading gencode_intron.bed.gz: 8.74MB [00:03, 2.51MB/s]
Unpacking.
Done.
Downloading est.bed.gz: 444MB [03:20, 2.22MB/s]
Unpacking.
Done.

In the above example, the folder reference/hg38/ should now contain the following files, occupying around 8 GB of disk space:

est.bed
gencode.csv
gencode.csv.exon.bed
gencode.csv.exon.merge.bed
gencode_intron.bed
genome.chrom.sizes
genome.fa
refFlat.csv.gz
refFlat.csv.merged.bed
refFlat.csv.sort.bed
refFlat.csv.unique.bed

The file names are identical for each genome build; only the folder name indicates which genome is stored in each folder.
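Before starting a long run, it can be useful to confirm that a download finished completely. The sketch below assumes the file list shown above for hg38 applies to every build, as stated; the helper name missing_reference_files is hypothetical and not part of circtools.

```python
from pathlib import Path

# File names copied from the listing above (hg38 example).
EXPECTED_FILES = (
    "est.bed",
    "gencode.csv",
    "gencode.csv.exon.bed",
    "gencode.csv.exon.merge.bed",
    "gencode_intron.bed",
    "genome.chrom.sizes",
    "genome.fa",
    "refFlat.csv.gz",
    "refFlat.csv.merged.bed",
    "refFlat.csv.sort.bed",
    "refFlat.csv.unique.bed",
)


def missing_reference_files(reference_path, build):
    """Return expected files that are absent from <reference_path>/<build>/."""
    build_dir = Path(reference_path) / build
    return [name for name in EXPECTED_FILES if not (build_dir / name).is_file()]


if __name__ == "__main__":
    # Same folder and build as in the download example.
    print(missing_reference_files("reference/", "hg38"))
```

An empty list suggests the reference folder is complete; any listed name points to an interrupted download or post-processing step.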

Step 2: Run the nanopore pipeline

To run the main workflow of the circtools nanopore module, users need to specify the reference data path (-R reference/), the genome build (-C hg38), the output path (-O results/), and the FASTQ file containing the Oxford Nanopore reads (-s human_nanopore.fastq.gz). An example dataset consisting of 100k human brain nanopore reads is available for download. The --threads 16 argument is optional, but can be supplied to speed up processing by using multiple CPU threads, in this case 16:

circtools nanopore -r -s human_nanopore.fastq.gz -R reference/ -C hg38 -O results/ --threads 16
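When many samples are processed, the call can be assembled programmatically. The following is a minimal sketch that uses only the flags listed in the help output (-r, -s, -R, -C, -O, -t, -D); the function names are hypothetical, and it assumes the circtools executable is on the PATH.

```python
import subprocess


def build_nanopore_command(sample, reference_path, build, output_path,
                           threads=4, dry_run=False):
    """Assemble the argument list for a circtools nanopore run."""
    cmd = [
        "circtools", "nanopore", "-r",
        "-s", sample,
        "-R", reference_path,
        "-C", build,
        "-O", output_path,
        "-t", str(threads),
    ]
    if dry_run:
        # -D performs the input checks without starting the detection scripts.
        cmd.append("-D")
    return cmd


def run_nanopore(*args, **kwargs):
    """Run the pipeline and raise CalledProcessError on a non-zero exit code."""
    subprocess.run(build_nanopore_command(*args, **kwargs), check=True)


if __name__ == "__main__":
    # Print the command instead of running it.
    print(" ".join(build_nanopore_command(
        "human_nanopore.fastq.gz", "reference/", "hg38", "results/",
        threads=16, dry_run=True)))
```

Passing the arguments as a list avoids shell quoting issues with unusual file names, and a dry run first can catch input problems before the full analysis starts.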

The pipeline writes a number of output files to the output folder:

ls -la results/

human_nanopore.circ_circRNA_exon_usage_length_of_exons.txt
human_nanopore.circRNA_candidates.annotated.txt
human_nanopore.novel.cryptic.spliced.exons.txt
human_nanopore.novel.exons.2reads.filter.bed
human_nanopore.novel.exons.2reads.phases.tab
human_nanopore.Potential_multi-round_circRNA.fa
human_nanopore.scan.circRNA.psl.split.merge.flank2.allExons.10reads.bed
human_nanopore.scan.circRNA.psl.split.merge.flank2.allExons.20reads.bed
human_nanopore.scan.circRNA.psl.split.merge.flank2.allExons.2reads.bed
human_nanopore.scan.circRNA.psl.split.merge.flank2.allExons.3reads.bed
human_nanopore.scan.circRNA.psl.split.merge.flank2.allExons.50reads.bed
human_nanopore.scan.circRNA.psl.split.merge.flank2.allExons.5reads.bed
human_nanopore.scan.circRNA.psl.split.merge.flank2.allExons.bed
human_nanopore.scan.circRNA.psl.split.merge.flank2.allExons.notGencode.bed
human_nanopore.scan.Potential_multi-round_circRNA.psl.annot.bed
human_nanopore.scan.Potential_multi-round_circRNA.psl.annot.count.txt

The files are prefixed with the sample name (input FASTQ file name minus extension) and are named intuitively. The main output file has the suffix circRNA_candidates.annotated.txt and contains the list of circRNAs detected in the run. Specifically, the file contains the following columns for each circRNA:

 1  internal_circRNA_name
 2  chr
 3  start
 4  end
 5  description
 6  BSJ_reads
 7  strand
 8  gene
 9  reserved
10  reserved
11  reserved
12  mean_read_coverage
13  mean_gene_coverage
14  mean_exon_coverage
15  mean_EST_coverage
16  mean_intron_coverage
17  min_exon_adjust
18  max_exon_adjust
19  mean_exon_adjust
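The column layout above can be used to load the main output file for downstream filtering. The sketch below assumes the file is tab-separated with one circRNA per line in the listed column order; a header line, if present, is skipped because its BSJ_reads field is not numeric. The helper names and the reserved_1 to reserved_3 labels are hypothetical.

```python
import csv
from pathlib import Path

# Column names in the order listed above; the three "reserved" columns
# are given distinct labels so they can serve as dictionary keys.
COLUMNS = [
    "internal_circRNA_name", "chr", "start", "end", "description",
    "BSJ_reads", "strand", "gene",
    "reserved_1", "reserved_2", "reserved_3",
    "mean_read_coverage", "mean_gene_coverage", "mean_exon_coverage",
    "mean_EST_coverage", "mean_intron_coverage",
    "min_exon_adjust", "max_exon_adjust", "mean_exon_adjust",
]


def sample_prefix(fastq_name):
    """Derive the sample name: the FASTQ file name minus its extension."""
    name = Path(fastq_name).name
    for suffix in (".fastq.gz", ".fq.gz", ".fastq", ".fq"):
        if name.endswith(suffix):
            return name[: -len(suffix)]
    return name


def candidates_path(output_path, fastq_name):
    """Build the expected path of the main output file for a sample."""
    file_name = sample_prefix(fastq_name) + ".circRNA_candidates.annotated.txt"
    return Path(output_path) / file_name


def read_candidates(path, min_bsj_reads=0):
    """Read circRNA records, keeping those with enough BSJ reads."""
    records = []
    with open(path, newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            if len(row) < len(COLUMNS):
                continue  # incomplete or empty line
            record = dict(zip(COLUMNS, row))
            try:
                record["BSJ_reads"] = int(record["BSJ_reads"])
            except ValueError:
                continue  # header line or non-numeric value
            if record["BSJ_reads"] >= min_bsj_reads:
                records.append(record)
    return records
```

For the example run, read_candidates(candidates_path("results/", "human_nanopore.fastq.gz"), min_bsj_reads=2) would return only circRNAs supported by at least two back-splice junction reads, provided the assumptions about the file layout hold.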