Nanopore module [New in 2.0]#
The circtools nanopore module is designed to process
circtools nanopore
requires sequencing reads that have been produced using the protocol outlined in Rahimi et al. (2021). Specifically, OxFord Nanopore sequencing reds from this protocol will cover circular RNAs over their full length once or even multiple times. The circtools circtools nanopore
module is based on code published as part of the Rahimi et al. (2021) study (GitHub repository). The original code has been thoroughly updated and the circtools nanopore
module now provides support for hg38 and mm10 through a completely new reference data retrieval system that automatically downloads and postprocesses reference data from public databases such as Genome Browser and ENSEMBL without the need for short-term links to private online drives. This new system also enables easy addition of other model species as long as they are supported by ENSEMBL and the Genome Browser. Moreover, Perl scripts were replaced by new implementations in Python3 to seamlessly integrate in the circtools 2.0 software framework and reduce the overall software maintenance burden.
Required tools and packages#
nanopore
depends several external tools, namely
General usage#
A call to circtools nanopore --help
shows all available command line flags:
usage: circtools.py [-h] (-r | -c | -d) [-s SAMPLE] [-R REFERENCE_PATH] [-O OUTPUT_PATH] [-C {hg19,hg38,mm9,mm10}] [-t THREADS] [-D] [-k]
circular RNA detection in Oxford Nanopore data
options:
-h, --help show this help message and exit
-r, --run Run the analysis
-c, --check Check the installation for required software.
-d, --download Download third-party data, such as genomes required for the analysis.
Options:
-s SAMPLE, --sample SAMPLE
Provide a sample input .fq.gz file that should be processed.
-R REFERENCE_PATH, --reference-path REFERENCE_PATH
Provide a path for where the reference data is located. Default is './data'.
-O OUTPUT_PATH, --output OUTPUT_PATH
Provide a path for where the output data is stored.
-C {hg19,hg38,mm9,mm10}, --config {hg19,hg38,mm9,mm10}
Required. Select which genome build the sample that is from, and specify which genome reference files should be used.
-t THREADS, --threads THREADS
Number of threads for parallel steps. Default: 4.
-D, --dry-run Perform all of the input checks without starting the detection scripts.
-k, --keep-temp Keep all of the temporary files.
Setup: Check if external software is available#
Note
If the Docker image is used all required software is already installed within the image.
In order to check if the external software has been installed correctly and can be used by the nanopore module a check can be run:
circtools nanopore -c
This should produce the following output, indicating tat software is accessible:
Checking for bedtools
Checking for NanoFilt
Checking for pblat
Checking for samtools
All of the expected software requirements are present!
Should software not be installed, e.g. pblat, an error message is shown:
Checking for bedtools
Checking for NanoFilt
Checking for pblat
Unable to find pblat!
Checking for samtools
ERROR: Some of the required software is missing!
Step 1: Download required data#
circtools nanopore -d -R reference/ -C hg38
Here the reference data will be downloaded into in the folder reference/
and we are download all require files for the human genome, build hg38. The folder will be automatically created if it does not exist. For each reference genome, a suitable sub-folder will be created, e.g. hg38 which contains all required and post-processed files. All downloads are linking to public sources, such as the Genome Browser; links are stored in YAML files available in the GitHub repository.
We are welcoming pull requests for additional genome builds!
The download progress is visible in the command line together with automatic post-processing:
Storing reference data in reference/
Downloading genome.fa.gz: 100%|█████████████████████████████████████████████| 984M/984M [01:00<00:00, 16.3MB/s]
Unpacking.
Done.
Downloading genome.chrom.sizes: 100%|██████████████████████████████████████| 11.7k/11.7k [00:00<00:00, 602kB/s]
Downloading refFlat.csv.gz: 3.92MB [00:01, 3.22MB/s]
Creating refFlat-based exon files
Downloading gencode.csv.gz: 100%|█████████████████████████████████████████| 59.0M/59.0M [00:34<00:00, 1.70MB/s]
Unpacking.
Done.
Creating GENCODE-based exon files
Start parsing GTF file
Downloading gencode_intron.bed.gz: 8.74MB [00:03, 2.51MB/s]
Unpacking.
Done.
Downloading est.bed.gz: 444MB [03:20, 2.22MB/s]
Unpacking.
Done.
In the above example, the folder reference/hg38/
should now contain the following files occupying around 8GB of disk space.
est.bed
gencode.csv
gencode.csv.exon.bed
gencode.csv.exon.merge.bed
gencode_intron.bed
genome.chrom.sizes
genome.fa
refFlat.csv.gz
refFlat.csv.merged.bed
refFlat.csv.sort.bed
refFlat.csv.unique.bed
The file names are identical for each genome build, only the folder name indicates which genome is stored in each folder.
Step 2: Run the nanopore pipeline#
To run the main workflow of the circtools nanopore
module, users need to specify the reference genome (-R reference/
), output path (-O results/
), and the FASTQ file containing the Oxford Nanopore reads (-s human_nanopore.fastq.gz
). An example dataset consisting of 100k human brain nanopore reads is available for download. The --threads 16
argument is optional, but can be supplied to speed up processing by using multiple CPU threads, in this case 16 threads:
circtools.py nanopore -r -s human_nanopore.fastq.gz -R reference/ -C hg38 -O results/ --threads 16
The pipeline outputs a number of output files, specifically:
ls -la results/
human_nanopore.circ_circRNA_exon_usage_length_of_exons.txt
human_nanopore.circRNA_candidates.annotated.txt
human_nanopore.novel.cryptic.spliced.exons.txt
human_nanopore.novel.exons.2reads.filter.bed
human_nanopore.novel.exons.2reads.phases.tab
human_nanopore.Potential_multi-round_circRNA.fa
human_nanopore.scan.circRNA.psl.split.merge.flank2.allExons.10reads.bed
human_nanopore.scan.circRNA.psl.split.merge.flank2.allExons.20reads.bed
human_nanopore.scan.circRNA.psl.split.merge.flank2.allExons.2reads.bed
human_nanopore.scan.circRNA.psl.split.merge.flank2.allExons.3reads.bed
human_nanopore.scan.circRNA.psl.split.merge.flank2.allExons.50reads.bed
human_nanopore.scan.circRNA.psl.split.merge.flank2.allExons.5reads.bed
human_nanopore.scan.circRNA.psl.split.merge.flank2.allExons.bed
human_nanopore.scan.circRNA.psl.split.merge.flank2.allExons.notGencode.bed
human_nanopore.scan.Potential_multi-round_circRNA.psl.annot.bed
human_nanopore.scan.Potential_multi-round_circRNA.psl.annot.count.txt
The files are prefixed with the sample name (input FASTQ file name minus extension) and are named intuitively. The main output file has the suffix circRNA_candidates.annotated.txt and contains the list of circRNAs detected in the run. Specifically, the files contains the following columns for each circRNA:
1 internal_circRNA_name
2 chr
3 start
4 end
5 description
6 BSJ_reads
7 strand
8 gene
9 reserved
10 reserved
11 reserved
12 mean_read_coverage
13 mean_gene_coverage
14 mean_exon_coverage
15 mean_EST_coverage
16 mean_intron_coverage
17 min_exon_adjust
18 max_exon_adjust
19 mean_exon_adjust