nf-core/circdna
Pipeline for the identification of extrachromosomal circular DNA (ecDNA) from Circle-seq, WGS, and ATAC-seq data that were generated from cancer and other eukaryotic cells.
Version 1.0.0. The latest stable release is 1.1.0.
Introduction
This document describes the output produced by the pipeline. Most of the plots are taken from the MultiQC report, which summarises results at the end of the pipeline.
The directories listed below will be created in the results directory after the pipeline has finished. All paths are relative to the top-level results directory.
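As a quick orientation, the following sketch checks which of the output directories named in this document exist under a results directory. The directory names are copied from the sections below. The `results` default and the function name are illustrative assumptions, and which directories appear depends on the branches that were run.

```python
from pathlib import Path

# Output directories as named in this document, relative to the
# top-level results directory.
DOCUMENTED_DIRS = [
    "fastqc",
    "trimgalore",
    "markduplicates/bam",
    "markduplicates/duplicates_removed",
    "multiqc",
    "circlefinder",
    "circexplorer2",
    "circlemap/readextractor",
    "circlemap/realign",
    "circlemap/repeats",
    "unicycler",
    "unicycler/minimap2",
    "ampliconarchitect/cnvkit",
    "ampliconarchitect/ampliconarchitect",
    "ampliconarchitect/ampliconclassifier",
    "ampliconarchitect/summary",
    "pipeline_info",
]


def present_output_dirs(results_dir="results"):
    """Return the documented directories that exist under results_dir."""
    root = Path(results_dir)
    return [name for name in DOCUMENTED_DIRS if (root / name).is_dir()]


if __name__ == "__main__":
    for name in present_output_dirs():
        print(name)
```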
Pipeline overview
The pipeline is built using Nextflow and processes data using the following steps:
- FastQC - Raw read QC
- MultiQC - Aggregate report describing results and QC from the whole pipeline
- Pipeline information - Report metrics generated during the workflow execution
- TrimGalore - Read Trimming
- BWA - Read mapping to reference genome
- Samtools - Sorting, indexing, filtering & stats generation of BAM file
- Circle-Map Realign - Identifies putative circular DNA junctions
- Circle-Map Repeats - Identifies putative repetitive circular DNA
- CIRCexplorer2 - Identifies putative circular DNA junctions
- Circle_finder - Identifies putative circular DNA junctions
- AmpliconArchitect - Reconstructs the structure of focally amplified regions
- Unicycler - De novo assembly of circular DNAs
General Tools
FastQC
Output files
fastqc/
- *_fastqc.html: FastQC report containing quality metrics.
- *_fastqc.zip: Zip archive containing the FastQC report, tab-delimited data file and plot images.
FastQC gives general quality metrics about your sequenced reads. It provides information about the quality score distribution across your reads, per base sequence content (%A/T/G/C), adapter contamination and overrepresented sequences. For further reading and documentation see the FastQC help pages.
NB: The FastQC plots displayed in the MultiQC report show untrimmed reads. They may contain adapter sequence and regions of low quality.
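If you want the per-module verdicts without opening the HTML report, they can be read from the zip archive. The sketch below assumes the archive contains a `summary.txt` member with tab-separated lines of the form status, module name, file name, which is how FastQC normally writes it. The function name is illustrative.

```python
import zipfile


def fastqc_module_status(zip_source):
    """Map each FastQC module name to its PASS/WARN/FAIL status.

    zip_source may be a path or a binary file-like object. Assumes the
    archive holds a '<name>_fastqc/summary.txt' member whose lines are
    'STATUS<TAB>Module name<TAB>file name'.
    """
    with zipfile.ZipFile(zip_source) as archive:
        member = next(n for n in archive.namelist() if n.endswith("summary.txt"))
        text = archive.read(member).decode("utf-8")

    status = {}
    for line in text.splitlines():
        fields = line.split("\t")
        if len(fields) >= 2:
            status[fields[1]] = fields[0]
    return status
```

Calling `fastqc_module_status("fastqc/sample_fastqc.zip")` (a hypothetical path) returns a dictionary keyed by module name.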
TrimGalore
Output files
trimgalore/
- *_trimming_report.txt: TrimGalore trimming report.
- fastqc/*_fastqc.zip: Zip archive containing the FastQC report, tab-delimited data file and plot images.
- fastqc/*_fastqc.html: FastQC report containing quality metrics.
TrimGalore combines the trimming tool Cutadapt, which removes adapter sequences, primers and other unwanted sequences, with the quality control tool FastQC.
BWA
BWA is a software package for mapping low-divergent sequences against a large reference genome.
The alignment files it produces are intermediate and are not kept in the final files delivered to users.
Output files
Output directory: results/Reports/[SAMPLE]/SamToolsStats
- [SAMPLE].bam: Alignment file containing information about the read alignment to the reference genome
Samtools
samtools stats
samtools stats collects statistics from BAM files and outputs them in a text format.
Plots will show:
- Alignment metrics.
Output directory: results/Reports/[SAMPLE]/SamToolsStats
- [SAMPLE].bam.samtools.stats.out: Raw statistics used by MultiQC
For further reading and documentation see the samtools manual.
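The raw statistics file can also be inspected directly. In samtools stats output the summary numbers are on lines starting with `SN`, followed by a tab-separated key and value and an optional comment. The sketch below collects them into a dictionary. The function names are illustrative, not part of the pipeline.

```python
def _to_number(value):
    """Convert a string to int or float when possible, else return it unchanged."""
    for cast in (int, float):
        try:
            return cast(value)
        except ValueError:
            continue
    return value


def parse_samtools_stats_summary(text):
    """Collect the 'SN' (summary numbers) lines of samtools stats output."""
    summary = {}
    for line in text.splitlines():
        fields = line.split("\t")
        # Expected layout: SN <TAB> key: <TAB> value [<TAB> # comment]
        if len(fields) >= 3 and fields[0] == "SN":
            summary[fields[1].rstrip(":")] = _to_number(fields[2])
    return summary
```

With the file contents read into `text`, `parse_samtools_stats_summary(text)["reads mapped"]` gives the number of mapped reads.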
Mark Duplicates
GATK MarkDuplicates
By default, circdna will use GATK MarkDuplicates, which locates and tags duplicate reads in a BAM or SAM file, where duplicate reads are defined as originating from a single fragment of DNA.
Output directory: results/markduplicates/bam
- [SAMPLE].md.bam and [SAMPLE].md.bai: BAM file and index
For further reading and documentation see the data pre-processing for variant discovery from the GATK best practices.
Samtools view - Duplicates Filtering
By default, circdna removes all duplicates marked by GATK MarkDuplicates using samtools view.
Output directory: results/markduplicates/duplicates_removed
- [SAMPLE].md.filtered.sorted.bam and [SAMPLE].md.filtered.sorted.bai: BAM file and index
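Duplicates are marked in the FLAG field of each alignment record: the SAM specification reserves bit 0x400 for "PCR or optical duplicate". Removing marked duplicates therefore amounts to dropping records with that bit set. The sketch below illustrates this logic on SAM text lines. It is not the command the pipeline runs, and the exact samtools view options used by circdna are not stated here.

```python
DUPLICATE_FLAG = 0x400  # SAM FLAG bit: PCR or optical duplicate


def drop_duplicates(sam_lines):
    """Yield header lines and alignment lines whose duplicate bit is unset."""
    for line in sam_lines:
        if line.startswith("@"):
            # Header lines carry no FLAG field and are passed through.
            yield line
            continue
        flag = int(line.split("\t")[1])
        if not flag & DUPLICATE_FLAG:
            yield line
```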
MultiQC
Output files
multiqc/
- multiqc_report.html: a standalone HTML file that can be viewed in your web browser.
- multiqc_data/: directory containing parsed statistics from the different tools used in the pipeline.
- multiqc_plots/: directory containing static images from the report in various formats.
MultiQC is a visualization tool that generates a single HTML report summarising all samples in your project. Most of the pipeline QC results are visualised in the report and further statistics are available in the report data directory.
Results generated by MultiQC collate pipeline QC from supported tools e.g. FastQC. The pipeline has special steps which also allow the software versions to be reported in the MultiQC output for future traceability. For more information about how to use MultiQC reports, see http://multiqc.info.
circdna branches
Branch: circle_finder
Circle_finder
Output files
Output directory: results/circlefinder/
- [SAMPLE].microDNA-JT.txt: BED file containing information about putative circular DNA regions
Circle_finder identifies putative circular DNA junctions from paired-end sequencing data. Circle_finder uses split and discordant read information to identify junctions that could be generated through the formation of ecDNAs. For more information please see Circle_finder.
Branch: circexplorer2
CIRCexplorer2
CIRCexplorer2 identifies putative circular DNA junctions from paired-end sequencing data. CIRCexplorer2 was developed to identify circular RNAs from RNA-seq data. However, it can also be used to call putative circular DNAs from genomic data. For more information see the CIRCexplorer2 docs.
Output files
Output directory: results/circexplorer2/
- [SAMPLE].circexplorer_circdna.bed: BED file containing information about putative circular DNA regions
- [SAMPLE].CIRCexplorer2_parse.log: log file
Branch: circle_map_realign
circle_map_realign uses the functionality of Circle-Map to call putative circular DNAs from mappable regions. To identify circular DNAs it uses information about split and discordant reads and uses realignment steps to identify the exact breakpoint of the circular DNA. For more information, please see the original paper or the GitHub page.
Circle-Map Readextractor
Circle-Map Readextractor extracts read candidates for circular DNA identification.
Output files
Output directory: results/circlemap/readextractor
- [SAMPLE].qname.sorted.circular_read_candidates.bam: BAM file containing candidate reads
Circle-Map Realign
Circle-Map Realign detects putative circular DNA junctions from read candidates extracted by Circle-Map Readextractor.
Output files
Output directory: results/circlemap/realign
- [SAMPLE]_circularDNA_coordinates.bed: BED file containing information about putative circular DNA regions
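Several branches report putative circular DNA regions as BED files. The sketch below reads only the first three tab-separated columns (chromosome, start, end), which is the part every BED file shares. Any further columns are tool-specific and ignored here. It assumes the file has no column-header row, and the class and function names are illustrative.

```python
from typing import Iterator, NamedTuple


class Interval(NamedTuple):
    chrom: str
    start: int
    end: int

    @property
    def length(self) -> int:
        # BED intervals are zero-based and half-open, so length is end - start.
        return self.end - self.start


def read_bed_intervals(lines) -> Iterator[Interval]:
    """Yield an Interval for every data line of a BED file."""
    for line in lines:
        line = line.rstrip("\n")
        # Skip blank lines and the comment/track lines BED allows.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split("\t")[:3]
        yield Interval(chrom, int(start), int(end))
```

For example, `sum(iv.length for iv in read_bed_intervals(open(path)))` gives the total number of bases covered by the listed regions, counting overlaps twice.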
Branch: circle_map_repeats
Circle-Map Readextractor
Circle-Map Readextractor extracts read candidates for circular DNA identification.
Output files
Output directory: results/circlemap/readextractor
- [SAMPLE].qname.sorted.circular_read_candidates.bam: BAM file containing candidate reads
Circle-Map Repeats
Circle-Map Repeats identifies the chromosomal coordinates of repetitive circular DNAs.
Output files
Output directory: results/circlemap/repeats
- [SAMPLE]_circularDNA_repeats_coordinates.bed: BED file containing information about repetitive circular DNAs
Branch: unicycler
This branch uses Unicycler to assemble circular DNAs de novo and the long-read mapping capabilities of Minimap2 to identify the origin of the circular DNAs.
Unicycler
Unicycler was originally built as an assembly pipeline for bacterial genomes. In nf-core/circdna it is used to assemble circular DNAs de novo.
Output files
Output directory: results/unicycler/
- [SAMPLE].assembly.gfa.gz: gfa file containing the de novo assembled sequences
- [SAMPLE].assembly.scaffolds.fa.gz: fasta file containing the de novo assembled sequences, with information on whether each assembled sequence forms a circular contig.
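To list the contigs flagged as circular, the FASTA headers can be scanned. The sketch below assumes that circularity is recorded as a `circular=true` token in the header line, as in Unicycler's usual FASTA output (for example `>1 length=5386 depth=1.00x circular=true`). If the headers in your files differ, the check needs adjusting. Function names are illustrative.

```python
import gzip


def circular_contigs(fasta_lines):
    """Return the names of contigs whose header carries 'circular=true'."""
    names = []
    for line in fasta_lines:
        if line.startswith(">"):
            fields = line[1:].split()
            # fields[0] is the contig name; the rest are key=value annotations.
            if fields and "circular=true" in fields[1:]:
                names.append(fields[0])
    return names


def circular_contigs_from_gz(path):
    """Same as circular_contigs, reading a gzip-compressed FASTA file."""
    with gzip.open(path, "rt") as handle:
        return circular_contigs(handle)
```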
Minimap2
Minimap2 maps the circular DNA sequences identified by Unicycler to the given reference genome.
Output files
Output directory: results/unicycler/minimap2
- [SAMPLE].paf: paf file containing mapping information of circular DNA sequences
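PAF is a tab-separated format whose first twelve columns are fixed: query name, length, start and end, strand, target name, length, start and end, number of matching bases, alignment block length and mapping quality. The sketch below uses these columns to collect, for each assembled contig, the reference regions it maps to. The `min_mapq` filter and all names are illustrative assumptions, not settings used by the pipeline.

```python
from typing import NamedTuple


class PafRecord(NamedTuple):
    query: str
    query_len: int
    query_start: int
    query_end: int
    strand: str
    target: str
    target_len: int
    target_start: int
    target_end: int
    matches: int
    block_len: int
    mapq: int


def parse_paf_line(line):
    """Parse the twelve mandatory PAF columns; optional tags are ignored."""
    f = line.rstrip("\n").split("\t")
    return PafRecord(f[0], int(f[1]), int(f[2]), int(f[3]), f[4],
                     f[5], int(f[6]), int(f[7]), int(f[8]),
                     int(f[9]), int(f[10]), int(f[11]))


def origins_by_contig(lines, min_mapq=0):
    """Map each query contig to a list of (target, start, end) regions."""
    origins = {}
    for line in lines:
        if not line.strip():
            continue
        rec = parse_paf_line(line)
        if rec.mapq >= min_mapq:
            region = (rec.target, rec.target_start, rec.target_end)
            origins.setdefault(rec.query, []).append(region)
    return origins
```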
Branch: ampliconarchitect
The pipeline branch ampliconarchitect is only usable with WGS data. This branch uses PrepareAA to collect amplified seeds from copy number calls, which are then fed to AmpliconArchitect to characterise amplicons in each given sample.
CNVkit
CNVkit uses alignment information to make copy number calls. These copy number calls will be used by AmpliconArchitect to identify circular and other types of amplicons. The copy number calls are then connected to seeds and filtered based on the copy number threshold using scripts provided by PrepareAA.
Output files
Output directory: results/ampliconarchitect/cnvkit
- [SAMPLE]_CNV_GAIN.bed: bed file containing filtered copy number calls
- [SAMPLE]_AA_CNV_SEEDS.bed: bed file containing filtered and connected amplified regions (seeds). This is used as input for AmpliconArchitect
- [SAMPLE].cnvkit.segment.cns: cns file containing copy number calls of CNVkit segment.
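To get a feel for what filtering on a copy number threshold means, the sketch below reads a CNVkit .cns table and keeps segments whose estimated copy number reaches a cutoff. It is not the PrepareAA script. It assumes the standard tab-separated .cns header with `chromosome`, `start`, `end` and `log2` columns, converts log2 ratios with the generic formula ploidy × 2^log2 (no purity correction), and uses 4.5 purely as an illustrative threshold.

```python
import csv
import io


def amplified_segments(cns_text, min_copy_number=4.5, ploidy=2):
    """Return (chromosome, start, end, copy_number) for amplified segments.

    min_copy_number and ploidy are illustrative defaults, not values
    taken from the pipeline configuration.
    """
    reader = csv.DictReader(io.StringIO(cns_text), delimiter="\t")
    kept = []
    for row in reader:
        # A log2 ratio of 0 corresponds to the baseline ploidy.
        copy_number = ploidy * 2 ** float(row["log2"])
        if copy_number >= min_copy_number:
            kept.append((row["chromosome"], int(row["start"]),
                         int(row["end"]), round(copy_number, 2)))
    return kept
```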
AmpliconArchitect
AmpliconArchitect uses amplicon seeds provided by CNVkit and PrepareAA to identify different types of amplicons in each sample.
Output files
Output directory: results/ampliconarchitect/ampliconarchitect
- amplicons/[SAMPLE]_[AMPLICONID]_cycles.txt: txt file describing the amplicon segments
- amplicons/[SAMPLE]_[AMPLICONID]_graph.txt: txt file describing the amplicon graph
- cnseg/[SAMPLE]_[SEGMENT]_graph.txt: txt file describing the copy number segmentation file
- summary/[SAMPLE]_summary.txt: txt file describing each amplicon with regard to breakpoints, composition, oncogene content and copy number
- sv_view/[SAMPLE]_[AMPLICONID].{png,pdf}: png or pdf file displaying the amplicon rearrangement signature
AmpliconClassifier
AmpliconClassifier classifies each amplicon by using the cycles and the graph files generated by AmpliconArchitect.
Output files
Output directory: results/ampliconarchitect/ampliconclassifier
- input/[SAMPLE].AmpliconClassifier.input: txt file containing the input used for AmpliconClassifier and AmpliconSimilarity.
- classification/[SAMPLE]_amplicon_classification_profiles.tsv: tsv file describing the amplicon class of each amplicon in the sample.
- ecDNA_counts/[SAMPLE]_ecDNA_counts.tsv: tsv file describing if an amplicon is circular [1 = circular, 0 = non-circular].
- gene_list/[SAMPLE]_gene_list.tsv: tsv file detailing the genes on each amplicon.
- log/[SAMPLE].classifier_stdout.log: log file
- similarity/[SAMPLE]_similarity_scores.tsv: tsv file containing amplicon similarity scores calculated by AmpliconSimilarity.
- bed/[SAMPLE]_amplicon[AMPLICONID]_[CLASSIFICATION]_[ID]_intervals.bed: bed files containing information about the intervals on each amplicon. Intervals labelled unknown were not identified as being located on the respective amplicon.
AmpliconArchitect Summary
The Summary script merges the output of AmpliconArchitect and AmpliconClassifier to give full information about each amplicon in a sample. Please refer to AmpliconClassifier for more accurate ecDNA interval calling. Some intervals classified in the AmpliconArchitect and Summary output are not located on ecDNAs.
Output files
Output directory: results/ampliconarchitect/summary
- [SAMPLE].aa_results_summary.tsv: tsv file containing the merged results.
Pipeline information
Output files
pipeline_info/
- Reports generated by Nextflow: execution_report.html, execution_timeline.html, execution_trace.txt and pipeline_dag.dot/pipeline_dag.svg.
- Reports generated by the pipeline: pipeline_report.html, pipeline_report.txt and software_versions.yml. The pipeline_report* files will only be present if the --email / --email_on_fail parameters are used when running the pipeline.
- Reformatted samplesheet files used as input to the pipeline: samplesheet.valid.csv.
Nextflow provides excellent functionality for generating various reports relevant to the running and execution of the pipeline. This will allow you to troubleshoot errors with the running of the pipeline, and also provide you with other information such as launch commands, run times and resource usage.
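When troubleshooting, the trace file is often the quickest place to look. The sketch below assumes execution_trace.txt is a tab-separated table with a header row that includes the `name`, `status` and `exit` columns, which are part of Nextflow's default trace fields. It lists the tasks whose status is FAILED. The function name is illustrative.

```python
import csv
import io


def failed_tasks(trace_text):
    """Return (task name, exit code) for every task marked FAILED."""
    reader = csv.DictReader(io.StringIO(trace_text), delimiter="\t")
    return [(row["name"], row["exit"])
            for row in reader
            if row["status"] == "FAILED"]
```

With the contents of pipeline_info/execution_trace.txt read into a string, an empty result means no task was recorded as failed.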