2. Parse2wig
Parse2wig preprocesses an input mapfile into bin data (the number of mapped read per bin). A sample script file for the tutorial can be found in the “tutorial” directory.
2.1. Example
The command below generates the bigWig data ChIP.100.bw
and the statistics file ChIP.100.tsv
in the parse2wigdir+
directory:
$ parse2wig+ -i ChIP.bam -o ChIP --gt genometable.txt
(...)
$ ls parse2wigdir+
parse2wigdir+/ChIP.100.bw parse2wigdir+/ChIP.100.tsv
The default bin size is 100 bp. The input file format is automatically detected by postfix (.sam/.bam/.cram/.bowtie/.tagalign(.gz)).
If the detection does not work properly, add the -f
option (e.g., -f BAM
).
Note
If you are using the docker image to execute parse2wig+, it is necessary to mount the directory by -v
option to access the input files as follows:
docker run -it --rm -v $(pwd):/mnt rnakato/ssp_drompa parse2wig+ \
-i /mnt/ChIP.bam -o ChIP --odir /mnt/parse2wigdir+ --gt /mnt/genometable.txt
This command mounts the current directory to /mnt directory in the container. Please see also the document of Docker.
parse2wig+ allows multiple input files (separated by “,”):
$ parse2wig+ -i ChIP1.bam,ChIP2.bam,ChIP3.bam -o ChIP --gt genometable.txt
If you want to generate it in bedGraph format, add --outputformat 2
:
$ parse2wig+ -i ChIP.bam -o ChIP --gt genometable.txt --outputformat 2
By default, parse2wig+ omits to output bins in which the value is zero to reduce the file size. To output all bins, add --outputzero
:
$ parse2wig+ -i ChIP.bam -o ChIP --gt genometable.txt --outputzero
For bin size of 100 kbp:
$ parse2wig+ -i ChIP.bam -o ChIP --gt genometable.txt --binsize 100000
To use multiple CPUs, add -p
:
$ parse2wig+ -i ChIP.bam -o ChIP --gt genometable.txt -p 4
Note
Multithreading is activated in strand-shift profile for estimating the fragment length and GC content. When adding the
--nomodel
option and omitting--chrdir
option, multithreading will make no differece to single-core mode (-p 1
).
By default, parse2wig+ calculates quality scores using only autosomes, i.e., ‘chrN’, where N is a numeric number, in the genome_table file.
However, in some cases the chromosome names do not start with “chr”.
In such a case, add the --include_allchr
option to include all chromosomes in the genome_table file.
(See also https://github.com/rnakato/DROMPAplus/issues/8):
$ parse2wig+ -i ChIP.bam -o ChIP --gt genometable.txt -p 4 --include_allchr
2.2. Quality check
The statistics file parse2wigdir+/ChIP.100.tsv
describes the following quality values and other information of the input mapfile:
Redundancy threshold: the threshold for PCR bias
Library complexity: the non-redundant read fraction for 10 million mapped reads (add
--ncmp
to change this mapped read number). The score is put in parentheses when the number of mapped reads is insufficient.GC summit (when adding the
--GC
option): the summit of the GC distributionLength: the total length of the genome and chromosomes
Mappable base and mappability: the mappability calculated from a specified mappability file
Total reads: the number of mapped reads
Non-redundant reads: the number of reads remaining after PCR-bias filtering
Redundant reads: the number of reads filtered by PCR-bias filtering (mapped on forward and reverse and both strands are outputted)
Reads (GCnormed) (when adding the
--GC
option): the read number after GC normalizationRead depth: the expected number of the mapped reads per base pair. Namely, \(depth = n^{reads} * fragmentlength / genomelength\);
Scaling weight: the scaling weight for the read normalization
Normalized read number: the read number after the normalization
FRiP score (when adding the
--bed
option): the fraction of reads in peaks. .
2.2.1. Summarize statistics files of multiple files
parsestats4DROMPAplus.pl
in scripts direcoty can parse the statistics file generated by parse2wig+ as below:
$ parsestats4DROMPAplus.pl parse2wigdir+/ChIP.100.tsv
Sample Mapped reads + strand - strand Redundancy threshold Nonredundant Redundant Complexity for10M Read depth Genome coverage Tested_reads
ChIP 22938522 11471385 11467137 >19 21219975 (92.5%) 1718547 (7.5%) 0.968 295.746 0.998 9675745/9999144
The first raw is a header line (outputted to stderr) and the seconda one is the parsed statistics values of the data (outputted to stdout).
To summarize stats of multiple files to a single tsv file, type for instance (assume the four ChIP samples “ChIP1-ChIP4”):
$ rm stats.tsv
$ for sample in ChIP1 ChIP2 ChIP3 ChIP4; do
$ parsestats4DROMPAplus.pl parse2wigdir+/$sample.100.tsv >> stats.tsv
$ done
This summarization is useful to view a quality statistics summary by Excel or libreoffice --calc
command.
2.2.2. Fragment length estimation
For single-end data, parse2wig+ internally uses SSP to estimate the averaged fragment length and extends to it.
In default, parse2wig+ uses the longest chromosme that contains the most mappable bases to estimate fragment length. To estimate it more precisely, supply
--allchr
option to use all chromosomes (recommend: with multithreading option-p
).When the
--verbose
option is used, parse2wig+ generates .pdf and.tsv files of the strand-shift profile as SSP does. They are useful to check whether the estimated fragment length is reasonable.When the
--nomodel
option is used, parse2wig+ omits the use of SSP and extends the read to a predetermined length (150 bp by default). Add the--flen
option to change the default value.
2.2.3. Paired-end file
Add the --pair
option for paired-end files:
$ parse2wig+ --pair -i ChIP.paired.bam -o ChIP --gt genometable.txt
In the --pair
mode, the fragment length of each read pair is calculated automatically. parse2wig+ discards the read pairs that are mapped onto different chromosomes or with fragment lengths longer than 500 bp (by default). To set a specified length, add --maxins
.
Note
When parsing paired-end mapfiles with single-end mode, warning messages will be outputted.
In TagAlign format, paired-end data is not supported.
2.2.4. PCR-bias filtering
parse2wig+ filters “redundant reads” (reads starting exactly at the same 5΄ ends) as PCR bias.
This filtering step can be omitted by adding the --nofilter
option.
By default, the threshold of filtering is defined as:
threshold = max(1, 10 * E_genome)
where Egenome is the averaged read depth.
10 * Egenome can be greater than 1 for a small genome (e.g., yeast).
Additionally, the --thre_pb
option can be used to fix this threshold.
2.2.5. Multiple-mapped reads
parse2wig+ recognizes the uniquely mapped and multiple-mapped reads using an NH flag in SAM/BAM/CRAM format. For multiple-mapped reads, each mapped locus is weighted equally.
Some mapping tools (e.g., Bowtie and BWA) do not output the NH column. In this case, all reads are considered as uniquely mapped reads.
2.3. Total read normalization
parse2wig+ utilizes the -n
option to normalize the read distribution based on the number of nonredundant reads.
-n NONE (default); not normalized
-n GR; for the whole genome, read number
-n GD; for the whole genome, read depth
-n CR; for each chromosome, read number
-n CD; for each chromosome, read depth
-n GR
is recommended as the typical total read normalization.
If the mapped read number is quite different among chromosomes (e.g., mapfile contains chrX only), consider using -n CR
.
Additionally, use the --nrpm
option to change the read number after normalization (default: 20 million).
For example, the command below scales the bin data so that the total number of non-redundant reads is 10 million:
$ parse2wig+ -i sample.sam -o sample --gt genometable.txt -n GR --nrpm 10000000
Note
Scaling up a small number of reads (e.g., 1 million → 10 million) is not recommended because it increases the background noise.
2.4. High resolution with central regions of fragments
When high resolution is required (e.g., nucleosome-seq), consider using the --rcenter
option that focuses on the central region of each fragment.
For example, the command below considers only 50 bp around the center of each fragment:
$ parse2wig+ -i ChIP.bam -o ChIP --gt genometable.txt --rcenter 50
2.5. Mappability information
parse2wig+ utilizes three types of genome mappability information. See the Appendix for details on generating the mappability files.
2.5.1. Mappable chromosome length
With the --mptable
option, parse2wig+ considers the number of mappable bases as the genome/chromosome length.:
$ parse2wig+ -i ChIP.bam -o ChIP --gt genometable.txt --mptable mptable.txt
The mappability files for several species are available in the “data/mptable” directory. When --mptable
is not supplied, all bases are considered mappable.
2.5.2. Base-pair level mappability
To precisely calculate the genome coverage and/or GC content distribution in base-pair resolution, add the --mpdir
option as follows:
$ parse2wig+ -i ChIP.bam -o ChIP --gt genometable.txt --mpdir <mpdir>
where <mpdir> indicates the directory that contains the gzipped binary mappability files (map_chr*_binary.txt.gz). The mappability files for several species are available on our Google Drive account.
2.5.3. Bin-level mappability
When adding the --mpdir
option, parse2wig+ automatically generates the bin-level mappability files (map_chr*.<binsize>.wig.gz). These files are used to normalize the wig data based on the mappability. The bins with mappability lower than the threshold (--mpthre
option, < 0.3 by default) are excluded from the mappability normalization (and GC normalization).
2.6. GC content estimation
parse2wig+ can estimate and normalize the GC content of the mapped reads as follows:
$ parse2wig+ -i ChIP.bam -o ChIP --gt genometable.txt \
--chrdir <chromosomedir>
where the --chrdir
option indicates the directory of the reference chromosome FASTA files.
<chromosomedir> is the directory containing the FASTA files of all chromosomes described in genometable.txt
with corresponding filenames.
For example, if chr1
is in genometable.txt
, chr1.fa
should be in <chromosomedir>.
parse2wig+ uses the longest chromosome described in mptable.txt
or genometable.txt
for the GC content estimation.
In GC content estimation, parse2wig+ considers 120 bp except for 5 bases of 5΄ edge (i.e., from 6 bp to 125 bp for each fragment) because the 5΄ edge often contains a biased GC distribution. Use --flen4gc
to change the length to be considered.
2.6.1. GC stats file
The abovementioned command outputs the GC distribution file “ChIP.GCdist.tsv” in the output directory (parse2wig+dir). Using this GC distribution file, the user can draw the GC contents/weight distribution of the input file and the genome sequence, as shown in Fig. 2.1.

Fig. 2.1 The GC percentage as a function of the proportion and scaling weight.
The contents are the following:
GC: the GC content;
genome prop: the proportion of the mappable bases containing the GC contents. Then, \(prop^{genome}_{GC} = n^{genome}_{GC}/G\), where \(n^{genome}_{GC}\) is the number of positions containing the GC contents and \(G\) is the total number of mappable bases;
read prop: the proportion of the reads (fragments) containing the GC contents. Then \(prop^{reads}_{GC} = n^{reads}_{GC}/N\), where \(n^{reads}_{GC}\) is the number of reads containing the GC contents and \(N\) is the total number of mapped reads;
depth: the ratio of the GC contents between reads and genome sequence; namely, \(depth_{GC} = n^{reads}_{GC}/n^{genome}_{GC}\);
Scaling weight: the ratio of the proportion between reads and genome sequence; namely, \(weight = prop^{genome}_{GC}/prop^{reads}_{GC}\);
Note: because the weight estimated from very low \(depth_{GC}\) causes false-positive peaks, by default parse2wig+ sets a weight of 1 to the GC content with \(depth_{GC}\) less than 0.001 and a weight of 0 to the GC content with \(prop^{genome}_{GC}\) less than 0.00001. The former threshold is ignored when adding the
--gcdepthoff
option.
The summit of the GC content distribution for reads (orange line, GC% = 61 in Fig. 2.1) is important for assessing the GC bias. This score is also outputted in the stats file (e.g., H3K4me3.100.tsv).
2.6.2. GC normalization
When adding the --chrdir
option, the output wig data describes the read distribution normalized by the GC contents, where each read is scaled based on its GC content. However, it should be noted that GC normalization often overcorrects the true read signals. When samples have a different GC distribution compared with other samples, it is preferable to re-prepare them rather than use them with GC normalization.