2. Parse2wig

Parse2wig preprocesses an input mapfile into bin data (the number of mapped read per bin). A sample script file for the tutorial can be found in the “tutorial” directory.

2.1. Example

The command below generates the bigWig data ChIP.100.bw and the statistics file ChIP.100.tsv in the parse2wigdir+ directory:

$ parse2wig+ -i ChIP.bam -o ChIP --gt genometable.txt
(...)
$ ls parse2wigdir+
parse2wigdir+/ChIP.100.bw   parse2wigdir+/ChIP.100.tsv

The default bin size is 100 bp. The input file format is automatically detected by postfix (.sam/.bam/.cram/.bowtie/.tagalign(.gz)). If the detection does not work properly, add the -f option (e.g., -f BAM).

Note

If you are using the docker image to execute parse2wig+, it is necessary to mount the directory by -v option to access the input files as follows:

docker run -it --rm -v $(pwd):/mnt rnakato/ssp_drompa parse2wig+ \
       -i /mnt/ChIP.bam -o ChIP --odir /mnt/parse2wigdir+ --gt /mnt/genometable.txt

This command mounts the current directory to /mnt directory in the container. Please see also the document of Docker.

parse2wig+ allows multiple input files (separated by “,”):

$ parse2wig+ -i ChIP1.bam,ChIP2.bam,ChIP3.bam -o ChIP --gt genometable.txt

If you want to generate it in bedGraph format, add --outputformat 2:

$ parse2wig+ -i ChIP.bam -o ChIP --gt genometable.txt --outputformat 2

By default, parse2wig+ omits to output bins in which the value is zero to reduce the file size. To output all bins, add --outputzero:

$ parse2wig+ -i ChIP.bam -o ChIP --gt genometable.txt --outputzero

For bin size of 100 kbp:

$ parse2wig+ -i ChIP.bam -o ChIP --gt genometable.txt --binsize 100000

To use multiple CPUs, add -p:

$ parse2wig+ -i ChIP.bam -o ChIP --gt genometable.txt -p 4

Note

  • Multithreading is activated in strand-shift profile for estimating the fragment length and GC content. When adding the --nomodel option and omitting --chrdir option, multithreading will make no differece to single-core mode (-p 1).

By default, parse2wig+ calculates quality scores using only autosomes, i.e., ‘chrN’, where N is a numeric number, in the genome_table file. However, in some cases the chromosome names do not start with “chr”. In such a case, add the --include_allchr option to include all chromosomes in the genome_table file. (See also https://github.com/rnakato/DROMPAplus/issues/8):

$ parse2wig+ -i ChIP.bam -o ChIP --gt genometable.txt -p 4 --include_allchr

2.2. Quality check

The statistics file parse2wigdir+/ChIP.100.tsv describes the following quality values and other information of the input mapfile:

  • Redundancy threshold: the threshold for PCR bias

  • Library complexity: the non-redundant read fraction for 10 million mapped reads (add --ncmp to change this mapped read number). The score is put in parentheses when the number of mapped reads is insufficient.

  • GC summit (when adding the --GC option): the summit of the GC distribution

  • Length: the total length of the genome and chromosomes

  • Mappable base and mappability: the mappability calculated from a specified mappability file

  • Total reads: the number of mapped reads

  • Non-redundant reads: the number of reads remaining after PCR-bias filtering

  • Redundant reads: the number of reads filtered by PCR-bias filtering (mapped on forward and reverse and both strands are outputted)

  • Reads (GCnormed) (when adding the --GC option): the read number after GC normalization

  • Read depth: the expected number of the mapped reads per base pair. Namely, \(depth = n^{reads} * fragmentlength / genomelength\);

  • Scaling weight: the scaling weight for the read normalization

  • Normalized read number: the read number after the normalization

  • FRiP score (when adding the --bed option): the fraction of reads in peaks. .

2.2.1. Summarize statistics files of multiple files

parsestats4DROMPAplus.pl in scripts direcoty can parse the statistics file generated by parse2wig+ as below:

  $ parsestats4DROMPAplus.pl parse2wigdir+/ChIP.100.tsv
  Sample      Mapped reads     + strand        - strand       Redundancy threshold    Nonredundant    Redundant       Complexity for10M       Read depth      Genome coverage Tested_reads
ChIP  22938522        11471385        11467137        >19     21219975 (92.5%)        1718547 (7.5%)  0.968   295.746 0.998   9675745/9999144

The first raw is a header line (outputted to stderr) and the seconda one is the parsed statistics values of the data (outputted to stdout).

To summarize stats of multiple files to a single tsv file, type for instance (assume the four ChIP samples “ChIP1-ChIP4”):

$ rm stats.tsv
$ for sample in ChIP1 ChIP2 ChIP3 ChIP4; do
$    parsestats4DROMPAplus.pl parse2wigdir+/$sample.100.tsv >> stats.tsv
$ done

This summarization is useful to view a quality statistics summary by Excel or libreoffice --calc command.

2.2.2. Fragment length estimation

  • For single-end data, parse2wig+ internally uses SSP to estimate the averaged fragment length and extends to it.

  • In default, parse2wig+ uses the longest chromosme that contains the most mappable bases to estimate fragment length. To estimate it more precisely, supply --allchr option to use all chromosomes (recommend: with multithreading option -p).

  • When the --verbose option is used, parse2wig+ generates .pdf and.tsv files of the strand-shift profile as SSP does. They are useful to check whether the estimated fragment length is reasonable.

  • When the --nomodel option is used, parse2wig+ omits the use of SSP and extends the read to a predetermined length (150 bp by default). Add the --flen option to change the default value.

2.2.3. Paired-end file

Add the --pair option for paired-end files:

$ parse2wig+ --pair -i ChIP.paired.bam -o ChIP --gt genometable.txt

In the --pair mode, the fragment length of each read pair is calculated automatically. parse2wig+ discards the read pairs that are mapped onto different chromosomes or with fragment lengths longer than 500 bp (by default). To set a specified length, add --maxins.

Note

  • When parsing paired-end mapfiles with single-end mode, warning messages will be outputted.

  • In TagAlign format, paired-end data is not supported.

2.2.4. PCR-bias filtering

parse2wig+ filters “redundant reads” (reads starting exactly at the same 5΄ ends) as PCR bias. This filtering step can be omitted by adding the --nofilter option.

By default, the threshold of filtering is defined as:

threshold = max(1, 10 * E_genome)

where Egenome is the averaged read depth. 10 * Egenome can be greater than 1 for a small genome (e.g., yeast). Additionally, the --thre_pb option can be used to fix this threshold.

2.2.5. Multiple-mapped reads

parse2wig+ recognizes the uniquely mapped and multiple-mapped reads using an NH flag in SAM/BAM/CRAM format. For multiple-mapped reads, each mapped locus is weighted equally.

Some mapping tools (e.g., Bowtie and BWA) do not output the NH column. In this case, all reads are considered as uniquely mapped reads.

2.3. Total read normalization

parse2wig+ utilizes the -n option to normalize the read distribution based on the number of nonredundant reads.

  • -n NONE (default); not normalized

  • -n GR; for the whole genome, read number

  • -n GD; for the whole genome, read depth

  • -n CR; for each chromosome, read number

  • -n CD; for each chromosome, read depth

-n GR is recommended as the typical total read normalization. If the mapped read number is quite different among chromosomes (e.g., mapfile contains chrX only), consider using -n CR. Additionally, use the --nrpm option to change the read number after normalization (default: 20 million).

For example, the command below scales the bin data so that the total number of non-redundant reads is 10 million:

$ parse2wig+ -i sample.sam -o sample --gt genometable.txt -n GR --nrpm 10000000

Note

Scaling up a small number of reads (e.g., 1 million → 10 million) is not recommended because it increases the background noise.

2.4. High resolution with central regions of fragments

When high resolution is required (e.g., nucleosome-seq), consider using the --rcenter option that focuses on the central region of each fragment. For example, the command below considers only 50 bp around the center of each fragment:

$ parse2wig+ -i ChIP.bam -o ChIP --gt genometable.txt --rcenter 50

2.5. Mappability information

parse2wig+ utilizes three types of genome mappability information. See the Appendix for details on generating the mappability files.

2.5.1. Mappable chromosome length

With the --mptable option, parse2wig+ considers the number of mappable bases as the genome/chromosome length.:

$ parse2wig+ -i ChIP.bam -o ChIP --gt genometable.txt --mptable mptable.txt

The mappability files for several species are available in the “data/mptable” directory. When --mptable is not supplied, all bases are considered mappable.

2.5.2. Base-pair level mappability

To precisely calculate the genome coverage and/or GC content distribution in base-pair resolution, add the --mpdir option as follows:

$ parse2wig+ -i ChIP.bam -o ChIP --gt genometable.txt --mpdir <mpdir>

where <mpdir> indicates the directory that contains the gzipped binary mappability files (map_chr*_binary.txt.gz). The mappability files for several species are available on our Google Drive account.

2.5.3. Bin-level mappability

When adding the --mpdir option, parse2wig+ automatically generates the bin-level mappability files (map_chr*.<binsize>.wig.gz). These files are used to normalize the wig data based on the mappability. The bins with mappability lower than the threshold (--mpthre option, < 0.3 by default) are excluded from the mappability normalization (and GC normalization).

2.6. GC content estimation

parse2wig+ can estimate and normalize the GC content of the mapped reads as follows:

$ parse2wig+ -i ChIP.bam -o ChIP --gt genometable.txt \
  --chrdir <chromosomedir>

where the --chrdir option indicates the directory of the reference chromosome FASTA files. <chromosomedir> is the directory containing the FASTA files of all chromosomes described in genometable.txt with corresponding filenames. For example, if chr1 is in genometable.txt, chr1.fa should be in <chromosomedir>. parse2wig+ uses the longest chromosome described in mptable.txt or genometable.txt for the GC content estimation.

In GC content estimation, parse2wig+ considers 120 bp except for 5 bases of 5΄ edge (i.e., from 6 bp to 125 bp for each fragment) because the 5΄ edge often contains a biased GC distribution. Use --flen4gc to change the length to be considered.

2.6.1. GC stats file

The abovementioned command outputs the GC distribution file “ChIP.GCdist.tsv” in the output directory (parse2wig+dir). Using this GC distribution file, the user can draw the GC contents/weight distribution of the input file and the genome sequence, as shown in Fig. 2.1.

Alternate

Fig. 2.1 The GC percentage as a function of the proportion and scaling weight.

The contents are the following:

  • GC: the GC content;

  • genome prop: the proportion of the mappable bases containing the GC contents. Then, \(prop^{genome}_{GC} = n^{genome}_{GC}/G\), where \(n^{genome}_{GC}\) is the number of positions containing the GC contents and \(G\) is the total number of mappable bases;

  • read prop: the proportion of the reads (fragments) containing the GC contents. Then \(prop^{reads}_{GC} = n^{reads}_{GC}/N\), where \(n^{reads}_{GC}\) is the number of reads containing the GC contents and \(N\) is the total number of mapped reads;

  • depth: the ratio of the GC contents between reads and genome sequence; namely, \(depth_{GC} = n^{reads}_{GC}/n^{genome}_{GC}\);

  • Scaling weight: the ratio of the proportion between reads and genome sequence; namely, \(weight = prop^{genome}_{GC}/prop^{reads}_{GC}\);

    • Note: because the weight estimated from very low \(depth_{GC}\) causes false-positive peaks, by default parse2wig+ sets a weight of 1 to the GC content with \(depth_{GC}\) less than 0.001 and a weight of 0 to the GC content with \(prop^{genome}_{GC}\) less than 0.00001. The former threshold is ignored when adding the --gcdepthoff option.

The summit of the GC content distribution for reads (orange line, GC% = 61 in Fig. 2.1) is important for assessing the GC bias. This score is also outputted in the stats file (e.g., H3K4me3.100.tsv).

2.6.2. GC normalization

When adding the --chrdir option, the output wig data describes the read distribution normalized by the GC contents, where each read is scaled based on its GC content. However, it should be noted that GC normalization often overcorrects the true read signals. When samples have a different GC distribution compared with other samples, it is preferable to re-prepare them rather than use them with GC normalization.