Parse2wig
============

Parse2wig preprocesses an input mapfile into bin data (the number of mapped read per bin). A sample script file for the tutorial can be found in the "tutorial" directory.


Example
-------------------------------

The command below generates the bigWig data ``ChIP.100.bw`` and the statistics file ``ChIP.100.tsv`` in the ``parse2wigdir+`` directory::

  $ parse2wig+ -i ChIP.bam -o ChIP --gt genometable.txt
  (...)
  $ ls parse2wigdir+
  parse2wigdir+/ChIP.100.bw   parse2wigdir+/ChIP.100.tsv

The default bin size is 100 bp. The input file format is automatically detected by postfix (.sam/.bam/.cram/.bowtie/.tagalign(.gz)).
If the detection does not work properly, add the ``-f`` option (e.g., ``-f BAM``).

.. note::

    If you are using the docker image to execute parse2wig+, it is necessary to mount the directory by ``-v`` option to access the input files as follows::

        docker run -it --rm -v $(pwd):/mnt rnakato/ssp_drompa parse2wig+ \
               -i /mnt/ChIP.bam -o ChIP --odir /mnt/parse2wigdir+ --gt /mnt/genometable.txt

    This command mounts the current directory to /mnt directory in the container. 
    Please see also `the document of Docker <https://docs.docker.com/storage/volumes/>`_.

parse2wig+ allows multiple input files (separated by ",")::

  $ parse2wig+ -i ChIP1.bam,ChIP2.bam,ChIP3.bam -o ChIP --gt genometable.txt

If you want to generate it in bedGraph format, add ``--outputformat 2``::

  $ parse2wig+ -i ChIP.bam -o ChIP --gt genometable.txt --outputformat 2

By default, parse2wig+ omits to output bins in which the value is zero to reduce the file size. To output all bins, add ``--outputzero``::

  $ parse2wig+ -i ChIP.bam -o ChIP --gt genometable.txt --outputzero

For bin size of 100 kbp::

  $ parse2wig+ -i ChIP.bam -o ChIP --gt genometable.txt --binsize 100000

To use multiple CPUs, add ``-p``::

  $ parse2wig+ -i ChIP.bam -o ChIP --gt genometable.txt -p 4

.. note::

    * Multithreading is activated in strand-shift profile for estimating the fragment length and GC content. When adding the ``--nomodel`` option and omitting ``--chrdir`` option, multithreading will make no differece to single-core mode (``-p 1``).

By default, parse2wig+ calculates quality scores using only autosomes, i.e., 'chrN', where N is a numeric number, in the genome_table file.
However, in some cases the chromosome names do not start with "chr".
In such a case, add the ``--include_allchr`` option to include all chromosomes in the genome_table file.
(See also https://github.com/rnakato/DROMPAplus/issues/8)::

  $ parse2wig+ -i ChIP.bam -o ChIP --gt genometable.txt -p 4 --include_allchr

Quality check
------------------------

The statistics file ``parse2wigdir+/ChIP.100.tsv`` describes the following quality values and other information of the input mapfile:

- **Redundancy threshold**: the threshold for PCR bias
- **Library complexity**: the non-redundant read fraction for 10 million mapped reads (add ``--ncmp`` to change this mapped read number). The score is put in parentheses when the number of mapped reads is insufficient.
- **GC summit** (when adding the ``--GC`` option): the summit of the GC distribution
- **Length**: the total length of the genome and chromosomes
- **Mappable base and mappability**: the mappability calculated from a specified mappability file
- **Total reads**: the number of mapped reads
- **Non-redundant reads**: the number of reads remaining after PCR-bias filtering
- **Redundant reads**: the number of reads filtered by PCR-bias filtering (mapped on forward and reverse and both strands are outputted)
- **Reads (GCnormed)** (when adding the ``--GC`` option): the read number after GC normalization
- **Read depth**: the expected number of the mapped reads per base pair. Namely, :math:`depth = n^{reads} * fragmentlength / genomelength`;
- **Scaling weight**: the scaling weight for the read normalization
- **Normalized read number**: the read number after the normalization
- **FRiP score** (when adding the ``--bed`` option): the fraction of reads in peaks. .

Summarize statistics files of multiple files
+++++++++++++++++++++++++++++++++++++++++++++++++++++

``parsestats4DROMPAplus.pl`` in `scripts` direcoty can parse the statistics file generated by parse2wig+ as below::

    $ parsestats4DROMPAplus.pl parse2wigdir+/ChIP.100.tsv
    Sample	Mapped reads	 + strand	 - strand	Redundancy threshold	Nonredundant	Redundant	Complexity for10M	Read depth	Genome coverage	Tested_reads
  ChIP	22938522	11471385	11467137	>19	21219975 (92.5%)	1718547 (7.5%)	0.968	295.746	0.998	9675745/9999144

The first raw is a header line (outputted to stderr) and the seconda one is the parsed statistics values of the data (outputted to stdout).

To summarize stats of multiple files to a single tsv file, type for instance (assume the four ChIP samples "ChIP1-ChIP4")::

    $ rm stats.tsv
    $ for sample in ChIP1 ChIP2 ChIP3 ChIP4; do
    $    parsestats4DROMPAplus.pl parse2wigdir+/$sample.100.tsv >> stats.tsv
    $ done

This summarization is useful to view a quality statistics summary by Excel or ``libreoffice --calc`` command.

Fragment length estimation
+++++++++++++++++++++++++++++++++++

- For single-end data, **parse2wig+** internally uses `SSP <https://github.com/rnakato/SSP>`_ to estimate the averaged fragment length and extends to it.
- In default, **parse2wig+** uses the longest chromosme that contains the most mappable bases to estimate fragment length. To estimate it more precisely, supply ``--allchr`` option to use all chromosomes (recommend: with multithreading option ``-p``).
- When the ``--verbose`` option is used, **parse2wig+** generates .pdf and.tsv files of the strand-shift profile as SSP does. They are useful to check whether the estimated fragment length is reasonable.
- When the ``--nomodel`` option is used, **parse2wig+** omits the use of SSP and extends the read to a predetermined length (150 bp by default). Add the ``--flen`` option to change the default value.

Paired-end file
+++++++++++++++++++++++++++++++++++

Add the ``--pair`` option for paired-end files::

  $ parse2wig+ --pair -i ChIP.paired.bam -o ChIP --gt genometable.txt

In the ``--pair`` mode, the fragment length of each read pair is calculated automatically. parse2wig+ discards the read pairs that are mapped onto different chromosomes or with fragment lengths longer than 500 bp (by default). To set a specified length, add ``--maxins``.

.. note::

   * When parsing paired-end mapfiles with single-end mode, warning messages will be outputted.
   * In TagAlign format, paired-end data is not supported.

PCR-bias filtering
++++++++++++++++++++++

parse2wig+ filters "redundant reads" (reads starting exactly at the same 5΄ ends) as PCR bias.
This filtering step can be omitted by adding the ``--nofilter`` option.

By default, the threshold of filtering is defined as::

	threshold = max(1, 10 * E_genome)

where E\ :sub:`genome`\  is the averaged read depth.
10 * E\ :sub:`genome`\  can be greater than 1 for a small genome (e.g., yeast).
Additionally, the ``--thre_pb`` option can be used to fix this threshold.


Multiple-mapped reads
++++++++++++++++++++++++++++++

parse2wig+ recognizes the uniquely mapped and multiple-mapped reads using an NH flag in SAM/BAM/CRAM format. For multiple-mapped reads, each mapped locus is weighted equally.

Some mapping tools (e.g., Bowtie and BWA) do not output the NH column. In this case, all reads are considered as uniquely mapped reads.

Total read normalization
---------------------------------

parse2wig+ utilizes the ``-n`` option to normalize the read distribution based on the number of nonredundant reads.

* **-n NONE** (default); not normalized
* **-n GR**; for the whole genome, read number
* **-n GD**; for the whole genome, read depth
* **-n CR**; for each chromosome, read number
* **-n CD**; for each chromosome, read depth

``-n GR`` is recommended as the typical total read normalization.
If the mapped read number is quite different among chromosomes (e.g., mapfile contains chrX only), consider using ``-n CR``.
Additionally, use the ``--nrpm`` option to change the read number after normalization (default: 20 million).

For example, the command below scales the bin data so that the total number of non-redundant reads is 10 million::

    $ parse2wig+ -i sample.sam -o sample --gt genometable.txt -n GR --nrpm 10000000

.. note::

       Scaling up a small number of reads (e.g., 1 million → 10 million) is not recommended because it increases the background noise.

High resolution with central regions of fragments
-------------------------------------------------------------

When high resolution is required (e.g., nucleosome-seq), consider using the ``--rcenter`` option that focuses on the central region of each fragment.
For example, the command below considers only 50 bp around the center of each fragment::

  $ parse2wig+ -i ChIP.bam -o ChIP --gt genometable.txt --rcenter 50

Mappability information
-----------------------------------------

parse2wig+ utilizes three types of genome mappability information.
See the :doc:`Appendix` for details on generating the mappability files.

Mappable chromosome length
+++++++++++++++++++++++++++++

With the ``--mptable`` option, parse2wig+ considers the number of mappable bases as the genome/chromosome length.::

  $ parse2wig+ -i ChIP.bam -o ChIP --gt genometable.txt --mptable mptable.txt

The mappability files for several species are available in the "data/mptable" directory. When ``--mptable`` is not supplied, all bases are considered mappable.


Base-pair level mappability
+++++++++++++++++++++++++++++

To precisely calculate the genome coverage and/or GC content distribution in base-pair resolution, add the ``--mpdir`` option as follows::

  $ parse2wig+ -i ChIP.bam -o ChIP --gt genometable.txt --mpdir <mpdir>

where <mpdir> indicates the directory that contains the gzipped binary mappability files (**map_chr*_binary.txt.gz**).
The mappability files for several species are available on our `Google Drive account <https://drive.google.com/drive/folders/1GfKZkq3HIcMLQt-pZ_4bfwh21NyS2O-5?usp=sharing>`_.

Bin-level mappability
+++++++++++++++++++++++++++++

When adding the ``--mpdir`` option, parse2wig+ automatically generates the bin-level mappability files (**map_chr*.<binsize>.wig.gz**). These files are used to normalize the wig data based on the mappability. The bins with mappability lower than the threshold (``--mpthre`` option, < 0.3 by default) are excluded from the mappability normalization (and GC normalization).

GC content estimation
------------------------------

parse2wig+ can estimate and normalize the GC content of the mapped reads as follows::

  $ parse2wig+ -i ChIP.bam -o ChIP --gt genometable.txt \
    --chrdir <chromosomedir>

where the ``--chrdir`` option indicates the directory of the reference chromosome FASTA files.
<chromosomedir> is the directory containing the FASTA files of all chromosomes described in ``genometable.txt`` with corresponding filenames.
For example, if ``chr1`` is in ``genometable.txt``, ``chr1.fa`` should be in <chromosomedir>.
parse2wig+ uses the longest chromosome described in ``mptable.txt`` or ``genometable.txt`` for the GC content estimation.

In GC content estimation, parse2wig+ considers 120 bp except for 5 bases of 5΄ edge (i.e., from 6 bp to 125 bp for each fragment) because the 5΄ edge often contains a biased GC distribution. Use ``--flen4gc`` to change the length to be considered.

GC stats file
+++++++++++++++++++++

The abovementioned command outputs the GC distribution file "ChIP.GCdist.tsv" in the output directory (*parse2wig+dir*).
Using this GC distribution file, the user can draw the GC contents/weight distribution of the input file and the genome sequence, as shown in Fig. 2.1.

.. figure:: img/GCdist.H3K4me3.jpg
   :width: 500px
   :align: center
   :alt: Alternate

   The GC percentage as a function of the proportion and scaling weight.

The contents are the following:

- GC: the GC content;
- genome prop: the proportion of the mappable bases containing the GC contents. Then, :math:`prop^{genome}_{GC} = n^{genome}_{GC}/G`, where :math:`n^{genome}_{GC}` is the number of positions containing the GC contents and :math:`G` is the total number of mappable bases;
- read prop: the proportion of the reads (fragments) containing the GC contents. Then :math:`prop^{reads}_{GC} = n^{reads}_{GC}/N`, where :math:`n^{reads}_{GC}` is the number of reads containing the GC contents and :math:`N` is the total number of mapped reads;
- depth: the ratio of the GC contents between reads and genome sequence; namely, :math:`depth_{GC} = n^{reads}_{GC}/n^{genome}_{GC}`;
- Scaling weight: the ratio of the proportion between reads and genome sequence; namely, :math:`weight = prop^{genome}_{GC}/prop^{reads}_{GC}`;

      - Note: because the weight estimated from very low :math:`depth_{GC}` causes false-positive peaks, by default parse2wig+ sets a weight of 1 to the GC content with :math:`depth_{GC}` less than 0.001 and a weight of 0 to the GC content with :math:`prop^{genome}_{GC}` less than 0.00001. The former threshold is ignored when adding the ``--gcdepthoff`` option.

The summit of the GC content distribution for reads (orange line, GC% = 61 in Fig. 2.1) is important for assessing the GC bias. This score is also outputted in the stats file (e.g., H3K4me3.100.tsv).


GC normalization
+++++++++++++++++++++++++++++++++

When adding the ``--chrdir`` option, the output wig data describes the read distribution normalized by the GC contents, where each read is scaled based on its GC content. However, it should be noted that GC normalization often overcorrects the true read signals. When samples have a different GC distribution compared with other samples, it is preferable to re-prepare them rather than use them with GC normalization.