1. Overview

1.1. Subcommand descriptions

cobind is a Python package designed to quantify the "overlap" or "collocation" of genomic intervals.

Subcommands provided by cobind

Subcommand    Description
overlap       Calculate the collocation coefficient (C).
jaccard       Calculate the Jaccard similarity coefficient (J).
dice          Calculate the Sørensen–Dice coefficient (SD).
simpson       Calculate the Szymkiewicz–Simpson coefficient (SS).
pmi           Calculate the pointwise mutual information (PMI).
npmi          Calculate the normalized pointwise mutual information (NPMI).
cooccur       Evaluate whether two sets of genomic regions overlap significantly.
covary        Calculate the covariance of binding intensities between two sets of genomic intervals.
srog          Report the code of Spatial Relation Of Genomic (SROG) regions.
stat          Wrapper function. Calculate C, J, SD, SS, PMI, and NPMI.
zscore        Calculate the overall Z-score of C, J, SD, SS, PMI, and NPMI.
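The srog subcommand reports one of six codes: 'disjoint', 'touch', 'equal', 'overlap', 'contain' or 'within'. The following is a minimal sketch of how such a classification could work for a single pair of intervals. It is not cobind's implementation: the function name srog_code, the (chrom, start, end) tuple layout, the half-open BED coordinates and the exact boundary rules are all assumptions made for illustration.

```python
def srog_code(a, b):
    """Classify interval a relative to interval b.

    Hypothetical helper, not part of cobind. Each interval is a
    (chrom, start, end) tuple in half-open BED coordinates (assumption).
    """
    chrom_a, start_a, end_a = a
    chrom_b, start_b, end_b = b

    # Intervals on different chromosomes never share a position.
    if chrom_a != chrom_b:
        return "disjoint"
    if (start_a, end_a) == (start_b, end_b):
        return "equal"
    # A gap of at least one base separates the two intervals.
    if end_a < start_b or end_b < start_a:
        return "disjoint"
    # Adjacent intervals: one ends exactly where the other starts.
    if end_a == start_b or end_b == start_a:
        return "touch"
    # a spans all of b.
    if start_a <= start_b and end_b <= end_a:
        return "contain"
    # a lies entirely inside b.
    if start_b <= start_a and end_a <= end_b:
        return "within"
    # Remaining case: the intervals share only part of their length.
    return "overlap"


print(srog_code(("chr1", 0, 10), ("chr1", 5, 15)))
```

Under these assumed rules, "contain" and "within" are mirror images: swapping the two arguments turns one into the other.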

1.2. Usage

Print all available subcommands and their descriptions:

cobind.py -h or cobind.py --help

usage: cobind.py [-h] [-v]
                 {overlap,jaccard,dice,simpson,pmi,npmi,cooccur,covary,srog,stat,zscore}
                 ...

**cobind: collocation analyses of genomic regions**

positional arguments:
  {overlap,jaccard,dice,simpson,pmi,npmi,cooccur,covary,srog,stat,zscore}
                        Sub-command description:
    overlap             Calculate the collocation coefficient (C) between two
                        sets of genomic regions. C = |A and B| /
                        (|A|*|B|)**0.5
    jaccard             Calculate the Jaccard similarity coefficient (J)
                        between two sets of genomic regions. J = |A and B| /
                        |A or B|
    dice                Calculate the Sørensen–Dice coefficient (SD) between
                        two sets of genomic regions. SD = 2*|A and B| / (|A| +
                        |B|)
    simpson             Calculate the Szymkiewicz–Simpson coefficient (SS)
                        between two sets of genomic regions. SS = |A and B| /
                        min(|A|, |B|)
    pmi                 Calculate the pointwise mutual information (PMI)
                        between two sets of genomic regions. PMI = log(p(|A
                        and B|)) - log(p(|A|)) - log(p(|B|))
    npmi                Calculate the normalized pointwise mutual information
                        (NPMI) between two sets of genomic regions. NPMI =
                        log(p(|A|)*p(|B|)) / log(p(|A and B|)) - 1
    cooccur             Evaluate if two sets of genomic regions are
                        significantly co-occurred in given background regions.
    covary              Calculate the covariance (Pearson, Spearman and
                        Kendall coefficients) of binding intensities between
                        two sets of genomic regions.
    srog                Report the code of Spatial Relation Of Genomic (SROG)
                        regions. SROG codes include
                        'disjoint','touch','equal','overlap', 'contain',
                        'within'.
    stat                Wrapper function. Report basic statistics of genomic
                        regions, and calculate overlapping measurements
                        (including "C", "J", "SD", "SS", "PMI", "NPMI"), without
                        bootstrap resampling or generating peakwise
                        measurements.
    zscore              Calculate Z-score of six overlapping measurements
                        including ("C", "J", "SD", "SS", "PMI", "NPMI"),
                        to provide an overall measurement of the
                        collocation strength.

options:
  -h, --help            show this help message and exit
  -v, --version         show program's version number and exit
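The formulas in the help message above can be evaluated directly once the three set sizes are known. The sketch below is illustrative only and is not taken from cobind's source. It assumes that |A|, |B| and |A and B| are sizes in bases, and that p(|X|) is the size divided by a background size (1.4 Gb is used as the default, matching the -b option of the overlap subcommand). The function name collocation_measures and its parameters are hypothetical.

```python
import math


def collocation_measures(size_a, size_b, size_overlap, background=1_400_000_000):
    """Evaluate C, J, SD, SS, PMI and NPMI from set sizes.

    Hypothetical helper, not part of cobind. Sizes are assumed to be
    numbers of bases; probabilities are sizes divided by `background`.
    """
    if min(size_a, size_b, size_overlap) <= 0:
        raise ValueError("all sizes must be positive (the logs need p > 0)")

    size_union = size_a + size_b - size_overlap  # |A or B|
    p_a = size_a / background
    p_b = size_b / background
    p_ab = size_overlap / background

    return {
        "C": size_overlap / math.sqrt(size_a * size_b),
        "J": size_overlap / size_union,
        "SD": 2 * size_overlap / (size_a + size_b),
        "SS": size_overlap / min(size_a, size_b),
        "PMI": math.log(p_ab) - math.log(p_a) - math.log(p_b),
        "NPMI": math.log(p_a * p_b) / math.log(p_ab) - 1,
    }


# Toy numbers chosen for easy arithmetic, not real data.
for name, value in collocation_measures(100, 400, 50, background=10_000).items():
    print(f"{name}\t{value:.4f}")
```

C, J, SD and SS depend only on the three sizes, whereas PMI and NPMI also depend on the background size through the probabilities.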

Each subcommand has its own help message. For example, to show the options of the overlap subcommand:

cobind.py overlap -h or cobind.py overlap --help

usage: cobind.py overlap [-h] [--nameA NAMEA] [--nameB NAMEB] [-n ITER]
                         [-f SUBSAMPLE] [-b BGSIZE] [-o] [-l log_file] [-d]
                         input_A.bed input_B.bed

positional arguments:
  input_A.bed           Genomic regions in BED, BED-like or bigBed format. The
                        BED-like format includes:'bed3', 'bed4', 'bed6',
                        'bed12', 'bedgraph', 'narrowpeak', 'broadpeak',
                        'gappedpeak'. BED and BED-like format can be plain
                        text, compressed (.gz, .z, .bz, .bz2, .bzip2) or
                        remote (http://, https://, ftp://) files. Do not
                        compress BigBed format. BigBed file can also be a
                        remote file.
  input_B.bed           Genomic regions in BED, BED-like or bigBed format. The
                        BED-like format includes:'bed3', 'bed4', 'bed6',
                        'bed12', 'bedgraph', 'narrowpeak', 'broadpeak',
                        'gappedpeak'. BED and BED-like format can be plain
                        text, compressed (.gz, .z, .bz, .bz2, .bzip2) or
                        remote (http://, https://, ftp://) files. Do not
                        compress BigBed format. BigBed file can also be a
                        remote file.

options:
  -h, --help            show this help message and exit
  --nameA NAMEA         Name to represent 1st set of genomic interval. If not
                        specified (None), the file name ("input_A.bed") will
                        be used.
  --nameB NAMEB         Name to represent the 2nd set of genomic interval. If
                        not specified (None), the file name ("input_B.bed")
                        will be used.
  -n ITER, --ndraws ITER
                        Times of resampling to estimate confidence intervals.
                        Set to '0' to turn off resampling. For the resampling
                        process to work properly, overlapped intervals in each
                        bed file must be merged. (default: 20)
  -f SUBSAMPLE, --fraction SUBSAMPLE
                        Resampling fraction. (default: 0.75)
  -b BGSIZE, --background BGSIZE
                        The size of the cis-regulatory genomic regions. This
                        is about 1.4 Gb for the human genome. (default:
                        1400000000)
  -o, --save            If set, will save peak-wise coefficients to files
                        ("input_A_peakwise_scores.tsv" and
                        "input_B_peakwise_scores.tsv").
  -l log_file, --log log_file
                        This file is used to save the log information. By
                        default, if no file is specified (None), the log
                        information will be printed to the screen.
  -d, --debug           Print detailed information for debugging.
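The --ndraws and --fraction options describe a resampling procedure: a fraction of the intervals is drawn repeatedly and the coefficient is recomputed for each draw. The sketch below illustrates that idea for the collocation coefficient C. It is not cobind's implementation. The function names, the use of (start, end) tuples on a single chromosome, the choice to subsample both sets, and the normal-approximation 95% interval are all assumptions. It also shows why overlapping intervals within each file should be merged first: the base counts below are only correct when no interval overlaps another in the same set.

```python
import math
import random
import statistics


def covered(intervals):
    """Total bases covered; assumes intervals in the set do not overlap."""
    return sum(end - start for start, end in intervals)


def shared(intervals_a, intervals_b):
    """Bases covered by both sets; assumes each set is already merged."""
    return sum(
        max(0, min(end_a, end_b) - max(start_a, start_b))
        for start_a, end_a in intervals_a
        for start_b, end_b in intervals_b
    )


def collocation(intervals_a, intervals_b):
    """C = |A and B| / (|A|*|B|)**0.5, with sizes measured in bases."""
    return shared(intervals_a, intervals_b) / math.sqrt(
        covered(intervals_a) * covered(intervals_b)
    )


def resample_collocation(intervals_a, intervals_b, ndraws=20, fraction=0.75, seed=None):
    """Return (mean, lower, upper) of C over `ndraws` random subsamples.

    Hypothetical helper. The interval is mean +/- 1.96 standard errors,
    which is an assumption, not necessarily what cobind reports.
    """
    if ndraws < 2:
        raise ValueError("need at least two draws to estimate a spread")
    rng = random.Random(seed)
    k_a = max(1, round(len(intervals_a) * fraction))
    k_b = max(1, round(len(intervals_b) * fraction))
    draws = [
        collocation(rng.sample(intervals_a, k_a), rng.sample(intervals_b, k_b))
        for _ in range(ndraws)
    ]
    mean = statistics.mean(draws)
    half_width = 1.96 * statistics.stdev(draws) / math.sqrt(ndraws)
    return mean, mean - half_width, mean + half_width


# Toy intervals, not real data.
set_a = [(0, 100), (200, 300), (400, 500), (600, 700)]
set_b = [(50, 150), (250, 350), (450, 550), (650, 750)]
print(resample_collocation(set_a, set_b, ndraws=20, fraction=0.75, seed=1))
```

Setting fraction to 1.0 in this sketch makes every draw identical to the full data set, so the interval collapses to a single value.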