1. Overview
1.1. Subcommands description
cobind is a Python package designed to quantify the “overlapping” or “collocation” of genomic intervals.
Subcommand | Description |
---|---|
overlap | Calculate the collocation coefficient (C). |
jaccard | Calculate the Jaccard similarity coefficient (J). |
dice | Calculate the Sørensen–Dice coefficient (SD). |
simpson | Calculate the Szymkiewicz–Simpson coefficient (SS). |
pmi | Calculate the pointwise mutual information (PMI). |
npmi | Calculate the normalized pointwise mutual information (NPMI). |
cooccur | Evaluate if two sets of genomic regions are significantly overlapped. |
covary | Calculate the covariance of binding intensities between two sets of genomic intervals. |
srog | Report the code of Spatial Relation Of Genomic (SROG) regions. |
stat | Wrapper function. Calculate C, J, SD, SS, PMI, and NPMI. |
zscore | Calculate the overall Zscore of C, J, SD, SS, PMI, and NPMI. |
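To make the idea of a spatial-relation code concrete, here is a minimal Python sketch that classifies one interval against another using the six SROG code names listed in the cobind.py help text ('disjoint', 'touch', 'equal', 'overlap', 'contain', 'within'). The function name `srog_code`, the half-open BED-style coordinates, and the exact meaning given to each code are assumptions made for illustration; this is not cobind's own implementation, and it ignores chromosome names.

```python
def srog_code(a_start, a_end, b_start, b_end):
    """Classify interval A relative to interval B.

    Hypothetical helper: coordinates are half-open (BED-style), and the
    definition of each code below is an assumption, not taken from cobind.
    """
    if a_start >= a_end or b_start >= b_end:
        raise ValueError("intervals must have positive length")
    if a_end < b_start or b_end < a_start:
        return "disjoint"          # a gap separates A and B
    if a_end == b_start or b_end == a_start:
        return "touch"             # adjacent, sharing only a boundary
    if a_start == b_start and a_end == b_end:
        return "equal"             # identical coordinates
    if a_start <= b_start and a_end >= b_end:
        return "contain"           # A fully covers B
    if b_start <= a_start and b_end >= a_end:
        return "within"            # A lies inside B
    return "overlap"               # partial overlap


# Toy pairs of (A, B) intervals on the same chromosome.
pairs = [
    ((0, 10), (20, 30)),
    ((0, 10), (10, 20)),
    ((0, 10), (0, 10)),
    ((0, 10), (5, 15)),
    ((0, 10), (2, 5)),
    ((2, 5), (0, 10)),
]
for (a, b) in pairs:
    print(a, b, srog_code(*a, *b))
```

The order of the checks matters in this sketch: 'equal' is tested before 'contain' and 'within' so that identical intervals are not reported as containing each other.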
1.2. Usage
Print all available subcommands and their descriptions:
cobind.py -h
or
cobind.py --help
usage: cobind.py [-h] [-v]
{overlap,jaccard,dice,simpson,pmi,npmi,cooccur,covary,srog,stat,zscore}
...
**cobind: collocation analyses of genomic regions**
positional arguments:
{overlap,jaccard,dice,simpson,pmi,npmi,cooccur,covary,srog,stat,zscore}
Sub-command description:
overlap Calculate the collocation coefficient (C) between two
sets of genomic regions. C = |A and B| /
(|A|*|B|)**0.5
jaccard Calculate the Jaccard similarity coefficient (J)
between two sets of genomic regions. J = |A and B| /
|A or B|
dice Calculate the Sørensen–Dice coefficient (SD) between
two sets of genomic regions. SD = 2*|A and B| / (|A| +
|B|)
simpson Calculate the Szymkiewicz–Simpson coefficient (SS)
between two sets of genomic regions. SS = |A and B| /
min(|A|, |B|)
pmi Calculate the pointwise mutual information (PMI)
between two sets of genomic regions. PMI = log(p(|A
and B|)) - log(p(|A|)) - log(p(|B|))
npmi Calculate the normalized pointwise mutual information
(NPMI) between two sets of genomic regions. NPMI =
log(p(|A|)*p(|B|)) / log(p(|A and B|)) - 1
cooccur Evaluate if two sets of genomic regions are
significantly co-occurred in given background regions.
covary Calculate the covariance (Pearson, Spearman and
Kendall coefficients) of binding intensities between
two sets of genomic regions.
srog Report the code of Spatial Relation Of Genomic (SROG)
regions. SROG codes include
'disjoint','touch','equal','overlap', 'contain',
'within'.
stat Wrapper function. Report basic statistics of genomic
regions, and calculate overlapping measurements
(including "C", "J", "SD", "SS", "PMI", "NPMI"), without
bootstrap resampling or generating peakwise
measurements.
zscore Calculate Z-score of six overlapping measurements
inlcuding ("C", "J", "SD", "SS", "PMI", "NPMI"),
to provide an overall measurement of the
collocation strength.
options:
-h, --help show this help message and exit
-v, --version show program's version number and exit
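The six formulas in the help text above can be checked by hand with a few lines of Python. The sketch below is illustrative only and is not cobind's code: the function name `collocation_measures` is hypothetical, |A|, |B| and |A and B| are treated as total sizes in bp, and p(X) is assumed to be the size of X divided by the background size (1.4 Gb, the default of the -b option).

```python
import math


def collocation_measures(size_a, size_b, size_overlap, background=1_400_000_000):
    """Return C, J, SD, SS, PMI and NPMI following the help-text formulas.

    Assumption (not stated in the help text): p(X) = size of X / background.
    """
    if min(size_a, size_b, size_overlap) <= 0:
        raise ValueError("sizes must be positive for the log-based measures")
    union = size_a + size_b - size_overlap      # |A or B|
    p_a = size_a / background
    p_b = size_b / background
    p_ab = size_overlap / background
    return {
        "C": size_overlap / math.sqrt(size_a * size_b),
        "J": size_overlap / union,
        "SD": 2 * size_overlap / (size_a + size_b),
        "SS": size_overlap / min(size_a, size_b),
        "PMI": math.log(p_ab) - math.log(p_a) - math.log(p_b),
        "NPMI": math.log(p_a * p_b) / math.log(p_ab) - 1,
    }


# Toy numbers: A covers 4 Mb, B covers 1 Mb, and they share 0.5 Mb.
scores = collocation_measures(size_a=4_000_000, size_b=1_000_000, size_overlap=500_000)
for name, value in scores.items():
    print(f"{name}\t{value:.4f}")
```

With these toy sizes, C is 0.25, SD is 0.2 and SS is 0.5. Only PMI and NPMI depend on the background size, which is why the -b option matters for those two measures.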
Each subcommand has its own help message. For example, to see the help for the overlap subcommand:
cobind.py overlap -h
or
cobind.py overlap --help
usage: cobind.py overlap [-h] [--nameA NAMEA] [--nameB NAMEB] [-n ITER]
[-f SUBSAMPLE] [-b BGSIZE] [-o] [-l log_file] [-d]
input_A.bed input_B.bed
positional arguments:
input_A.bed Genomic regions in BED, BED-like or bigBed format. The
BED-like format includes:'bed3', 'bed4', 'bed6',
'bed12', 'bedgraph', 'narrowpeak', 'broadpeak',
'gappedpeak'. BED and BED-like format can be plain
text, compressed (.gz, .z, .bz, .bz2, .bzip2) or
remote (http://, https://, ftp://) files. Do not
compress BigBed foramt. BigBed file can also be a
remote file.
input_B.bed Genomic regions in BED, BED-like or bigBed format. The
BED-like format includes:'bed3', 'bed4', 'bed6',
'bed12', 'bedgraph', 'narrowpeak', 'broadpeak',
'gappedpeak'. BED and BED-like format can be plain
text, compressed (.gz, .z, .bz, .bz2, .bzip2) or
remote (http://, https://, ftp://) files. Do not
compress BigBed foramt. BigBed file can also be a
remote file.
options:
-h, --help show this help message and exit
--nameA NAMEA Name to represent 1st set of genomic interval. If not
specified (None), the file name ("input_A.bed") will
be used.
--nameB NAMEB Name to represent the 2nd set of genomic interval. If
not specified (None), the file name ("input_B.bed")
will be used.
-n ITER, --ndraws ITER
Times of resampling to estimate confidence intervals.
Set to '0' to turn off resampling. For the resampling
process to work properly, overlapped intervals in each
bed file must be merged. (default: 20)
-f SUBSAMPLE, --fraction SUBSAMPLE
Resampling fraction. (default: 0.75)
-b BGSIZE, --background BGSIZE
The size of the cis-regulatory genomic regions. This
is about 1.4Gb For the human genome. (default:
1400000000)
-o, --save If set, will save peak-wise coefficients to files
("input_A_peakwise_scores.tsv" and
"input_B_peakwise_scores.tsv").
-l log_file, --log log_file
This file is used to save the log information. By
default, if no file is specified (None), the log
information will be printed to the screen.
-d, --debug Print detailed information for debugging.
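The -n and -f options describe a resampling scheme: draw a fraction of the intervals several times and recompute the coefficient on each draw. The Python sketch below illustrates that idea for the collocation coefficient C on toy data. It is an assumption-laden illustration, not cobind's implementation: all function names are hypothetical, intervals are on a single chromosome, each list is already merged and sorted, and sampling is done without replacement.

```python
import math
import random
import statistics


def total_size(intervals):
    """Total bp covered by a list of merged (start, end) intervals."""
    return sum(end - start for start, end in intervals)


def overlap_size(a, b):
    """Total bp shared by two sorted, merged interval lists."""
    i = j = shared = 0
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if start < end:
            shared += end - start
        # Advance whichever interval ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return shared


def collocation(a, b):
    """C = |A and B| / (|A|*|B|)**0.5, as given in the help text."""
    return overlap_size(a, b) / math.sqrt(total_size(a) * total_size(b))


def resample_collocation(a, b, ndraws=20, fraction=0.75, seed=0):
    """Recompute C on random subsets of A and B (assumed scheme)."""
    rng = random.Random(seed)
    draws = []
    for _ in range(ndraws):
        sub_a = sorted(rng.sample(a, max(1, int(len(a) * fraction))))
        sub_b = sorted(rng.sample(b, max(1, int(len(b) * fraction))))
        draws.append(collocation(sub_a, sub_b))
    return draws


# Toy, already-merged intervals on one chromosome.
regions_a = [(0, 100), (200, 300), (400, 500), (600, 700)]
regions_b = [(50, 150), (250, 350), (450, 550), (900, 1000)]

full_c = collocation(regions_a, regions_b)
draws = resample_collocation(regions_a, regions_b)
print(f"C on all intervals: {full_c:.3f}")
print(f"mean over draws:    {statistics.mean(draws):.3f}")
print(f"stdev over draws:   {statistics.stdev(draws):.3f}")
```

The overlap routine relies on each list being merged, which mirrors the requirement in the -n description that overlapped intervals in each BED file be merged before resampling. The spread of the resampled values can then be summarized in whatever way suits the analysis, for example as a confidence interval.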