File formats

Input

The input contact map is stored as a hash structure keyed by bin pairs. Each query is mapped to the corresponding key based on its position. The bin size is either specified by the user (-in_hic_resol for .hic) or determined by the input file (bin-contact pair files).
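The lookup described above can be sketched as follows. This is a minimal illustration, not the HiCmapTools implementation: the names bin_index and pair_key, the fixed-width binning, the order-independent key, and the stored value are all assumptions made for the example.

```python
# Minimal sketch: a contact map as a dict keyed by bin pairs.
# bin_index, pair_key and the fixed-width binning are illustrative assumptions.

def bin_index(position, resolution):
    """Map a genomic coordinate to the index of the fixed-width bin containing it."""
    return position // resolution

def pair_key(chrom, pos1, pos2, resolution):
    """Build an order-independent key for the pair of bins holding pos1 and pos2."""
    b1, b2 = sorted((bin_index(pos1, resolution), bin_index(pos2, resolution)))
    return (chrom, b1, b2)

resolution = 10000  # same meaning as -in_hic_resol
contact_map = {("2L", 59, 61): 12.0}  # made-up frequency, for illustration only

# A query pair is binned into a key, then looked up in the hash.
key = pair_key("2L", 615000, 594629, resolution)
print(key, contact_map.get(key, 0.0))
```

Sorting the two bin indices means that a pair queried as (A, B) or (B, A) resolves to the same key, which is one simple way to store a symmetric map only once.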

HiC map

a bin-contact pair format

-in_bin

A bin file defines the chromosome and the start and end positions of each bin. Example: fly_30k.cbins (D. mel in 30k bins).

cbin   chr from.coord  to.coord
 1     2L  6000        7000
 2     2L  7000        8000
 3     2L  8000        9000
 4     2L  9000        10000
 5     2L  12000       13000
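A bin file like the one above can be read with a few lines of Python. The sketch below is only an illustration: read_cbins and find_bin are hypothetical helpers, and treating each bin as a half-open interval [from.coord, to.coord) is an assumption.

```python
# Sketch only: parse a .cbins table and find the bin that contains a position.
# read_cbins and find_bin are hypothetical helpers, not part of HiCmapTools.

CBINS_TEXT = """\
cbin   chr from.coord  to.coord
1     2L  6000        7000
2     2L  7000        8000
3     2L  8000        9000
4     2L  9000        10000
5     2L  12000       13000
"""

def read_cbins(lines):
    """Return a list of (cbin, chrom, start, end) tuples, skipping the header."""
    bins = []
    rows = iter(lines)
    next(rows)  # header line
    for line in rows:
        fields = line.split()
        if len(fields) < 4:
            continue  # ignore blank or malformed lines
        bins.append((int(fields[0]), fields[1], int(fields[2]), int(fields[3])))
    return bins

def find_bin(bins, chrom, position):
    """Return the cbin id whose [start, end) interval holds position, else None."""
    for cbin, bin_chrom, start, end in bins:
        if bin_chrom == chrom and start <= position < end:
            return cbin
    return None

bins = read_cbins(CBINS_TEXT.splitlines())
print(find_bin(bins, "2L", 7500))
print(find_bin(bins, "2L", 11000))  # falls in a gap between listed bins
```

A linear scan keeps the example short; for genome-wide bin files, a per-chromosome sorted list with binary search would be the more usual choice.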

-in_map

A contact map contains Hi-C contact frequencies indexed by bins, in text format. Example: fly_30k.n_contact.

cbin1   cbin2   expected_count  observed_count
 1   1   0.077080    50
 1   2   0.389912    314
 1   3   0.493750    163
 1   4   0.560505    169
 1   5   0.368884    79

  • expected_count : the expected number of contacts between the two genomic loci (bins) according to a model. It can be 1 if no model is applied.

  • observed_count : the observed number of contacts between the two genomic loci (bins) in the Hi-C data
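To make the two count columns concrete, here is a small sketch that loads a .n_contact table into a dictionary keyed by (cbin1, cbin2). The helper names are hypothetical, and the observed/expected ratio is shown only as one common way to normalize; it is an assumption, not necessarily how HiCmapTools computes its normalized values.

```python
# Sketch only: load .n_contact rows into a dict keyed by (cbin1, cbin2).
# read_n_contact and obs_over_exp are hypothetical helpers.

N_CONTACT_TEXT = """\
cbin1   cbin2   expected_count  observed_count
1   1   0.077080    50
1   2   0.389912    314
1   3   0.493750    163
"""

def read_n_contact(lines):
    """Return {(cbin1, cbin2): (expected_count, observed_count)}."""
    contacts = {}
    rows = iter(lines)
    next(rows)  # header line
    for line in rows:
        fields = line.split()
        if len(fields) < 4:
            continue  # ignore blank or malformed lines
        key = (int(fields[0]), int(fields[1]))
        contacts[key] = (float(fields[2]), float(fields[3]))
    return contacts

def obs_over_exp(contacts, cbin1, cbin2):
    """Observed/expected ratio for one bin pair (assumed normalization)."""
    expected, observed = contacts[(cbin1, cbin2)]
    return observed / expected

contacts = read_n_contact(N_CONTACT_TEXT.splitlines())
print(contacts[(1, 2)])
print(round(obs_over_exp(contacts, 1, 2), 3))
```

With expected_count set to 1 (no model), this ratio simply equals the observed count.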

Parsing the text format is slow, so we provide bin/genBinMap to convert the text format (.n_contact) into a binary format (.binmap), which can also be passed to -in_map. Example: fly_30k.binmap.

genBinMap [options] -in_ncontact input.n_contact -out_binmap out.binmap
>bin/genBinMap -in_ncontact examples/fly_30k.n_contact -out_binmap examples/fly_30k.binmap
>hicmaptools -in_map examples/fly_30k.binmap -in_bin examples/fly_30k.cbins -bait examples/bait.bed -output baitTest.tsv

.hic format

The .hic format is generated by Juicer. The .hic parser is adapted from straw.

-in_hic

A .hic file. The example was generated by applying Juicer to the Hi-C FASTQ data of the Galaxy training.

-in_hic_norm (optional)

To extract data based on a specified normalization method, NONE, VC, VC_SQRT, or KR (default: NONE).

-in_hic_resol (optional)

To extract data at a specified resolution, e.g., 5000, 10000, or 50000 (default: 10000).
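The three .hic options can be combined into one command line. The sketch below only assembles and prints the argument list; it does not run anything. The function build_hic_command and the file name sample.hic are hypothetical, and combining the -in_hic options with -bait and -output is an assumption modeled on the .binmap example above.

```python
# Sketch only: assemble a hicmaptools command line for a .hic input.
# build_hic_command and "sample.hic" are hypothetical; the flags are the ones
# documented in this section, and their combination is an assumption.

ALLOWED_NORMS = ("NONE", "VC", "VC_SQRT", "KR")

def build_hic_command(hic_path, query_bed, output, norm="NONE", resol=10000):
    """Return the argument list, validating the normalization and resolution."""
    if norm not in ALLOWED_NORMS:
        raise ValueError("norm must be one of: " + ", ".join(ALLOWED_NORMS))
    if int(resol) <= 0:
        raise ValueError("resol must be a positive number of base pairs")
    return [
        "hicmaptools",
        "-in_hic", hic_path,
        "-in_hic_norm", norm,
        "-in_hic_resol", str(int(resol)),
        "-bait", query_bed,
        "-output", output,
    ]

cmd = build_hic_command("sample.hic", "examples/bait.bed", "baitTest.tsv",
                        norm="KR", resol=5000)
print(" ".join(cmd))
```

Whether a given resolution or normalization is available depends on what was stored in the .hic file when it was generated.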

Query file

The query file is in BED format; only the first three columns are required. An example for each query mode is listed below. Although the biological scenarios of the examples are mainly based on Drosophila, HiCmapTools can handle other species (e.g., the example of the mouse Tox gene in the bait query mode).

  • bait: bait.bed a PRE binding site, Tox_mm10.bed mouse Tox gene

  • local: local.bed a PcG TAD

  • loop: loop.bed the Antp gene

  • pair: pair.bed a pair of insulator binding sites

  • sites: sites.bed a list of insulator binding sites in range 3L:10000000-11000000

  • submap: submap.bed a region containing the Antp-BX long-range contact

  • TAD: TAD.bed selected TADs in range 2L:2000000-3000000
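Since only the first three BED columns are needed, a query file can be read as sketched below. The helper read_bed is hypothetical, and the name in the fourth column of the sample line is made up; the coordinates are reused from the output example in this document.

```python
# Sketch only: read the first three BED columns (chrom, start, end).
# read_bed is a hypothetical helper; extra columns are ignored.

def read_bed(lines):
    """Return a list of (chrom, start, end) tuples from BED-formatted lines."""
    regions = []
    for line in lines:
        line = line.strip()
        # Skip blank lines, comments, and optional track/browser header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        regions.append((chrom, int(start), int(end)))
    return regions

bed_lines = [
    "# example query",
    "2L\t594629\t595145\tmade_up_name",  # fourth column is illustrative only
    "3L\t10000000\t11000000",
]
for region in read_bed(bed_lines):
    print(region)
```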

Illustration of query modes

Output

There are two output files. You can use tools/visualPermutationTest.R to examine the query's contact frequency against the null hypothesis (shuffle test).

  • output.tsv : the contact frequency of the regions of interest

      • sum_* : the Hi-C contact frequency of the query

      • rand_* : the contact frequency from the shuffle test

      • divide_* : the ratio sum/rand

      • rank_* : the rank of the Hi-C frequency among the shuffle-test results. The smaller the rank, the stronger the query frequency (e.g., rank_nor 0.600 = top 60%).

index  chrom   start   end     sum_obs sum_exp sum_nor rand_obs        rand_exp        rand_nor        divide_obs      divide_exp      divide_nor      rank_obs        rank_exp        rank_nor
1      2L      594629  595145  47916.000       459.715 2380.531        32618.180       314.679 2525.479        1.469   1.461   0.943   0.100   0.140   0.600

  • output_random.txt : the observed, expected, and normalized contact intensities of the null hypothesis. The second row is the query frequency; the shuffled results start from the third row.

random_obs,random_exp,random_nor
47916,459.715,2380.53
19632,158.539,2956.25
57574,448.25,2832.44
7074,60.7897,3029.22
33009,246.588,3311.8
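The layout of the random file lends itself to a quick check of where the query falls among the shuffled values. The sketch below is an assumption about how such a rank could be computed (the fraction of shuffled values at least as large as the query); it is not taken from the HiCmapTools source, and read_random and empirical_rank are hypothetical helpers. Because it uses only the few rows shown above, its results will not match the rank_* columns of output.tsv.

```python
# Sketch only: compare the query row with the shuffled rows of the random file.
# The rank definition below is an assumption, not the HiCmapTools formula.

RANDOM_TEXT = """\
random_obs,random_exp,random_nor
47916,459.715,2380.53
19632,158.539,2956.25
57574,448.25,2832.44
7074,60.7897,3029.22
33009,246.588,3311.8
"""

def read_random(lines):
    """Return (header, query_row, shuffled_rows) from the random-file lines."""
    rows = [line.strip() for line in lines if line.strip()]
    header = rows[0].split(",")
    values = [[float(x) for x in row.split(",")] for row in rows[1:]]
    return header, values[0], values[1:]  # second file row is the query

def empirical_rank(query_value, shuffled_values):
    """Fraction of shuffled values >= the query; smaller means a stronger query."""
    hits = sum(1 for value in shuffled_values if value >= query_value)
    return hits / len(shuffled_values)

header, query, shuffles = read_random(RANDOM_TEXT.splitlines())
for col, name in enumerate(header):
    column = [row[col] for row in shuffles]
    print(name, empirical_rank(query[col], column))
```

For a visual comparison of the same data, the R script tools/visualPermutationTest.R mentioned above is the tool the project provides.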