Open Access
Method
Volum
20al.
et
Zhang
0 8 e 9, Issue 9, Article R137
Model-based Analysis of ChIP-Seq (MACS)
Yong Zhang¤ *, Tao Liu ¤ *, Clifford A Meyer *, J érôm e Eeckhoute †,
David S J ohnson ‡, Bradley E Bernstein §¶ , Chad Nusbaum ¶ ,
Richard M Myers ¥ , Myles Brown †, Wei Li# and X Shirley Liu *
Addresses: *Departm ent of Biostatistics and Com putational Biology, Dana-Farber Cancer Institute and Harvard School of Public Health, 44
Binney Street, Boston, MA 0 2115, USA. †Division of Molecular and Cellular Oncology, Departm ent of Medical Oncology, Dana-Farber Cancer
Institute and Departm ent of Medicine, Brigham and Wom en's Hospital and Harvard Medical School, 44 Binney Street, Boston, MA 0 2115, USA.
‡Gene Security Network, Inc., 2686 Middlefield Road, Redwood City, CA 940 63, USA. §Molecular Pathology Unit and Center for Cancer
Research, Massachusetts General Hospital and Departm ent of Pathology, Harvard Medical School, 13th Street, Charlestown, MA 0 2129, USA.
¶ Broad Institute of Harvard and MIT, 7 Cam bridge Center, Cam bridge, MA, 0 2142, USA. ¥ Departm ent of Genetics, Stanford University Medical
Center, Stanford, CA 9430 5, USA. # Division of Biostatistics, Dan L Duncan Cancer Center, Departm ent of Molecular and Cellular Biology,
Baylor College of Medicine, One Baylor Plaza, Houston, TX 770 30 , USA.
¤ These authors contributed equally to this work.
Correspondence: Wei Li. Em ail: wl1@bcm .edu. X Shirley Liu. Em ail: xsliu@jim m y.harvard.edu
Published: 17 September 2008
Genome Biology 2008, 9:R137 (doi:10.1186/gb-2008-9-9-r137)
Received: 4 August 2008
Revised: 3 September 2008
Accepted: 17 September 2008
The electronic version of this article is the complete one and can be
found online at http://genomebiology.com/2008/9/9/R137
© 2008 Zhang et al.; licensee BioMed Central Ltd.
This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which
permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
ChIP-Seq analysis
<p>MACS
perform s m odel-based analysis of ChIP-Seq data generated by short read sequencers.</ p>
Abstract
We present Model-based Analysis of ChIP-Seq data, MACS, which analyzes data generated by short
read sequencers such as Solexa's Genome Analyzer. MACS empirically models the shift size of
ChIP-Seq tags, and uses it to improve the spatial resolution of predicted binding sites. MACS also
uses a dynamic Poisson distribution to effectively capture local biases in the genome, allowing for
more robust predictions. MACS compares favorably to existing ChIP-Seq peak-finding algorithms,
and is freely available.
Background
The determ ination of the 'cistrom e', the genom e-wide set of
in vivo cis-elem ents bound by trans-factors [1], is necessary
to determ ine the genes that are directly regulated by those
trans-factors. Chrom atin im m unoprecipitation (ChIP) [2]
coupled with genom e tiling m icroarrays (ChIP-chip) [3,4]
and sequencing (ChIP-Seq) [5-8] have becom e popular techniques to identify cistrom es. Although early ChIP-Seq efforts
were lim ited by sequencing throughput and cost [2,9], trem endous progress has been achieved in the past year in the
developm ent of next generation m assively parallel sequencing. Tens of millions of short tags (25-50 bases) can now be
sim ultaneously sequenced at less than 1% the cost of tradi-
tional Sanger sequencing methods. Technologies such as Illum ina's Solexa or Applied Biosystem s' SOLiD™ have m ade
ChIP-Seq a practical and potentially superior alternative to
ChIP-chip [5,8 ].
While providing several advantages over ChIP-chip, such as
less starting m aterial, lower cost, and higher peak resolution,
ChIP-Seq also poses challenges (or opportunities) in the analysis of data. First, ChIP-Seq tags represent only the ends of
the ChIP fragm ents, instead of precise protein-DNA binding
sites. Although tag strand inform ation and the approxim ate
distance to the precise binding site could help im prove peak
resolution, a good tag to site distance estim ate is often
Genome Biology 2008, 9:R137
http://genomebiology.com/2008/9/9/R137
Genome Biology 2008,
unknown to the user. Second, ChIP-Seq data exhibit regional
biases along the genom e due to sequencing and m apping
biases, chrom atin structure and genom e copy num ber variations [10 ]. These biases could be m odeled if m atching control
sam ples are sequenced deeply enough. However, am ong the
four recently published ChIP-Seq studies [5-8], one did not
have a control sam ple [5] and only one of the three with control samples system atically used them to guide peak finding
[8]. That m ethod requires peaks to contain significantly
enriched tags in the ChIP sam ple relative to the control,
although a sm all ChIP peak region often contains too few control tags to robustly estim ate the background biases.
only 126 bp (Figure 1a; suggesting a tag shift size of 63 bp),
despite a sonication size (bandw idth) of around 50 0 bp and
Solexa size-selection of around 20 0 bp. Since the FKHR m otif
sequence dictates the precise FoxA1 binding location, the true
distribution of d could be estim ated by aligning the tags by the
FKHR m otif (122 bp; Figure 1b), which gives a sim ilar result
to the MACS m odel. When applied to NRSF and CTCF ChIPSeq, MACS also estimates a reasonable d solely from the tag
distribution: for NRSF ChIP-Seq the MACS m odel estim ated
d as 96 bp compared to the motif estimate of 70 bp; applied to
CTCF ChIP-Seq data the MACS m odel estim ated a d of 76 bp
com pared to the m otif estim ate of 62 bp.
Here, we present Model-based Analysis of ChIP-Seq data,
MACS, which addresses these issues and gives robust and
high resolution ChIP-Seq peak predictions. We conducted
ChIP-Seq of FoxA1 (hepatocyte nuclear factor 3α) in MCF7
cells for com parison with FoxA1 ChIP-chip [1] and identification of features unique to each platform . When applied to
three hum an ChIP-Seq datasets to identify binding sites of
FoxA1 in MCF7 cells, NRSF (neuron-restrictive silencer factor) in J urkat T cells [8], and CTCF (CCCTC-binding factor) in
CD4 + T cells [5] (sum m arized in Table S1 in Additional data
file 1), MACS gives results superior to those produced by
other published ChIP-Seq peak finding algorithm s [8,11,12].
Peak detection
Results
Modeling the shift size of ChIP-Seq tags
ChIP-Seq tags represent the ends of fragm ents in a ChIPDNA library and are often shifted towards the 3' direction to
better represent the precise protein-DNA interaction site. The
size of the shift is, however, often unknown to the experim enter. Since ChIP-DNA fragm ents are equally likely to be
sequenced from both ends, the tag density around a true
binding site should show a bim odal enrichm ent pattern, with
Watson strand tags enriched upstream of binding and Crick
strand tags enriched downstream . MACS takes advantage of
this bim odal pattern to em pirically m odel the shifting size to
better locate the precise binding sites.
Given a sonication size (bandw idth) and a high-confidence
fold-enrichm ent (m fold), MACS slides 2bandw idth windows
across the genom e to find regions with tags m ore than m fold
enriched relative to a random tag genom e distribution. MACS
random ly sam ples 1,0 0 0 of these high-quality peaks, separates their Watson and Crick tags, and aligns them by the
m idpoint between their Watson and Crick tag centers (Figure
1a) if the Watson tag center is to the left of the Crick tag
center. The distance between the m odes of the Watson and
Crick peaks in the alignm ent is defined as 'd', and MACS shifts
all the tags by d/ 2 toward the 3' ends to the most likely protein-DNA interaction sites.
When applied to FoxA1 ChIP-Seq, which was sequenced with
3.9 m illion uniquely m apped tags, MACS estim ates the d to be
Volume 9, Issue 9, Article R137
Zhang et al. R137.2
For experim ents with a control, MACS linearly scales the total
control tag count to be the same as the total ChIP tag count.
Som etim es the sam e tag can be sequenced repeatedly, m ore
times than expected from a random genom e-wide tag distribution. Such tags m ight arise from biases during ChIP-DNA
am plification and sequencing library preparation, and are
likely to add noise to the final peak calls. Therefore, MACS
rem oves duplicate tags in excess of what is warranted by the
sequencing depth (binom ial distribution p-value <10 -5). For
exam ple, for the 3.9 m illion FoxA1 ChIP-Seq tags, MACS
allows each genom ic position to contain no m ore than one tag
and rem oves all the redundancies.
With the current genom e coverage of m ost ChIP-Seq experim ents, tag distribution along the genom e could be m odeled
by a Poisson distribution [7]. The advantage of this m odel is
that one param eter, λBG, can capture both the m ean and the
variance of the distribution. After MACS shifts every tag by d/
2, it slides 2d windows across the genom e to find candidate
peaks with a significant tag enrichm ent (Poisson distribution
p-value based on λBG, default 10 -5). Overlapping enriched
peaks are m erged, and each tag position is extended d bases
from its center. The location with the highest fragm ent
pileup, hereafter referred to as the sum m it, is predicted as the
precise binding location.
In the control sam ples, we often observe tag distributions
with local fluctuations and biases. For exam ple, at the FoxA1
candidate peak locations, tag counts are well correlated
between ChIP and control sam ples (Figure 1c,d). Many possible sources for these biases include local chrom atin structure,
DNA am plification and sequencing bias, and genom e copy
num ber variation. Therefore, instead of using a uniform λBG
estim ated from the whole genom e, MACS uses a dynam ic
parameter, λlocal, defined for each candidate peak as:
λlocal = m ax(λBG, [λ1k,] λ5k, λ10 k)
where λ1k, λ5k and λ10 k are λ estim ated from the 1 kb, 5 kb or
10 kb window centered at the peak location in the control
sam ple, or the ChIP-Seq sam ple when a control sam ple is not
available (in which case λ1k is not used). λlocal captures the
Genome Biology 2008, 9:R137
http://genomebiology.com/2008/9/9/R137
Genome Biology 2008,
(a)
Watson tags
Crick tags
0.6
0.5
Watson tags
Crick tags
0.4
0.3
Tag percentage (%)
0.3
0.2
0.0
0.0
0.1
0.2
0.4
0.5
d = 122 bp
0.1
Tag percentage (%)
Zhang et al. R137.3
(b)
d = 126 bp
−300
−200
−100
0
100
200
300
−300
−200
Location with respect to the center of Watson and Crick peaks (bp)
−100
0
100
200
300
Location with respect to FKHR motif (bp)
(d)
0
2.7
2.6
2.5
2.4
2.3
100
200
300
400
500
Average tag number in control / kb
2.8
600
(c)
Control tag number (normalized) / 10 kb
Volume 9, Issue 9, Article R137
0
200
400
600
800
1,000
−20
−10
FoxA1 ChIP−Seq tag number / 10 kb
0
10
20
Distance to FoxA1 peak center (kb)
(e)
(f)
70
Motif occurrence in peak centers for FoxA1 ChIP-Seq
Spatial resolution for FoxA1 ChIP-Seq
MACS
Without local lambda
Without tag shifting
45
35
40
Spatial resolution (nt)
60
55
50
FKHR motif (%)
65
45
MACS
Without local lambda
Without tag shifting
1,000
2,000
3,000
4,000
5,000
6,000
7,000
1,000
Number of FoxA1 binding sites
2,000
3,000
4,000
5,000
6,000
7,000
Number of FoxA1 binding sites
Figuremodel
MACS
1
for FoxA1 ChIP-Seq
MACS model for FoxA1 ChIP-Seq. (a,b) The 5' ends of strand-separated tags from a random sample of 1,000 model peaks, aligned by the center of their
Watson and Crick peaks (a) and by the FKHR motif (b). (c) The tag count in ChIP versus control in 10 kb windows across the genome. Each dot
represents a 10 kb window; red dots are windows containing ChIP peaks and black dots are windows containing control peaks used for FDR calculation.
(d) Tag density profile in control samples around FoxA1 ChIP-Seq peaks. (e,f) MACS improves the motif occurrence in the identified peak centers (e) and
the spatial resolution (f) for FoxA1 ChIP-Seq through tag shifting and λlocal. Peaks are ranked by p-value. The motif occurrence is calculated as the
percentage of peaks with the FKHR motif within 50 bp of the peak summit. The spatial resolution is calculated as the average distance from the summit to
the nearest FKHR motif. Peaks with no FKHR motif within 150 bp of the peak summit are removed from the spatial resolution calculation.
Genome Biology 2008, 9:R137
http://genomebiology.com/2008/9/9/R137
Genome Biology 2008,
influence of local biases, and is robust against occasional low
tag counts at sm all local regions. MACS uses λlocal to calculate
the p-value of each candidate peak and rem oves potential
false positives due to local biases (that is, peaks significantly
under λBG, but not under λlocal). Candidate peaks with p-values below a user-defined threshold p-value (default 10 -5) are
called, and the ratio between the ChIP-Seq tag count and λlocal
is reported as the fold_ enrichm ent.
sam ple, whereas it would reach 41.2% if MACS uses a global
λBG. This im plies that the λlocal is critical for ChIP-Seq studies
when m atching control sam ples are not available [5,9].
For a ChIP-Seq experim ent with controls, MACS em pirically
estim ates the false discovery rate (FDR) for each detected
peak using the sam e procedure em ployed in the previous
ChIP-chip peak finders MAT [13] and MA2C [14]. At each pvalue, MACS uses the sam e param eters to find ChIP peaks
over control and control peaks over ChIP (that is, a sam ple
swap). The em pirical FDR is defined as Num ber of control
peaks / Num ber of ChIP peaks. MACS can also be applied to
differential binding between two conditions by treating one of
the sam ples as the control. Since peaks from either sam ple are
likely to be biologically m eaningful in this case, we cannot use
a sam ple swap to calculate FDR, and the data quality of each
sam ple needs to be evaluated against a real control.
Model evaluation
The two key features of MACS are: em pirical m odeling of 'd'
and tag shifting by d/ 2 to putative protein-DNA interaction
site; and the use of a dynam ic λlocal to capture local biases in
the genom e. To evaluate the effectiveness of tag shifting based
on the MACS m odel d, we com pared the perform ance of
MACS to a sim ilar procedure that uses the original tag positions instead of the shifted tag locations. The effectiveness of
the dynam ic λlocal is assessed by com paring MACS to a procedure that uses a uniform λBG from the genome background.
Figure 1e,f show that both the detection specificity, m easured
by the percentage of predicted peaks with a FKHR m otif
within 50 bp of the peak sum m it, and the spatial resolution,
defined as the average distance from the peak sum m it to the
nearest FKHR m otif, are greatly improved by using tag shifting and the dynam ic λlocal. In addition, FoxA1 is known to
cooperatively interact with estrogen receptor in breast cancer
cells [1,15]. As evidence for this, we also observed enrichm ent
for estrogen receptor elem ents (3.1-fold enriched relative to
genom e m otif occurrence) and its half-site (2.7-fold) [15]
within the center 30 0 bp regions of MACS-detected FoxA1
ChIP-Seq peaks.
λlocal is also effective in capturing the local genom ic bias from
a ChIP sam ple alone when a control is not available. To dem onstrate this, we applied MACS to FoxA1 ChIP-Seq and control data separately. Using the sam e param eters, all the
control peaks are, in theory, false positives, so the FDR can be
em pirically estim ated as Num ber of control peaks / Num ber
of ChIP peaks. To identify 7,0 0 0 peaks, the FDR for MACS is
only 0 .4% when a control is available and λlocal is used. To get
7,0 0 0 peaks when a control is not available, the FDR could
still rem ain low at 3.8% if MACS estim ates λlocal from the ChIP
Volume 9, Issue 9, Article R137
Zhang et al. R137.4
Method comparisons
We com pared MACS with three other publicly available ChIPSeq peak finding m ethods, ChIPSeq Peak Finder [8], FindPeaks [11] and QuEST [12]. To com pare their prediction specificity, we swapped the ChIP and control sam ples, and
calculated the FDR of each algorithm as Num ber of control
peaks / Num ber of ChIP peaks using the sam e param eters for
ChIP and control. For FoxA1 and NRSF ChIP-Seq (an FDR for
CTCF is not available due to the lack of control), MACS consistently gave fewer false positives than the other three m ethods (Figure 2a,b).
Determ ining the percentage of predicted peaks associated
with a m otif within 50 bp of the peak center for FoxA1 and
NRSF ChIP-Seq, we found MACS to give consistently higher
m otif occurrences (Figure 2c,d). Evaluating the average distance from peak center to motif, excluding peaks that have no
m otif within 150 bp of the peak center, we found that MACS
predicts peaks with better spatial resolution in m ost cases
(Figure 2e,f). For CTCF, since QuEST does not run on sam ples without controls, we only com pared MACS to ChIPSeq
Peak Finder and FindPeaks. Again, MACS gave both higher
m otif occurrences within 50 bp of the peak center and better
spatial resolutions than other m ethods (Figure S1 in Additional data file 1). In general, MACS not only found m ore
peaks with fewer false positives, but also provided better
binding resolution to facilitate downstream motif discovery.
Comparison of ChIP-Seq to ChIP-chip
A com parison of FoxA1 ChIP-Seq and ChIP-chip revealed the
peak locations to be fairly consistent with each other (Figure
3a). Not surprisingly, the m ajority of ChIP-Seq peaks under a
FDR of 1% (65.4%) were also detected by ChIP-chip (MAT
[13] cutoff at FDR <1% and fold-enrichm ent >2). Am ong the
rem aining 34.6% ChIP-Seq unique peaks, 1,0 45 (13.3%) were
not tiled or only partially tiled on the arrays due to the array
design. Therefore, only 21.4% of ChIP-Seq peaks are indeed
specific to the sequencing platform . Furtherm ore, ChIP-chip
targets with higher fold-enrichments are more likely to be
reproducibly detected by ChIP-Seq with a higher tag count
(Figure 3b). Meanwhile, although the signals of array probes
at the ChIP-Seq specific peak regions are below the peak-calling cutoff, they show m oderate signal enrichm ents that are
significantly higher than the genom ic background (Wilcoxon
p-value <10 -320 ; Figure 3c). Indeed, 835 out of 1,684 ChIPSeq specific peaks could also be detected in ChIP-chip, when
the less stringent FDR cutoff of 5% is used. Another reason
why peaks detected by ChIP-Seq m ay be undetected by ChIPchip is that ChIP-Seq specific peaks are usually slightly
shorter than sim ilar fold-enrichm ent peaks found by both
ChIP-Seq and ChIP-chip (Figure 3d) and m ay not be detectable on the array due to insufficient probe coverage. On the
Genome Biology 2008, 9:R137
http://genomebiology.com/2008/9/9/R137
Genome Biology 2008,
(a)
5
5
NRSF ChIP-Seq (FDR)
4
MACS
ChIPSeq Peak Finder
FindPeaks
QuEST
0
0
1
1
2
3
FDR (%)
3
2
FDR (%)
4
MACS
ChIPSeq Peak Finder
FindPeaks
QuEST
1,000
2,000
3,000
4,000
5,000
6,000
7,000
Number of FoxA1 binding sites
(c)
500
(d)
2,000
2,500
3,000
100
90
80
NRSE motif (%)
60
50
45
50
60
55
MACS
ChIPSeq Peak Finder
FindPeaks
QuEST
70
70
1,500
NRSF ChIP-Seq (motif occurrence)
MACS
ChIPSeq Peak Finder
FindPeaks
QuEST
65
1,000
Number of NRSF binding sites
FoxA1 ChIP-Seq (motif occurrence)
1,000
2,000
3,000
4,000
5,000
6,000
7,000
500
1,000
Number of FoxA1 binding sites
1,500
2,000
2,500
3,000
Number of NRSF binding sites
(e)
(f)
FoxA1 ChIP-Seq (spatial resolution)
NRSF ChIP-Seq (spatial resolution)
15
20
MACS
ChIPSeq Peak Finder
FindPeaks
QuEST
10
40
Spatial resolution (nt)
45
MACS
ChIPSeq Peak Finder
FindPeaks
QuEST
35
Spatial resolution (nt)
Zhang et al. R137.5
(b)
FoxA1 ChIP-Seq (FDR)
FKHR motif (%)
Volume 9, Issue 9, Article R137
1,000
2,000
3,000
4,000
5,000
6,000
7,000
500
Number of FoxA1 binding sites
1,000
1,500
2,000
2,500
3,000
Number of NRSF binding sites
Figure 2 of MACS with ChIPSeq Peak Finder, FindPeaks and QuEST
Comparison
Comparison of MACS with ChIPSeq Peak Finder, FindPeaks and QuEST. (a-f) Shown is the FDR for FoxA1 (a) and NRSF (b) ChIP-Seq, motif occurrence
within 50 bp of the peak centers for FoxA1 (c) and NRSF (d), and the average distance from the peak center to the nearest motif (peaks with no motif
within 150 bp from peak center are removed) for FoxA1 (e) and NRSF (f).
Genome Biology 2008, 9:R137
http://genomebiology.com/2008/9/9/R137
Genome Biology 2008,
Zhang et al. R137.6
(b)
8,270
150
100
5,151
0
2,729
50
ChIP-chip (FDR 1%)
ChIP-Seq (FDR 1%)
FoxA1 ChIP-Seq tag number
200
(a)
Volume 9, Issue 9, Article R137
0~5
5~10
10~15
15~20
20~25
25~30
FoxA1 ChIP-chip MATscore
(c)
(d)
FoxA1 ChIP−Seq peak width
600
800
Overlap
Specific
200
400
Peak width (nt)
15
10
5
0
Mean MATscore
20
25
1000
MATscore for ChIP−Seq peak regions
-5
-320
P-value <10
P-value: 0.042
0
−5
P-value: 1.54 X 10
Background
Specific
Overlapping
<25
25~50
>50
Fold−enrichment
(e)
(f)
-54
100
ChIP−Seq
ChIP−chip
P-value: 3.06 X 10
Overlap
Specific
-97
60
0
0
20
40
100
FKHR motif (%)
80
P-value: 1.33 X 10
Comparison of motif occurrence
50
Spatial resolution (nt)
150
Comparison of spatial resolution
Top 7,000 peaks
FDR 1% peaks
Figure 3 (see legend on next page)
Genome Biology 2008, 9:R137
ChIP−Seq
ChIP−chip
Background
http://genomebiology.com/2008/9/9/R137
Genome Biology 2008,
Volume 9, Issue 9, Article R137
Zhang et al. R137.7
Figure 3 (seeofprevious
Comparison
FoxA1 page)
ChIP-Seq and ChIP-chip
Comparison of FoxA1 ChIP-Seq and ChIP-chip. (a) Overlap between the FoxA1 binding sites detected by ChIP-chip (MAT; FDR <1% and fold-enrichment
>2) and ChIP-Seq (MACS; FDR <1%). Shown are the numbers of regions detected by both platforms (that is, having at least 1 bp in common) or unique to
each platform. (b) The distributions of ChIP-Seq tag number and ChIP-chip MATscore [13] for FoxA1 binding sites identified by both platforms. (c)
MATscore distributions of FoxA1 ChIP-chip at ChIP-Seq/chip overlapping peaks, ChIP-Seq unique peaks, and genome background. For each peak, the
mean MATscore for all probes within the 300 bp region centered at the ChIP-Seq peak summit is used. Genome background is based on MATscores of all
array probes in the FoxA1 ChIP-chip data. (d) Width distributions of FoxA1 ChIP-Seq/chip overlapping peaks and ChIP-Seq unique peaks at different foldenrichments (less than 25, 25 to 50, and larger than 50). (e) Spatial resolution for FoxA1 ChIP-chip and ChIP-Seq peaks. The Wilcoxon test was used to
calculate the p-values for (d) and (e). (f) Motif occurrence within the central 200 bp regions for FoxA1 ChIP-Seq/chip overlapping peaks and platform
unique peaks. Error bars showing standard deviation were calculated from random sampling of 500 peaks ten times for each category. Background motif
occurrences are based on 100,000 randomly selected 200 bp regions in the human genome, excluding regions in genome assembly gaps (containing 'N').
other hand, ChIP-chip specific peak regions also have significantly m ore sequencing tags than the genom ic background
(Wilcoxon p-value <10 -320 ; Figure S2 in Additional data file
1), although with current sequencing depth, those regions
cannot be called as peaks.
Com paring the difference between ChIP-chip and ChIP-Seq
peaks, we find that the average peak width from ChIP-chip is
twice as large as that from ChIP-Seq. The average distance
from peak sum m it to m otif is significantly sm aller in ChIPSeq than ChIP-chip (Figure 3e), dem onstrating the superior
resolution of ChIP-Seq. Under the sam e 1% FDR cutoff, the
FKHR m otif occurrence within the central 20 0 bp from ChIPchip or ChIP-Seq specific peaks is comparable with that from
the overlapping peaks (Figure 3f). This suggests that m ost of
the platform -specific peaks are genuine binding sites. A com parison between NRSF ChIP-Seq and ChIP-chip (Figure S3 in
Additional data file 1) yields sim ilar results, although the
overlapping peaks for NRSF are of m uch better quality than
the platform -specific peaks.
Discussion
ChIP-Seq users are often curious as to whether they have
sequenced deep enough to saturate all the binding sites. In
principle, sequencing saturation should be dependent on the
fold-enrichm ent, since higher-fold peaks are saturated earlier
than lower-fold ones. In addition, due to different cost and
throughput considerations, different users m ight be interested in recovering sites at different fold-enrichm ent cutoffs.
Therefore, MACS produces a saturation table to report, at different fold-enrichm ents, the proportion of sites that could
still be detected when using 90 % to 20 % of the tags. Such
tables produced for FoxA1 (3.9 m illion tags) and NRSF (2.2
m illion tags) ChIP-Seq data sets (Figure S4 in Additional data
file 1; CTCF does not have a control to robustly estim ate foldenrichm ent) show that while peaks with over 60 -fold enrichm ent have been saturated, deeper sequencing could still
recover m ore sites less than 40 -fold enriched relative to the
chromatin input DNA. As sequencing technologies im prove
their throughput, researchers are gradually increasing their
sequencing depth, so this question could be revisited in the
future. For now, we leave it up to individual users to m ake an
inform ed decision on whether to sequence m ore based on the
saturation at different fold-enrichm ent levels.
The d m odeled by MACS suggests that som e short read
sequencers such as Solexa m ay preferentially sequence
shorter fragm ents in a ChIP-DNA pool. This m ay contribute
to the superior resolution observed in ChIP-Seq data, especially for activating transcription and epigenetic factors in
open chrom atin. However, for repressive factors targeting
relatively com pact chrom atin, the target regions m ight be
harder to sonicate into the soluble extract. Furtherm ore, in
the resulting ChIP-DNA, the true targets m ay tend to be
longer than the background DNA in open chrom atin, m aking
them unfavorable for size-selection and sequencing. This
im plies that epigenetic m arkers of closed chrom atin m ay be
harder to ChIP, and even harder to ChIP-Seq. To assess this
potential bias, exam ining the histone m ark ChIP-Seq results
from Mikkelsen et al. [7], we find that while the ChIP-Seq efficiency of the active m ark H3K4m e3 rem ains high as pluripotent cells differentiate, that of repressive m arks H3K27m e3
and H3K9m e3 becom es lower with differentiation (Table S2
in Additional data file 1), even though it is likely that there are
m ore targets for these repressive m arks as cells differentiate.
We caution ChIP-Seq users to adopt m easures to com pensate
for this bias when ChIPing repressive m arks, such as m ore
vigorous sonication, size-selecting slightly bigger fragm ents
for library preparation, or sonicating the ChIP-DNA further
between decrosslinking and library preparation.
MACS calculates the FDR based on the num ber of peaks from
control over ChIP that are called at the sam e p-value cutoff.
This FDR estim ate is m ore robust than calculating the FDR
from randomizing tags along the genome. However, we notice
that when tag counts from ChIP and controls are not balanced, the sam ple with m ore tags often gives m ore peaks even
though MACS normalizes the total tag counts between the
two sam ples (Figure S5 in Additional data file 1). While we
await m ore available ChIP-Seq data with deeper coverage to
understand and overcom e this bias, we suggest to ChIP-Seq
users that if they sequence m ore ChIP tags than controls, the
FDR estim ate of their ChIP peaks m ight be overly optim istic.
Genome Biology 2008, 9:R137
http://genomebiology.com/2008/9/9/R137
Genome Biology 2008,
Conclusion
m ated sonication size (default 30 0 ); --pvalue for p-value
cutoff to call peaks (default 1e-5); --mfold for high-confidence fold-enrichm ent to find m odel peaks for MACS m odeling (default 32); --diag for generating the table to evaluate
sequence saturation (default off).
As developm ents in sequencing technology popularize ChIPSeq, we propose a novel algorithm, MACS, for its data analysis. MACS offers four im portant utilities for predicting protein-DNA interaction sites from ChIP-Seq. First, MACS
im proves the spatial resolution of the predicted sites by
em pirically m odeling the distance d and shifting tags by d/ 2.
Second, MACS uses a dynam ic λlocal param eter to capture
local biases in the genom e and im proves the robustness and
specificity of the prediction. It is worth noting that in addition
to ChIP-Seq, λlocal can potentially be applied to other high
throughput sequencing applications, such as copy num ber
variation and digital gene expression, to capture regional
biases and estim ate robust fold-enrichm ent. Third, MACS
can be applied to ChIP-Seq experim ents without controls,
and to those with controls with im proved perform ance. Last
but not least, MACS is easy to use and provides detailed inform ation for each peak, such as genom e coordinates, p-value,
FDR, fold_ enrichm ent, and sum m it (peak center).
Materials and methods
Volume 9, Issue 9, Article R137
Zhang et al. R137.8
In addition, the user has the option to shift tags by an arbitrary number (--shiftsize) without the MACS model (-nomodel), to use a global lam bda (--nolambda) to call
peaks, and to show debugging and warning m essages (-verbose). If a user has replicate files for ChIP or control, it is
recom m ended to concatenate all replicates into one input file.
The output includes one BED file containing the peak chrom osom e coordinates, and one xls file containing the genom e
coordinates, sum m it, p-value, fold_ enrichm ent and FDR (if
control is available) of each peak. For FoxA1 ChIP-Seq in
MCF7 cells with 3.9 m illion and 5.2 m illion ChIP and control
tags, respectively, it takes MACS 15 seconds to m odel the
ChIP-DNA size distribution and less than 3 m inutes to detect
peaks on a 2 GHz CPU Linux com puter with 2 GB of RAM.
Figure S6 in Additional data file 1 illustrates the whole process with a flow chart.
Dataset
ChIP-Seq data for three factors, NRSF, CTCF, and FoxA1,
were used in this study. ChIP-chip and ChIP-Seq (2.2 m illion
ChIP and 2.8 m illion control uniquely m apped reads, sim plified as 'tags') data for NRSF in J urkat T cells were obtained
from Gene Expression Om nibus (GSM210 637) and J ohnson
et al. [8], respectively. ChIP-Seq (2.9 m illion ChIP tags) data
for CTCF in CD4 + T cells were derived from Barski et al. [5].
Abbreviations
ChIP, chrom atin im m unoprecipitation; CTCF, CCCTC-binding factor; FDR, false discovery rate; FoxA1, hepatocyte
nuclear factor 3α; MACS, Model-based Analysis of ChIP-Seq
data; NRSF, neuron-restrictive silencer factor.
ChIP-chip data for FoxA1 and controls in MCF7 cells were
previously published [1], and their corresponding ChIP-Seq
data were generated specifically for this study. Around 3 ng
FoxA1 ChIP DNA and 3 ng control DNA were used for library
preparation, each consisting of an equim olar m ixture of DNA
from three independent experim ents. Libraries were prepared as described in [8] using a PCR pream plification step
and size selection for DNA fragm ents between 150 and 40 0
bp. FoxA1 ChIP and control DNA were each sequenced with
two lanes by the Illumina/ Solexa 1G Genome Analyzer, and
yielded 3.9 m illion and 5.2 m illion uniquely m apped tags,
respectively.
Authors' contributions
Software implementation
Additional
Figures
Click
here
S1-S6,
for
data
file
and
fileTables
1
S1 and S2.
S2
MACS is im plem ented in Python and freely available with an
open source Artistic License at [16]. It runs from the com m and line and takes the following param eters: -t for treatm ent file (ChIP tags, this is the ONLY required param eter for
MACS) and -c for control file containing m apped tags; -format for input file form at in BED or ELAND (output) form at
(default BED); --name for name of the run (for example,
FoxA1, default NA); --gsize for m appable genom e size to
calculate λBG from tag count (default 2.7G bp, approxim ately
the m appable hum an genom e size); --tsize for tag size
(default 25); --bw for bandw idth, which is half of the esti-
XSL, WL and YZ conceived the project and wrote the paper.
YZ, TL and CAM designed the algorithm , perform ed the
research and im plem ented the software. J E, DSJ , BEB, CN,
RMM and MB perform ed FoxA1 ChIP-Seq experim ents and
contributed to ideas. All authors read and approved the final
m anuscript.
Additional data files
The following additional data are available. Additional data
file 1 contains supporting Figures S1-S6, and supporting
Tables S1 and S2.
Acknowledgements
We thank Barbara Wold, Ting Wang, Jason Lieb, Sevinc Ercan, Julie
Ahringer, and Peter Park for their comments and insights. We also thank
Jeremy Zhenhua Wu for proof reading the manuscript. The project was
partially funded by NIH grants HG004069, HG004270 and DK074967.
References
1.
Lupien M, Eeckhoute J, Meyer CA, Wang Q, Zhang Y, Li W, Carroll
JS, Liu XS, Brown M: FoxA1 translates epigenetic signatures
into enhancer driven lineage-specific transcription. Cell 2008,
132:958-970.
Genome Biology 2008, 9:R137
http://genomebiology.com/2008/9/9/R137
2.
3.
4.
5.
6.
7.
8.
9.
10.
11.
12.
13.
14.
15.
16.
Genome Biology 2008,
Kim TH, Ren B: Genome-wide analysis of protein-DNA interactions. Annu Rev Genomics Hum Genet 2006, 7:81-102.
Iyer VR, Horak CE, Scafe CS, Botstein D, Snyder M, Brown PO:
Genomic binding sites of the yeast cell-cycle transcription
factors SBF and MBF. Nature 2001, 409:533-538.
Ren B, Robert F, Wyrick JJ, Aparicio O, Jennings EG, Simon I, Zeitlinger J, Schreiber J, Hannett N, Kanin E, Volkert TL, Wilson CJ, Bell SP,
Young RA: Genome-wide location and function of DNA binding proteins. Science 2000, 290:2306-2309.
Barski A, Cuddapah S, Cui K, Roh TY, Schones DE, Wang Z, Wei G,
Chepelev I, Zhao K: High-resolution profiling of histone methylations in the human genome. Cell 2007, 129:823-837.
Robertson G, Hirst M, Bainbridge M, Bilenky M, Zhao Y, Zeng T,
Euskirchen G, Bernier B, Varhol R, Delaney A, Thiessen N, Griffith
OL, He A, Marra M, Snyder M, Jones S: Genome-wide profiles of
STAT1 DNA association using chromatin immunoprecipitation and massively parallel sequencing. Nat Methods 2007,
4:651-657.
Mikkelsen TS, Ku M, Jaffe DB, Issac B, Lieberman E, Giannoukos G,
Alvarez P, Brockman W, Kim TK, Koche RP, Lee W, Mendenhall E,
O'Donovan A, Presser A, Russ C, Xie X, Meissner A, Wernig M, Jaenisch R, Nusbaum C, Lander ES, Bernstein BE: Genome-wide maps
of chromatin state in pluripotent and lineage-committed
cells. Nature 2007, 448:553-560.
Johnson DS, Mortazavi A, Myers RM, Wold B: Genome-wide mapping of in vivo protein-DNA interactions. Science 2007,
316:1497-1502.
Wei CL, Wu Q, Vega VB, Chiu KP, Ng P, Zhang T, Shahab A, Yong
HC, Fu Y, Weng Z, Liu J, Zhao XD, Chew JL, Lee YL, Kuznetsov VA,
Sung WK, Miller LD, Lim B, Liu ET, Yu Q, Ng HH, Ruan Y: A global
map of p53 transcription-factor binding sites in the human
genome. Cell 2006, 124:207-219.
Redon R, Ishikawa S, Fitch KR, Feuk L, Perry GH, Andrews TD, Fiegler H, Shapero MH, Carson AR, Chen W, Cho EK, Dallaire S, Freeman JL, Gonzalez JR, Gratacos M, Huang J, Kalaitzopoulos D, Komura
D, MacDonald JR, Marshall CR, Mei R, Montgomery L, Nishimura K,
Okamura K, Shen F, Somerville MJ, Tchinda J, Valsesia A, Woodwark
C, Yang F, et al.: Global variation in copy number in the human
genome. Nature 2006, 444:444-454.
FindPeaks
[http://www.bcgsc.ca/platform/bioinfo/software/find
peaks]
QuEST [http://mendel.stanford.edu/SidowLab/downloads/quest/]
Johnson WE, Li W, Meyer CA, Gottardo R, Carroll JS, Brown M, Liu
XS: Model-based analysis of tiling-arrays for ChIP-chip. Proc
Natl Acad Sci USA 2006, 103:12457-12462.
Song JS, Johnson WE, Zhu X, Zhang X, Li W, Manrai AK, Liu JS, Chen
R, Liu XS: Model-based analysis of two-color arrays (MA2C).
Genome Biol 2007, 8:R178.
Carroll JS, Meyer CA, Song J, Li W, Geistlinger TR, Eeckhoute J, Brodsky AS, Keeton EK, Fertuck KC, Hall GF, Wang Q, Bekiranov S,
Sementchenko V, Fox EA, Silver PA, Gingeras TR, Liu XS, Brown M:
Genome-wide analysis of estrogen receptor binding sites.
Nat Genet 2006, 38:1289-1297.
MACS [http://liulab.dfci.harvard.edu/MACS/]
Genome Biology 2008, 9:R137
Volume 9, Issue 9, Article R137
Zhang et al. R137.9