seqlm:an MDL based method for identifying differentially methylated regions in high density methylation arrays

NIMI-LISANIMI NIMESTE

Kaspar Märtens*a , Raivo Kolde*ac , Kaie Lokk b, Sven Laur a, Jaak Vilo ac a Institute of Computer Science, University of Tartu, Estonia b Institute of Molecular and Cell Biology, University of Tartu, Estonia c Quretec Ltd * These authors contributed equally Contact: [email protected], [email protected] 1. Introduction

4. seqlm method

One of the main goals of large scale methylation studies is to detect differentially methylated loci. In the novel methylation arrays such as Illumina 450K the density of probes is high enough that it is possible to identify differentially methylated regions (DMRs) containing multiple probes from the array. Compared to the standard approach, where the differential methylation is tested on each probe separately, the task is more complex. In addition to measuring differential methylation one has to combine the probes into reasonable regions. Few tools exist for DMR identification from this type of data, but there is no standard approach. We propose a novel method for DMR identification. By fitting linear models within sliding windows of different lengths and choosing the best configuration according to the minimum description length (MDL) principle, we divide the genome into regionswith similar methylation patterns. The resulting regions are scored using a mixed model that takes into account correctly the repeated nature of measurements. Our approach has several advantages. Most importantly, it depends on very few parameters, thus it requires less tuning. The method is implemented as an R package and is available in Github https:// github.com/raivokolde/seqlm.

The seqlm method works in three stages. Stage 1: The genome is divided into smaller pieces based on a genomic m = μ + 0.05*I + ε m = μ + 0.43*I +ε m = μ - 0.73*I + ε distance cutoff. A 1 A Stage 2: In each piece probes are A A 0.5 segmented into regions that have 0.5 A A approximately constant difference A 0 A 0 between the groups of interest. B B −0.5 • In sliding windows with variable sizes B −0.5 B we fit a linear models to the data. −1 B B • For each model we record the B −1 B description length - the amount of bits B B needed to describe the data using the B B model CpG probes in genomic order • Using dynamic programming we find Figure 1: Example of seqlm method of the segmentation that minimizes total segmentation description length Stage 3: We assess the relevance of each segment, by using a mixed model where the classes are a fixed effect and a sample is a random effect. This model takes into account the repeated nature of the consecutive methylation measurements. The segments are ordered by their significance.

A

1.00

● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ●

● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●

● ● ● ●

0.75 0.50

● ● ● ● ●● ● ● ● ● ● ● ●● ● ●● ● ● ● ●● ● ●● ● ● ●● ● ● ● ● ● ●

● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ●● ● ● ● ● ●●● ● ● ● ● ● ● ● ●●●● ● ● ● ● ● ●● ● ● ● ● ● ● ●● ● ● ● ● ●

0.25

● ● ● ● ● ●

● ● ● ● ● ● ● ● ● ● ● ● ●

● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●

● ● ● ● ● ● ● ● ● ● ● ● ● ●

● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●

● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●

● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●

● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●

● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ●● ●

0.00 3162000

● ● ● ● ● ● ● ● ● ●

● ● ● ● ● ● ● ●

● ● ● ● ● ● ● ●

● ● ●

● ● ● ● ● ●

● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●

● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●

● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●

3164000

3166000

B 1.00 0.75 0.50 0.25 1000.00

● ● ● ●



● ●

1.00 0.75 0.50 0.25 0.00

● ● ● ●

● ● ● ●● ●● ●● ●

● ● ● ●

150 100

●● ● ● ● ● ● ● ● ●●

● ● ●

● ● ● ●

● ● ●

● ● ● ●

● ● ●

● ● ● ● ●● ● ●● ● ●● ●



● ● ● ●

● 100 ● ● ● ● ● ●● ● ●● ● ●





● ● ●

● ● ● ● ●

● ● ● ●

200 150

● ● ● ●

150● ● ●

● ● ● ● ● ● ● ●● ● ● ● ● ●

● ● ● ●

● ● ● ●

● ● ● ● ●

200

● ●



250

● ● ● ● ● ● ● ● ● ● ●

● ● ● ● ● ● ● ● 300 ● ● ● ● ● ● ● ● ● ●

A

5. Comparison to other methods There are two R packages available in Bioconductor, that tackle the same problem: bumphunter (Jaffe et.al. 2012) and MethyAnalysis. Both take a similar approach to region finding. Basically they smooth the methylation values using a sliding window, test significance per position and combine significant results together. Compared to bumphunter and MethyAnalysis, seqlm has some advantages: • it has very few input parameters, whereas results of other methods depend strongly on the smoothing window size and effect size cutoff (see Figure 3); • it features less ad hoc statistical methodology for ranking the regions and it does not need to perform permutations like bumphunter. Another package IMA tests the differential methylation significance of predetermined regions, such as gene promoters, CpG islands etc. Drawbacks of IMA (Wang et. al. 2012) compared to seqlm are: • borders of DMRs usually do not coincide with genomic annotations, therefore, IMA can create multiple significant regions out of one DMR (see Figure 3); • different genomic annotations can coincide, creating overlapping regions for IMA (see Figure 3), that complicates the interpretation even further. ● ● ● ● ● ● ● ● ● ● ●

350

● ● ●



250 300 Measurement ● Chip 200 250

Measurement ● Chip

● ●

● ● ●

A

Description Length

The usual approach to find differentially methylated loci is sitewise, i.e. a statistical test is applied to each position separately. However, searching for longer regions is for several reasons more natural. • Methylation is regulated in longer regions (Lienert et.al. 2011) • Correlated structure of methylation is apparent in the data (Figure 1A) • Intermediate CpG sites (not measured by array) follow the same pattern that was identified in the array. (Figure 1B) • Statistical tests have more power to detect differences in longer regions. • The number of results is much smaller, but more informative. • The patterns found in longer regions should be biologically more reliable.

A

196.9 188.2 186 184.6 156.9 139.6

Classes

2. Why regions?

Segmentations

Sanger

Tissue



Adipose





350 Sanger 300

350

Other

Figure 1: Example of a genomic region, (A) shows the array measurements, with grey box Tissue Adipose Other indicating a DMR. (B) shows validation results using Sanger sequencing in a smaller part of the DMR. ●



3. Minimum Description Length principle Our approach is based on the Minimum Description Length (MDL) principle. This is an information theoretic criterion for model selection. It is a mathematical manifestation of Occam’s razor. We try to find the simplest model that can accurately describe the data. To do so, we estimate how concisely the data can be described using a certain probabilistic model. The description length of a model can be expressed as a sum: Description Length = Goodness of Fit + Model Complexity where “Goodness of Fit” is given as negative log likelihood and “Model Complexity” is proportional to the number of parameters in the model. By minimizing the description length, we balance the accuracy of the model and its complexity.

Figure 3: An example region, showing the performance of different algorithms

Lienert F, Wirbelauer C, Som I, Dean A, Mohn F, Schübeler D "Identification of genetic elements that autonomously determine DNA methylation states." (2011) Nature Genetics Jaffe AE, Murakami P, Lee H, Leek JT, Fallin MD, Feinberg AP, Irizarry RA. “Bump hunting to identify differentially methylated regions in epigenetic epidemiology studies. “(2012) Int J Epidemiol. Wang D, Yan L, Hu Q, Sucheston LE, Higgins MJ, Ambrosone CB, Johnson CS, Smiraglia DJ, Liu S. “IMA: an R package for high-throughput analysis of Illumina's 450K Infinium methylation data.“ (2012) Bioinformatics

4. seqlm method - GitHub

Correlated structure of methylation is apparent in the data (Figure 1A) ... Statistical tests have more power to detect differences in longer regions. ... “IMA: an R package for high-throughput analysis of Illumina's 450K Infinium methylation data.

501KB Sizes 0 Downloads 378 Views

Recommend Documents

4 - GitHub
Feb 16, 2016 - devspecs/abi386-4.pdf, which describes the ABI for processors compati- ble with ..... cation software. The state of ..... Generally, a shared object is built with a 0 base virtual .... C++ language semantics, but it does provide some b

4 - GitHub
文件,其效果见husttrans-example.pdf。 1 .... ItalicFont={Adobe Kaiti Std}]{Adobe Song Std}. 41 ... 48. \newCJKfontfamily\FANGSONG{Adobe Fangsong Std}. 11 ...

4 - GitHub
Jun 18, 2016 - 用于生成邮箱地址。如\email{[email protected]}会生成如下效果的地址: ...... \email. 450 \def\email#1{. 451. \href{mailto:#1}{\texttt{#1}}. 452 }.

Chapter 4 - GitHub
The mathematics: A kernel is any function K such that for any u, K(u) ≥ 0, ∫ duK(u)=1 and ∫ uK(u)du = 0. • The idea: a kernel is a nice way to take weighted averages. The kernel function gives the .... The “big-Oh” notation means we have

Project 4 - GitHub
Project 4. Vorticity-Stream F\rnction Method. Due: Mon., Dec. 6, 2010 at 6:00 pm. Consider the incompressible laminar flow in the plane channel shown below.

Open Data publishing method framework - GitHub
Attribution 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/. Build. Assess. Iterate. Certification. Publish!

Manhattan Community District 4 - GitHub
Page 1. 1%. 6%. 12%. 15%. 13%. 3%. 28%. 10%. 2%. 3%. 6%.

Bronx Community District 4 - GitHub
4%. 1%. 1 & 2 Family Bldgs. Multifamily Walk-up. Multifamily Elevator. Mixed Use. Commercial. Industrial. Transportation/Utility. Public/Institutional. Open Space. Parking. Vacant. Other. 748. 1,095. 305. 317. 194. 37. 59. 164. 92. 181. 198. 25. M aj

Queens Community District 4 - GitHub
of residents 5 years or older have limited · English proficiency. Queens. 18%. 27%. NYC. 21%. Queens CD 4 of residents have incomes below the NYCgov poverty threshold. See the federal poverty rate here. NYCgov POVERTY MEASURE11. Queens. 30%. 22%. NYC

lecture 4: linear algebra - GitHub
Inverse and determinant. • AX=I and solve with LU (use inv in linalg). • det A=L00. L11. L22 … (note that Uii. =1) times number of row permutations. • Better to compute ln detA=lnL00. +lnL11. +…

Brooklyn Community District 4 - GitHub
30%. NYC. 23%. Brooklyn CD 4. LIMITED ENGLISH PROFICIENCY4 of residents 5 years or older have limited · English proficiency. Brooklyn. 21%. 25%. NYC. 21%. Brooklyn CD 4 of residents have incomes below the NYCgov poverty threshold. See the federal pov

AIFFD Chapter 4 - Recruitment - GitHub
Some sections build on descriptions from previous sections, so each ... setwd("c:/aaaWork/web/fishR/BookVignettes/AIFFD/") ... fact is best illustrated with a two-way frequency table constructed from the two group factor variables with ..... 10. Year

7*4 «%> ""T *v - GitHub
7*4 «%> ""T *v . ..I &uf>

The identifylayeroption specifies which method to use when ... - GitHub
This widget now has full GUI support for App Builder. I Can't Stress ..... field is e for the ooltip is string u can ge in the atic text. , etc. from the lace the pdf.jpg" ...

An introduction to pplex and the Simplex Method - GitHub
Nov 16, 2012 - include: simple command line interface, visualization (two variables), file input in ... program is brought into the following form, called a dictionary in [2]: ζ. = x + ..... [7] http://campuscgi.princeton.edu/~rvdb/JAVA/pivot/simple

2010 Census Tracts within Bronx Community District 4 - GitHub
Note: Census tracts and community district boundaries are not coterminous. Some census tracts span multiple community districts or park areas. For census tract data visit Census FactFinder at maps.nyc.gov/census. Community. District Boundary. 2010 Ce

Pin Configuration for BLHeliSuite 4-Way Interfaces (4w-if) - GitHub
C2CK 3 D9. PB1. C2CK 3 D5. PD5. C2CK 4 D8. PB0. C2CK 4 D6. PD6. C2CK 5 A2. PC2. C2CK 5 D7. PD7. C2CK 6 A3. PC3. C2CK 6 D8. PB0. C2CK 7 A4.

Brian Kim Resume 4-12-15.pages - GitHub
Apr 12, 2015 - -Helped develop MVP Social Network platform ... Seoul, Korea | One of the largest media and publishing companies in Korea. "Most notably he has an insatiable zeal to learn the latest technical skills, technology ... -Passed final Black