MARS: Replicating Petabytes over Long Distances

GUUG 2016 Presentation by Thomas Schöbel-Theuer

Replicating Petabytes: Agenda

- Long Distances: Block Level vs FS Level
- Long Distances: Big Cluster vs Sharding
- Use Cases DRBD vs MARS Light
- MARS Working Principle
- Behaviour at Network Bottlenecks
- Multinode Metadata Propagation (Lamport Clock)
- Example Scenario with 4 Nodes
- Current Status / Future Plans

Replication at Block Level vs FS Level

Layers from top to bottom, with the potential cut points for a distributed system between them:

Userspace
  Application Layer: Apache, PHP, Mail Queues, etc

  Potential Cut Point A for Distributed System
    ~ 25 Operation Types, ~ 100.000 Ops / s

Kernelspace
  Filesystem Layer: xfs, ext4, btrfs, zfs, … vs nfs, Ceph, Swift, ...

  Potential Cut Point B for Distributed System
    DSM = Distributed Shared Memory => Cache Coherence Problem!

  Caching Layer: Page Cache, dentry Cache, ... (1:100 reduction)

  Potential Cut Point C for Distributed System
    2 Operation Types (r/w), ~ 1.000 Ops / s
    ++ replication of VMs for free!

  Block Layer: LVM, DRBD / MARS

Hardware
  Hardware-RAID, BBU, ...

Scaling Architectures (1): Big Cluster vs Sharding

[Figure: User 1 ... User 999999 connect over the Internet, O(n*k), to Frontend 1 ... Frontend 999. The frontends connect over an internal storage (or FS) network, O(n^2), to Storage 1 ... Storage 999. x2 for geo-redundancy.]

Scaling Architectures (2): Big Cluster vs Sharding

[Figure: User 1 ... User 999999 connect over the Internet, O(n*k), to Storage + Frontend 1 ... Storage + Frontend 999. x2 for geo-redundancy.]

++ local scalability: spare RAID slots, ...
+++ big scale out +++
=> method scales to petabytes

Use Cases DRBD vs MARS Light

DRBD (GPL)

Application area:
- Distances: short ( <50 km )
- Synchronously
- Needs reliable network
  ● “RAID-1 over network”
  ● best with crossover cables
- Short inconsistencies during re-sync
- Under pressure: long or even permanent inconsistencies possible
- Low space overhead

MARS Light (GPL)

Application area:
- Distances: any ( >>50 km )
- Asynchronously
  ● near-synchronous modes in preparation
- Tolerates unreliable network
- Anytime consistency
  ● no re-sync
- Under pressure: no inconsistency
  ● possibly at cost of actuality
- Needs >= 100GB in /mars/ for transaction logfiles
  ● dedicated spindle(s) recommended
  ● RAID with BBU recommended

MARS Working Principle

Multiversion Asynchronous Replicated Storage

Datacenter A (primary): /dev/mars/mydata, mars.ko, /dev/lvx/mydata, /mars/translogfile
Datacenter B (secondary): mars.ko, /mars/translogfile, /dev/lvx/mydata

Similar to MySQL replication
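
The working principle above (writes recorded in a transaction log on the primary, the log shipped and replayed on the secondary) can be illustrated with a minimal Python sketch. This is a toy model, not the mars.ko implementation: the classes `LogRecord`, `Primary` and `Secondary` and all their fields are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class LogRecord:
    seq: int        # position in the transaction log
    offset: int     # byte offset on the block device
    data: bytes     # payload of the write


@dataclass
class Primary:
    disk: bytearray
    log: list = field(default_factory=list)

    def write(self, offset: int, data: bytes) -> None:
        # Record the write in the transaction log, then apply it locally.
        self.log.append(LogRecord(len(self.log), offset, data))
        self.disk[offset:offset + len(data)] = data


@dataclass
class Secondary:
    disk: bytearray
    applied: int = 0   # number of log records already replayed

    def replay(self, records: list) -> None:
        # Apply records strictly in log order; skip anything already applied.
        for rec in records:
            if rec.seq != self.applied:
                continue
            self.disk[rec.offset:rec.offset + len(rec.data)] = rec.data
            self.applied += 1


primary = Primary(bytearray(16))
secondary = Secondary(bytearray(16))
primary.write(0, b"abcd")
primary.write(2, b"XY")

secondary.replay(primary.log[:1])    # only part of the log has arrived so far
lagging_state = bytes(secondary.disk[:4])
secondary.replay(primary.log)        # the rest arrives later
```

In this model the secondary may lag behind, but it always holds a state the primary once had, because records are replayed in their original order.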

Network Bottlenecks (1) DRBD

[Figure: throughput over time. Curves: network throughput; DRBD throughput; wanted application throughput, not possible; additional throughput needed for re-sync, not possible. Events: automatic disconnect, automatic re-connect. Below the plot: mirror inconsistency ...]

Permanently inconsistent!

Network Bottlenecks (2) MARS

[Figure: throughput over time. Curves: network throughput; MARS application throughput, recorded in transaction log.]

Best possible throughput behaviour at information theoretic limit
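
To make the bottleneck behaviour concrete, here is a small Python simulation under simplifying assumptions: time is split into discrete steps, every application write lands in the transaction log, and the network ships as much of the backlog as its current capacity allows. The function name `simulate_backlog` and the numbers are hypothetical and not taken from the talk.

```python
def simulate_backlog(app_rate, net_rate):
    """Return the log backlog after each time step.

    app_rate[t]: data written by the application in step t
    net_rate[t]: data the network can ship in step t
    """
    backlog = 0
    history = []
    for written, capacity in zip(app_rate, net_rate):
        backlog += written                 # every write is recorded in the log
        shipped = min(backlog, capacity)   # ship as much as the link allows
        backlog -= shipped
        history.append(backlog)
    return history


app = [10, 10, 30, 30, 10, 0, 0]   # application load with a burst
net = [20, 20, 20, 5, 20, 20, 20]  # network capacity with a dip in step 3
backlog = simulate_backlog(app, net)
print(backlog)
```

In this model the application is never throttled by the network. The backlog grows during the burst and the dip, and drains again at full link speed afterwards. The secondary is behind in time while the backlog is non-zero.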

Network Bottlenecks (3) MARS

[Figure: throughput over time with a flaky throughput limit. Curves: network throughput; MARS application throughput; MARS network throughput. Below the plot: corresponding DRBD inconsistency.]

Best possible throughput behaviour

Metadata Propagation (1)

Problems for ≥ 3 nodes:
– simultaneous updates
– races

[Figure: Host A (primary), Host B (secondary), Host C (secondary)]

Solution: symlink tree + Lamport Clock => next slides

Metadata Propagation (2)

symlink tree = key->value store
Originator context encoded in key

Host A (primary):   /mars/resource-mydata/sizehostA -> 1000
Host B (secondary): /mars/resource-mydata/sizehostA -> oldvalue
Host C (secondary): /mars/resource-mydata/sizehostA -> oldvalue

Every host learns everything about the others, but later

Metadata Propagation (3)

Lamport Clock = virtual timestamp
Propagation never goes backwards!

Host A (primary):   /mars/resource-mydata/sizehostA -> 1000
Host B (secondary): /mars/resource-mydata/sizehostA -> veryveryoldvalue
Host C (secondary): /mars/resource-mydata/sizehostA -> 1000

Races are compensated
Propagation paths play no role
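
The metadata propagation idea can be sketched in Python as a key->value store whose entries carry a Lamport timestamp. This is a minimal sketch, not the MARS symlink code: the `Node` class and its methods are hypothetical, and the store is an in-memory dict rather than a symlink tree.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.clock = 0      # Lamport clock of this node
        self.store = {}     # key -> (timestamp, value)

    def set(self, key, value):
        # Local update: advance the clock and stamp the new value.
        self.clock += 1
        self.store[key] = (self.clock, value)

    def receive(self, other_store, other_clock):
        # Lamport rule: never fall behind a clock value we have seen.
        self.clock = max(self.clock, other_clock)
        for key, (stamp, value) in other_store.items():
            current = self.store.get(key)
            # Only strictly newer entries win, so old data never overwrites.
            if current is None or stamp > current[0]:
                self.store[key] = (stamp, value)


a, b, c = Node("A"), Node("B"), Node("C")
key = "/mars/resource-mydata/sizehostA"

a.set(key, "oldvalue")
b.receive(a.store, a.clock)
stale_copy = dict(b.store)       # a message that will be delivered late

a.set(key, "1000")
c.receive(a.store, a.clock)      # C hears the new value directly from A
c.receive(stale_copy, 1)         # the delayed old message arrives afterwards
b.receive(c.store, c.clock)      # B learns the new value via C, not via A
```

Because the originator is encoded in the key, each key has a single writer in this model, so two different values never carry the same timestamp. The late message cannot roll C back, and it does not matter that B gets the update via C instead of A.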

Productive Scenario since 02/2014 (1&1 eShop / ePages)

[Figure: Datacenter A and Datacenter B, connected by georedundancy (BGP).
Datacenter A: AppCluster A1 (primary) and AppCluster A2 (secondary), room-to-room.
Datacenter B: AppCluster B1 (secondary) and AppCluster B2 (secondary), room-to-room.
Legend: potential data flow, actual data flow (in this scenario).]

Current Status

- Source / docs at github.com/schoebel/mars
  mars-manual.pdf ~ 100 pages
- light0.1stable productive on customer data since 02/2014
- MARS status Feb 2016:
  > 1700 servers (shared hosting + databases)
  > 2x8 Petabyte total
  ~ 10 billion inodes in > 3000 xfs instances
  > 8 million operating hours
- Socket Bundling (light0.2beta)
  Up to 8 parallel TCP connections per resource
  easily saturates 1GBit uplink between Karlsruhe/Europe and Lenexa/USA
- WIP-remote-device
  /dev/mars/mydata can appear anywhere
- WIP-compatibility:
  no kernel prepatch needed anymore
  currently tested with vanilla kernels 3.2 … 4.4
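
The slides do not say why socket bundling helps on the Europe/USA link. One well-known effect on long links is that a single TCP connection can have at most one window of data in flight per round trip, so its throughput is bounded by window / RTT. The Python sketch below works through that bound. The round-trip time and window size are hypothetical values chosen for illustration, not measurements from the talk, and the function names are made up.

```python
import math


def per_connection_limit_bits(window_bytes, rtt_seconds):
    # A single TCP connection sends at most one window per round trip.
    return window_bytes * 8 / rtt_seconds


def connections_needed(link_bits, window_bytes, rtt_seconds):
    # Number of parallel connections needed to fill the link.
    limit = per_connection_limit_bits(window_bytes, rtt_seconds)
    return math.ceil(link_bits / limit)


# Hypothetical numbers, NOT measurements from the talk:
RTT = 0.125                 # assumed round-trip time in seconds
WINDOW = 2 * 1024 * 1024    # assumed effective TCP window in bytes
LINK = 1_000_000_000        # 1 GBit/s uplink, as on the slide

single = per_connection_limit_bits(WINDOW, RTT)
needed = connections_needed(LINK, WINDOW, RTT)
print(f"{single / 1e6:.0f} MBit/s per connection, {needed} connections")
```

With other window sizes or round-trip times the result changes accordingly. The sketch only shows the shape of the calculation.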

Future Plans

- md5 checksums on underlying disks
- Mass-scale clustering
- Database support / near-synchronous modes
- Further challenges:
  – community revision at LKML planned
  – replace symlink tree with better representation
  – split into 3 parts:
    ● Generic brick framework
    ● XIO / AIO personality (1st citizen)
    ● MARS Light (1st application)

hopefully attractive for other developers!

Appendix


Use Cases DRBD+proxy vs MARS Light

DRBD+proxy (proprietary)

Application area:
- Distances: any
- Asynchronously
  ● Buffering in RAM
- Unreliable network leads to frequent re-syncs
  ● RAM buffer gets lost
  ● at cost of actuality
- Long inconsistencies during re-sync
- Under pressure: permanent inconsistency possible
- High memory overhead
- Difficult scaling to k>2 nodes

MARS Light (GPL)

Application area:
- Distances: any ( >>50 km )
- Asynchronously
  ● near-synchronous modes in preparation
- Tolerates unreliable network
- Anytime consistency
  ● no re-sync
- Under pressure: no inconsistency
  ● possibly at cost of actuality
- Needs >= 100GB in /mars/ for transaction logfiles
  ● dedicated spindle(s) recommended
  ● RAID with BBU recommended
- Easy scaling to k>2 nodes

DRBD+proxy Architectural Challenge

[Figure: DRBD Host A (primary) with bitmap A, Proxy A', Proxy B', DRBD Host B (secondary) with bitmap B. Between the proxies: data queue path (several GB buffered) and completion path (commit messages). Annotations: A != A' possible; (essentially unused); huge RAM buffer; same sector #8 occurs n times in queue.]

n times => need log(n) bits for counter
=> but DRBD bitmap has only 1 bit/sector
=> workarounds exist, but complicated (e.g. additional dynamic memory)
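
The counting argument above can be illustrated with a short Python sketch. It is a deliberately simplified model, not DRBD's actual bitmap handling: a set stands in for a 1-bit-per-sector bitmap, a `Counter` stands in for a per-sector counter, and the queue contents are made up.

```python
from collections import Counter

queue = [8, 3, 8, 8, 5, 8]   # sector numbers waiting in the data queue

bitmap = set()               # 1 bit per sector: "dirty" or "clean"
in_flight = Counter()        # counter per sector: writes still queued

for sector in queue:
    bitmap.add(sector)
    in_flight[sector] += 1

# Bits needed to count the n = 4 queued writes to sector 8.
bits_for_counter = in_flight[8].bit_length()

# The first write to sector 8 is confirmed via the completion path.
completed = 8
bitmap.discard(completed)    # the single bit is cleared ...
in_flight[completed] -= 1    # ... while the counter still tracks the rest

bitmap_says_clean = completed not in bitmap
writes_still_queued = in_flight[completed]
```

After one completion the 1-bit model reports sector 8 as clean although three writes to it are still queued. Only the counter keeps track, and it needs on the order of log(n) bits per sector.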

MARS Light Data Flow Principle

Host A (primary):
  /dev/mars/mydata -> Transaction Logger
  Transaction Logger -> append -> /mars/resource-mydata/log-00001-hostA
  Temporary Memory Buffer -> writeback in background -> /dev/lv-x/mydata
  Logfile Replicator -> long-distance transfer -> Host B

Host B (secondary):
  /mars/resource-mydata/log-00001-hostA -> Logfile Applicator -> /dev/lvx/mydata
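
The primary-side part of this data flow (append to the logfile, keep the data in a temporary memory buffer, write back in the background) can be sketched in Python as follows. The class `PrimaryDevice` is hypothetical. That reads are served from the buffer before writeback is an assumption of this sketch, not something stated on the slide.

```python
class PrimaryDevice:
    def __init__(self, size):
        self.logfile = []            # stands in for the transaction logfile
        self.buffer = {}             # temporary memory buffer: position -> byte
        self.disk = bytearray(size)  # stands in for the underlying device

    def write(self, offset, data):
        # 1. Append the write to the transaction log.
        self.logfile.append((offset, data))
        # 2. Keep the new data in memory until it is written back.
        for i, byte in enumerate(data):
            self.buffer[offset + i] = byte

    def read(self, offset, length):
        # Assumption: buffered data takes precedence over stale disk content.
        return bytes(self.buffer.get(offset + i, self.disk[offset + i])
                     for i in range(length))

    def writeback(self):
        # Background step: flush buffered bytes to the underlying device.
        for pos, byte in self.buffer.items():
            self.disk[pos] = byte
        self.buffer.clear()


dev = PrimaryDevice(8)
dev.write(0, b"hi")
before = (dev.read(0, 2), bytes(dev.disk[:2]))   # disk not yet updated
dev.writeback()
after = (dev.read(0, 2), bytes(dev.disk[:2]))    # disk now matches
```

The logfile is what a replicator would ship to the secondary. The local disk can catch up independently of the long-distance transfer.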

Framework Architecture
for MARS + future projects

Layers from top to bottom:
- External Software, Cluster Managers, etc
- Userspace Interface: marsadm
- Framework Application Layer: MARS Light, MARS Full, etc
- Framework Personalities: XIO = eXtended IO ≈ AIO
- Generic Brick Layer: IOP = Instance Oriented Programming + AOP = Aspect Oriented Programming

Boxes shown in the figure: MARS Light, MARS Full (future), Strategy bricks, XIO bricks, Generic Bricks, Generic Objects, Generic Aspects, ... other future Personalities and their bricks

Bricks, Objects + Aspects (Example)

[Figure: bricks xio_if, xio_trans_logger, xio_bio and xio_aio connect /dev/mars/mydata, /dev/lv-x/mydata and /mars/resource/mydata/log-001. An xio_object carries + xio_if_aspect, + xio_trans_logger_aspect, + xio_bio_aspect, + xio_aio_aspect.]

Aspects are automatically attached on the fly

Appendix: 1&1 Wide Area Network Infrastructure

- Global external bandwidth > 285 GBit/s
- Peering with the biggest internet exchanges in the world
- Own metro networks (DWDM) at the 1&1 datacenter locations

® 1&1 Internet AG 2012

IO Latencies over loaded Metro Network (1) DRBD

[Figure: red = write latency, blue = read latency]

Load = ~30.000 IOPS on 50 spindles RAID-6 (7x shared-derived from blkreplay.org)
MARS LCA2014 Presentation by Thomas Schöbel-Theuer

IO Latencies over loaded Metro Network (2) MARS

[Figure: red = write latency, blue = read latency]

Same load as before, same conditions

Performance of Socket Bundling Europe↔USA

MARS FROSCON 2015 Presentation by Thomas Schöbel-Theuer
