Data Sharing Made Easier through Programmable Metadata

Zhe Zhang, IBM Research
Remzi Arpaci-Dusseau, University of Wisconsin-Madison

How do applications share data today?
– Syncing data between storage systems:
  • Commonly used big data workflow
  • Slow, stale and strenuous

[Figure: primary data (transactions, emails, logs, etc.) synced to a cloud analytics cluster and an in-house analytics cluster]

– Mounting and using shared storage systems:
  • Difficult to serve heterogeneous workloads
  • Heavy workload on centralized name nodes

Observations
– Data always written and read through the same storage system (filesystem, DB, etc.)
  • Metadata updated with writes
  • Metadata used in reads
– Data produced in form A and consumed in form B?
  • View DB records as a file?
  • Analyze thousands of local log files as a single text file?

Programming the Metadata
– Logical definition
  [Figure: a file made of segment 1, segment 2, and segment 3, with sources labeled Source file 1, Source file 2, and Source DB table]
– Under the hood
  [Figure: per-file metadata (length, atime, mtime, …, block pointers) and data blocks for Source file 1 and Source file 2]
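The logical definition above can be sketched in a few lines. This is only an illustration, not the authors' implementation: the `Segment` and `ComposedFile` names are hypothetical, and sources are modeled as in-memory byte strings rather than real files, blocks, or DB tables. The point it shows is that the composed file stores no data of its own; a read is resolved through segment metadata into the sources.

```python
from bisect import bisect_right
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    """One piece of a composed file: a byte range inside some source."""
    source: bytes  # stand-in for a source file or DB table (assumption)
    offset: int    # where the range starts inside the source
    length: int    # how many bytes the range covers


class ComposedFile:
    """Read-only logical file defined purely by segment metadata."""

    def __init__(self, segments: List[Segment]):
        self.segments = segments
        # starts[i] is the logical offset at which segment i begins.
        self.starts: List[int] = []
        total = 0
        for seg in segments:
            self.starts.append(total)
            total += seg.length
        self.length = total

    def read(self, pos: int, size: int) -> bytes:
        """Return up to `size` bytes starting at logical offset `pos`."""
        if pos < 0 or size < 0:
            raise ValueError("pos and size must be non-negative")
        out = bytearray()
        end = min(pos + size, self.length)
        while pos < end:
            # Find the segment that contains logical offset `pos`.
            i = bisect_right(self.starts, pos) - 1
            seg = self.segments[i]
            within = pos - self.starts[i]
            take = min(seg.length - within, end - pos)
            start = seg.offset + within
            out += seg.source[start:start + take]
            pos += take
        return bytes(out)
```

For example, a file composed of the first line of one source, the second line of another, and a whole table export reads back as a single contiguous byte stream, even though nothing was copied when it was defined.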

Challenges § API challenge: identification / namespace of source data – How to define a file in VM1 to include a source file in VM2? – Granularity-based source file selection: 1 out of 10 lines of text? – Content-based source file selection: all lines containing certain keyword? – Arbitrary “SELECT * FROM * WHERE *” in source DB tables?

§ Performance challenge: frequent metadata updates

Layers          Example of Listeners
Applications    • Map to destination file if keyword matches
                • Map every 1 line out of 10 lines of text to destination file
VFS             • Map entire file to destination file
                • Map every 1MB out of 10MB to destination file
Block storage   • All VFS listeners can be implemented on block layer with a reverse pointer from block to inode
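The listener examples in the table can be illustrated with a small sketch. The slides do not define a listener API, so everything here is an assumption: a listener is modeled as a predicate over `(line_number, line)` for the application layer, and `every_nth_extent` is a hypothetical helper that computes byte extents for the VFS-style "every 1MB out of 10MB" rule.

```python
from typing import Callable, Iterable, List, Tuple

# Hypothetical interface: a listener decides whether a source line
# should be mapped into the destination file.
Listener = Callable[[int, str], bool]


def keyword_listener(keyword: str) -> Listener:
    """Application-layer rule: map a line if it contains `keyword`."""
    return lambda _line_no, line: keyword in line


def every_nth_line_listener(n: int) -> Listener:
    """Application-layer rule: map 1 line out of every `n` (lines 0, n, 2n, ...)."""
    return lambda line_no, _line: line_no % n == 0


def compose(lines: Iterable[str], listener: Listener) -> List[str]:
    """Collect the source lines that the listener maps to the destination."""
    return [line for i, line in enumerate(lines) if listener(i, line)]


def every_nth_extent(file_size: int, chunk: int, stride: int) -> List[Tuple[int, int]]:
    """VFS-style rule: (offset, length) extents taking `chunk` bytes out of
    every `stride` bytes, truncated at the end of the file."""
    return [(off, min(chunk, file_size - off)) for off in range(0, file_size, stride)]
```

A content-based rule needs to see the data, which is why the keyword listener sits at the application layer here, while the extent rule only needs file offsets and could in principle be evaluated lower in the stack.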

Use Case: Distributed Live Analytics
§ hadoop dfs -composeFromLocal
§ Configuration file
    slave1:/opt/IBM/*/*.log
    slave2:/var/*.log
    …
§ Challenges
  – Informing NameNode of local file size changes
  – Balancing workload
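As a rough illustration of the configuration file format shown above, the sketch below parses `host:pattern` lines and matches them against per-host file listings. The slides do not specify how the patterns are interpreted; this sketch assumes shell-style wildcards in which `*` does not cross a `/`. The names `parse_config`, `path_matches`, and `match_sources` are hypothetical and are not part of Hadoop.

```python
from fnmatch import fnmatchcase
from typing import Dict, List, Tuple


def parse_config(text: str) -> List[Tuple[str, str]]:
    """Parse lines of the form 'host:/path/pattern' into (host, pattern) pairs."""
    entries = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments (assumed convention)
        host, sep, pattern = line.partition(":")
        if not sep or not host or not pattern:
            raise ValueError(f"malformed config line: {raw!r}")
        entries.append((host, pattern))
    return entries


def path_matches(pattern: str, path: str) -> bool:
    """Match component by component so that '*' never spans a '/'."""
    pattern_parts = pattern.split("/")
    path_parts = path.split("/")
    if len(pattern_parts) != len(path_parts):
        return False
    return all(fnmatchcase(part, pat) for pat, part in zip(pattern_parts, path_parts))


def match_sources(entries: List[Tuple[str, str]],
                  files_by_host: Dict[str, List[str]]) -> List[Tuple[str, str]]:
    """Return (host, path) for every listed file selected by some config entry."""
    selected = []
    for host, pattern in entries:
        for path in files_by_host.get(host, []):
            if path_matches(pattern, path):
                selected.append((host, path))
    return selected
```

In a real deployment each slave would expand its own patterns locally; passing file listings in as a dictionary simply keeps the sketch self-contained.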

[Figure: Email Server VMs and Web Server VMs at sites labeled New York, San Jose, Raleigh, and Dallas; composed files EmailServerLogs and EasternCoastLogs; consumers labeled MapReduce, Stream, etc.]
