Log-Structured Cache: Trading Hit Rate for Storage Performance (and Winning) in Mobile Devices
Abutalib Aghayev, Peter Desnoyers
Northeastern University
FAST'12 Best Paper Award: Revisiting Storage for Smartphones
● Belief: 3G is the bottleneck; storage is faster than the network.
● Reality: random I/O is unusably slow (0.02 MiB/s).
● Browsing causes random I/O.
● The average web page has 100+ small objects.
● Most of these end up in the LRU/LFU cache, generating random I/O.
Why can't mobile handle random I/O well?
● Flash device = NAND flash + controller.
● Powerful controllers manage flash more efficiently, but they are large, costly, and draw a lot of power.
● Vendors choose eMMC for its low cost, small size, and low power use.
● eMMC was designed for cameras and audio players, not as storage for a general-purpose OS.
● It is likely to remain the choice for mobile devices for some time.
What is the controller for?
● File systems rely on the block interface.
● Flash exports a different interface.
● The controller runs software (the FTL) that adapts the flash interface to the block interface.
[Figure: flash device internals. Source: Zoom Tech Info]
Flash interface and the function of the FTL
● Pages within a block must be written sequentially.
● There is no page overwrite: a page must be erased before it can be written to again.
● The FTL simulates overwrites by maintaining a dynamic mapping between sectors and pages.
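The dynamic sector-to-page mapping can be sketched in a few lines. This is an illustrative model, not eMMC firmware: every write goes to the next free flash page (pages are programmed sequentially), the sector's old page is marked invalid, and the mapping table is updated.

```cpp
#include <unordered_map>
#include <vector>

// Minimal sketch of a page-mapped FTL: overwrites are simulated by
// remapping the sector to a fresh page and invalidating the old one.
class PageMappedFtl {
 public:
  explicit PageMappedFtl(int num_pages) : valid_(num_pages, false) {}

  // Returns the physical page the sector was written to, or -1 if full.
  int Write(int sector) {
    if (next_free_ == static_cast<int>(valid_.size())) return -1;
    auto it = map_.find(sector);
    if (it != map_.end()) valid_[it->second] = false;  // old copy becomes garbage
    int page = next_free_++;                           // pages written in order
    valid_[page] = true;
    map_[sector] = page;
    return page;
  }

  // Returns the physical page currently holding the sector, or -1.
  int Lookup(int sector) const {
    auto it = map_.find(sector);
    return it == map_.end() ? -1 : it->second;
  }

 private:
  std::unordered_map<int, int> map_;  // sector -> physical page
  std::vector<bool> valid_;           // which pages hold live data
  int next_free_ = 0;                 // next sequentially writable page
};
```

Garbage collection of the invalidated pages is deliberately omitted; the point is only how a mapping layer turns overwrites into sequential programs.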
The cost of fixed mapping
Operation timings:
  Read Page    25 µs
  Write Page   250 µs
  Erase Block  2000 µs
● Time to fill a 128-page block sequentially = 128 × 250 µs = 32 ms → bandwidth 31.25 MiB/s.
● Time to overwrite one page under fixed mapping (read the whole block, erase it, write it back) = 128 × 25 µs + 2000 µs + 128 × 250 µs = 37.2 ms → bandwidth 0.21 MiB/s.
FTL found in eMMCs – Block-Associative Sector Translation (BAST)
● Two extremes: block mapping and page mapping.
● Block mapping: low memory usage, bad random I/O.
● Page mapping: high memory usage, handles random writes well.
● BAST:
  – A few page-mapped blocks; the rest are block-mapped.
  – Good for a small number of random writes.
  – Falls back to block mapping as the number of random writes grows.
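The fallback behavior can be sketched as a toy model (names and constants are illustrative, not a real BAST implementation): random writes are absorbed cheaply while the small pool of page-mapped log blocks lasts; once it is exhausted, further random writes force an expensive merge back into a block-mapped block.

```cpp
#include <unordered_set>

// Toy sketch of BAST's two-tier behavior. A "merge" stands in for the
// slow read-whole-block / erase / rewrite path of block mapping.
class BastSketch {
 public:
  explicit BastSketch(int log_block_pool) : free_log_blocks_(log_block_pool) {}

  // Returns true if the write was absorbed cheaply by a page-mapped log
  // block, false if it forced a costly merge (block-mapped fallback).
  bool RandomWrite(int logical_block) {
    if (has_log_block_.count(logical_block)) return true;
    if (free_log_blocks_ > 0) {
      --free_log_blocks_;
      has_log_block_.insert(logical_block);
      return true;
    }
    ++merges_;  // pool exhausted: fall back to the slow path
    return false;
  }

  int merges() const { return merges_; }

 private:
  std::unordered_set<int> has_log_block_;  // blocks owning a log block
  int free_log_blocks_;
  int merges_ = 0;
};
```

With the pool typically holding only a handful of log blocks, a browsing workload full of scattered small writes exhausts it quickly, which is exactly the degradation the slide describes.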
Nexus 7 – Sequential and Random Raw I/O Performance
● Why care about writes?
● Writes end up blocking reads.
● Large sequential writes result in the best performance.
Log-Structured File System
● Makes all writes large and sequential.
● Storage is divided into large segments.
● Writes are buffered in memory, then written to a segment in a single operation.
[Figure: in-memory buffer flushed to segments on storage]
Log-Structured File System (cont.)
[Figure, three animation slides: File A's data block, Inode 5, and the Inode Map entry for inode 5 are appended in order to Segment N.]
Log-Structured File System
● Segment cleaning is hard; no algorithm works well for all workloads.
● Observation: unlike file-system data, cache data is recomputable.
● So we can have large sequential writes and avoid the cleaner too.
[Figure: memory buffer, storage segments, and cleaner]
Chromium Cache Interface
● Two-tiered: cache backend + high-level cache.
● The backend provides an unreliable object store with eviction logic.
● A high-level cache implements specific semantics: HTTP Cache, Media Cache, HTML5 Application Cache.

// Backend class methods
int CreateEntry(const std::string& key, Entry** entry,
                const CompletionCallback& cb);
int OpenEntry(const std::string& key, Entry** entry,
              const CompletionCallback& cb);

// Entry class methods
int ReadData(int stream_index, int stream_offset, IOBuffer* buf,
             int buf_len, const CompletionCallback& cb);
int WriteData(int stream_index, int stream_offset, IOBuffer* buf,
              int buf_len, const CompletionCallback& cb);
Log-Structured Cache – Data Layout
● A large pre-allocated file is divided into fixed-size segments.
● Segment = entries + summary.
● No write buffering: the OS buffer cache already provides it.
● Updating the index causes random I/O, avoided (almost entirely) via mmap(2).
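The layout is simple enough to pin down with a few offset computations. All sizes below are illustrative assumptions, not the cache's real values; the structure (fixed-size segments in one pre-allocated file, summary at each segment's tail) is what the slide describes.

```cpp
#include <cstdint>

// Sketch of the data layout: one pre-allocated cache file carved into
// fixed-size segments, each ending with a summary listing its entries.
// Sizes are assumptions chosen for illustration.
constexpr int64_t kCacheFileSize = int64_t{64} << 20;  // 64 MiB file
constexpr int64_t kSegmentSize = int64_t{2} << 20;     // 2 MiB segments
constexpr int64_t kSummarySize = int64_t{4} << 10;     // 4 KiB summary

constexpr int NumSegments() {
  return static_cast<int>(kCacheFileSize / kSegmentSize);
}
constexpr int64_t SegmentOffset(int i) { return i * kSegmentSize; }
// Entries fill the front of a segment; the summary sits at its tail.
constexpr int64_t SummaryOffset(int i) {
  return SegmentOffset(i) + kSegmentSize - kSummarySize;
}
```

Because segment boundaries are fixed, any entry address maps back to its segment with a single division, which is what lets the cleaner find and parse a segment's summary directly.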
Log-Structured Cache – Operations
● Create is an append.
● Delete is an in-memory operation. Delete is not evict.
● Update copies over all streams:
  – Preserves locality.
  – Cheap, since most objects are small.
  – Large objects are stored in separate files.
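The first two operations can be made concrete with a sketch of the in-memory index (names are illustrative, not Chromium's): Create appends at the log head and records the new entry's address; Delete only drops the index entry, leaving the bytes on storage as garbage for eviction to reclaim later.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>

// Sketch: create-as-append plus memory-only delete.
class LogCacheIndex {
 public:
  // Appends an entry of the given size; returns its log address.
  int64_t Create(const std::string& key, int64_t entry_size) {
    int64_t addr = head_;
    head_ += entry_size;  // the log head only ever moves forward
    index_[key] = addr;
    return addr;
  }

  // Memory-only: the on-storage bytes stay until segment cleaning.
  void Delete(const std::string& key) { index_.erase(key); }

  bool Lookup(const std::string& key, int64_t* addr) const {
    auto it = index_.find(key);
    if (it == index_.end()) return false;
    *addr = it->second;
    return true;
  }

 private:
  std::unordered_map<std::string, int64_t> index_;  // key -> log address
  int64_t head_ = 0;                                // next append offset
};
```

Update fits the same pattern: it is a Create of the fresh copy followed by remapping the key, so the stale copy is detected later by its address mismatch.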
Log-Structured Cache – Segment Cleaning / Eviction
● Segment states: free → current → closed → free.
● Process segments in FIFO order.
● Skip segments with entries open for reading.
● Identify the entries in a segment via its summary.
● For each entry's key, exactly one case is possible:
  – Not in the index: the entry was deleted (no-op).
  – In the index, address mismatch: the entry was updated (no-op).
  – In the index, address match: evict the entry (update the index).
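The three cases translate directly into code. A sketch (types and names are illustrative): the index maps a key to the log address of its live copy, and the summary gives each entry's key together with the address it was written at.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>

enum class CleanAction { kSkipDeleted, kSkipUpdated, kEvict };

// Classifies one summary entry during cleaning of its segment.
CleanAction Classify(const std::unordered_map<std::string, int64_t>& index,
                     const std::string& key, int64_t summary_addr) {
  auto it = index.find(key);
  if (it == index.end())
    return CleanAction::kSkipDeleted;  // deleted earlier: nothing to do
  if (it->second != summary_addr)
    return CleanAction::kSkipUpdated;  // a newer copy lives elsewhere
  return CleanAction::kEvict;          // still live here: drop from index
}
```

Note why no data is ever copied: in every case the segment can simply be erased afterward, which is exactly how recomputable cache data lets cleaning double as eviction.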
Evaluation – Data Collection
● Users browsed with a modified Chromium for 2 months.
● Cache operations were logged in full detail.
● The traces were used to re-create browsing sessions as accurately as possible.
Evaluation – I/O Performance
● Sample traces replayed on a Nexus 7.
● Inter-operation delays preserved.
● 90% of writes complete in under 0.5 ms, vs. 6.9 ms for the existing cache.
● Mean write: 0.2 ms vs. 6.4 ms.
● Max write: 28.4 ms vs. 3351 ms.
Evaluation – Hit Rate
● Upper bound established with an infinite-cache simulation.
● Worst-case increase in miss rate is 3%.
Source: http://httparchive.org
Quantifying the Impact
● Mean operation latency:

    T_op = T_read · R_hit + T_fetch · (1 − R_hit)
    T_fetch = 8 · S_obj / B

● For a mean object size of 4 KiB, the two caches' mean latencies match (23.61 ms) at B = 0.94 Mbit/s.
● For the 95th percentile, the threshold is 0.33 Mbit/s.
● At higher bandwidths, the log-structured cache has lower mean latency.
Thank you! Questions?
● Ongoing alpha-release implementation at:
  http://src.chromium.org/viewvc/chrome/trunk/src/net/disk_cache/flash/