Data deduplication is widely used to reduce the size of backup workloads, but it has the known disadvantage of causing poor data locality, also referred to as the fragmentation problem. This results from the gap between the hyper-dimensional structure of deduplicated data and the sequential nature of many storage devices, and it leads to poor restore and garbage collection (GC) performance. Current research has considered writing duplicates to maintain locality (e.g., rewriting) or caching data in memory or SSD, but fragmentation continues to lower restore and GC performance. Investigating the locality issue, we design a method to flatten the hyper-dimensional structured deduplicated data to a 1-dimensional format, based on a classification of each chunk's lifecycle, and this creates our proposed data layout. Furthermore, we present a novel management-friendly deduplication framework, called MFDedup, that applies this data layout and maintains locality as much as possible. Specifically, MFDedup uses two key techniques: Neighbor-Duplicate-Focus indexing (NDF) and an Across-Version-Aware Reorganization scheme (AVAR). NDF performs duplicate detection against a previous backup version; AVAR then rearranges chunks with an offline, iterative algorithm into a compact, sequential layout, which nearly eliminates random I/O during file restores after deduplication. Evaluation results with five backup datasets demonstrate that, compared with state-of-the-art techniques, MFDedup achieves deduplication ratios that are 1.12× to 2.19× higher and restore throughputs that are 1.92× to 10.02× faster due to the improved data layout. While the rearranging stage introduces overhead, it is more than offset by a nearly zero-overhead GC process. Moreover, the NDF index only requires indices for two backup versions, while a traditional index grows with the number of versions retained.
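To make the layout idea easier to picture, here is a minimal Python sketch of how I read the two techniques. It is only a sketch under my own assumptions: the names (LifecycleLayoutStore, categories, layout) are mine, everything lives in in-memory dicts, and the real MFDedup operates on containers with an offline rearranging pipeline.

```python
import hashlib
from collections import defaultdict

def fingerprint(chunk: bytes) -> str:
    # Stand-in for content-defined chunking plus a cryptographic fingerprint.
    return hashlib.sha256(chunk).hexdigest()

class LifecycleLayoutStore:
    """Toy versioned dedup store: neighbor-only duplicate detection plus a
    lifecycle-classified, sequential layout (a sketch, not MFDedup itself)."""

    def __init__(self):
        self.neighbor_index = set()   # fingerprints of the previous backup only
        self.recipes = []             # recipes[v] = ordered fingerprints of backup v
        self.payload = {}             # fingerprint -> chunk bytes

    def backup(self, chunks):
        """Ingest one backup version, checking duplicates only against its neighbor."""
        recipe = []
        for c in chunks:
            f = fingerprint(c)
            if f not in self.neighbor_index:
                # Not present in the previous version's index, so store it. A duplicate
                # of some *older* version simply collapses in this dict, which stands in
                # for the merging a real system would do during offline reorganization.
                self.payload[f] = c
            recipe.append(f)
        self.recipes.append(recipe)
        self.neighbor_index = set(recipe)   # only the neighbor's fingerprints are kept
        return len(self.recipes) - 1

    def categories(self):
        """Classify every chunk by its lifecycle: (first, last) backup version using it."""
        first, last = {}, {}
        for v, recipe in enumerate(self.recipes):
            for f in recipe:
                first.setdefault(f, v)
                last[f] = v
        cats = defaultdict(list)
        for f in first:
            cats[(first[f], last[f])].append(f)
        return cats

    def layout(self):
        """Flatten to one dimension: categories still alive in the newest version come
        first, so restoring the newest backup is one sequential scan of a prefix."""
        return sorted(self.categories().items(),
                      key=lambda kv: (-kv[0][1], kv[0][0]))

    def expired_categories(self, oldest_retained):
        """GC in this model is dropping whole categories whose lifecycle ended before retention."""
        return [cat for cat in self.categories() if cat[1] < oldest_retained]
```

The physical rewriting implied by layout() is where the rearranging overhead mentioned in the abstract would come from; the payoff, as the abstract notes, is a nearly zero-overhead GC, which in this toy amounts to dropping whole categories when the oldest backups expire.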
Persistent key-value stores are widely used as building blocks in today's IT infrastructure for managing and storing large amounts of data. However, studies characterizing real-world workloads for key-value stores are limited due to the lack of tracing/analyzing tools and the difficulty of collecting traces in operational environments. In this paper, we first present a detailed characterization of workloads from three typical RocksDB production use cases at Facebook: UDB (a MySQL storage layer for social graph data), ZippyDB (a distributed key-value store), and UP2X (a distributed key-value store for AI/ML services). These characterizations reveal several interesting findings: first, the distribution of key and value sizes is highly related to the use cases/applications; second, accesses to key-value pairs exhibit good locality and follow certain special patterns; and third, the collected performance metrics show a strong diurnal pattern in UDB, but not in the other two. We further discover that although the widely used key-value benchmark YCSB provides various workload configurations and key-value pair access distribution models, the workloads YCSB issues to the underlying storage system are still not close enough to the workloads we collected, because YCSB ignores key-space localities. To address this issue, we propose key-range based modeling and develop a benchmark that can better emulate the workloads of real-world key-value stores. This benchmark can synthetically generate more precise key-value queries that represent the reads and writes of key-value stores to the underlying storage system; the key-range idea is sketched in code at the end of this post.

Data deduplication is an effective way of improving storage space utilization. The data generated by deduplication is persistently stored in data chunks or data containers (a container consisting of a few hundred or a few thousand data chunks). The data restore process is rather slow due to data fragmentation and read amplification. To speed up the restore process, data chunk rewrite schemes (a rewrite stores a duplicate data chunk again) have been proposed to effectively improve data chunk locality and reduce the number of container reads needed to restore the original data. However, rewrites decrease the deduplication ratio, since more storage space is used to store the duplicate data chunks. To remedy this, we focus on reducing the data fragmentation and read amplification of container-based deduplication systems.
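To see where the read amplification comes from, here is a small, self-contained Python model. The container size, the data, and the rewrite_threshold heuristic are all assumptions for illustration and do not correspond to any specific published rewrite scheme; the point is only to count how many distinct containers a restore must touch, and to show how rewriting some duplicate chunks trades deduplication ratio for fewer container reads.

```python
from collections import Counter

CONTAINER_SIZE = 4          # chunks per container (tiny, for illustration)

def write_backup(chunks, container_of, next_container, rewrite_threshold=None):
    """Assign container IDs to a backup's chunks.

    container_of: existing fingerprint -> container mapping from older backups.
    If rewrite_threshold is set, a duplicate chunk is re-written into a fresh
    container when its old container holds fewer than that many chunks of this
    backup (a naive stand-in for real rewrite policies)."""
    # First pass: how much does each old container contribute to this backup?
    contribution = Counter(container_of[c] for c in chunks if c in container_of)

    recipe, open_container, fill = [], next_container, 0
    for c in chunks:
        old = container_of.get(c)
        keep_reference = old is not None and (
            rewrite_threshold is None or contribution[old] >= rewrite_threshold)
        if keep_reference:
            recipe.append(old)                # deduplicated: point at the old container
        else:
            recipe.append(open_container)     # new chunk, or a rewritten duplicate copy
            container_of[c] = open_container
            fill += 1
            if fill == CONTAINER_SIZE:
                open_container, fill = open_container + 1, 0
    return recipe, open_container + (1 if fill else 0)

def container_reads(recipe):
    """Read amplification proxy: distinct containers touched to restore this backup."""
    return len(set(recipe))

# Toy run: backup 2 reuses half of backup 1's chunks, so without rewrites its
# restore touches every container written by backup 1 plus the new ones.
backup1 = [f"a{i}" for i in range(16)]
backup2 = backup1[::2] + [f"b{i}" for i in range(8)]

cmap = {}
r1, nxt = write_backup(backup1, cmap, next_container=0)
r2_no_rw, _ = write_backup(backup2, dict(cmap), next_container=nxt)
r2_rw, _ = write_backup(backup2, dict(cmap), next_container=nxt, rewrite_threshold=3)

print("backup2 container reads without rewrites:", container_reads(r2_no_rw))
print("backup2 container reads with rewrites:   ", container_reads(r2_rw))
```

In this toy run the rewritten variant stores extra duplicate copies of the old chunks (hurting the deduplication ratio) but cuts the container reads for restoring backup 2 from six to four, which is exactly the trade-off the paragraph above describes.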
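Finally, to illustrate the key-range based modeling mentioned in the RocksDB characterization above: instead of drawing keys from a single whole-keyspace distribution the way YCSB's built-in models do, a generator can split the key space into ranges, give each range its own popularity and hot set, and pick a range before picking a key. The sketch below is my reading of that idea, not the authors' benchmark; the range count, Pareto weights, and 90/10 hot-set split are arbitrary choices.

```python
import random

class KeyRangeWorkload:
    """Toy key-range based query generator: locality lives inside ranges,
    not across the whole key space."""

    def __init__(self, num_keys=100_000, num_ranges=100, seed=42):
        self.rng = random.Random(seed)
        self.range_size = num_keys // num_ranges
        # Each key range gets its own weight (how often it is hit at all) and
        # its own small hot set (which keys inside it are hit most).
        self.range_weights = [self.rng.paretovariate(1.2) for _ in range(num_ranges)]
        self.hot_keys = [
            self.rng.sample(range(r * self.range_size, (r + 1) * self.range_size),
                            k=max(1, self.range_size // 100))
            for r in range(num_ranges)
        ]

    def next_key(self):
        # Pick a range by weight first, then a key inside it: 90% from the
        # range's hot set, 10% uniformly from the rest of the range.
        r = self.rng.choices(range(len(self.range_weights)),
                             weights=self.range_weights)[0]
        if self.rng.random() < 0.9:
            return self.rng.choice(self.hot_keys[r])
        return self.rng.randrange(r * self.range_size, (r + 1) * self.range_size)

    def next_query(self, get_ratio=0.8):
        op = "GET" if self.rng.random() < get_ratio else "PUT"
        return op, f"key{self.next_key():08d}"

wl = KeyRangeWorkload()
for _ in range(5):
    print(wl.next_query())
```

Because hot keys cluster inside a few ranges in this toy, the generated reads and writes reach the underlying storage with the kind of key-space locality that the characterization found missing from YCSB-generated workloads.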