Differential Evolution based bucket indexed data deduplication for big data storage

Abstract

This work optimizes a deduplication system by tuning the key factors of content-defined chunking (CDC), namely how chunk cut-points are declared, and by making fingerprint lookup efficient through bucket-based index partitioning. For chunking, the proposed approach uses a Differential Evolution (DE) algorithm to optimize the Two Thresholds Two Divisors (TTTD) CDC algorithm; the resulting TTTD-P significantly reduces the number of computing operations by using a single dynamic, optimal divisor parameter D together with an optimal threshold value, exploiting the multi-operation nature of TTTD. The DE-based TTTD-P thus maximizes chunking throughput while increasing the deduplication ratio (DR), and the bucket indexing approach reduces the hash-value judgment time needed to identify and declare redundant chunks, making it about 16 times faster than Rabin CDC, 5 times faster than Asymmetric Extremum (AE) CDC, and 1.6 times faster than FastCDC. A comparative analysis of the experimental results shows that TTTD-P, using the fast BUZ rolling hash function with bucket indexing on the Hadoop Distributed File System (HDFS), provides comparatively maximal redundancy detection with higher throughput, a higher deduplication ratio, less computation time, and very low hash-value comparison time, making it the best-performing distributed deduplication approach among those compared for big data storage systems.
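To make the two ideas in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes a simple polynomial rolling hash in place of the BUZ hash, fixed illustrative values for the thresholds and the single divisor (the paper tunes these with DE), SHA-1 as the chunk fingerprint, and an in-memory bucket index partitioned by the first fingerprint byte. All names (chunk, BucketIndex, deduplicate) and parameter values are hypothetical.

```python
import hashlib

BASE, MOD = 257, 1 << 32  # rolling-hash constants (illustrative assumptions)


def chunk(data, t_min=64, t_max=1024, divisor=127, window=16):
    """Split bytes at content-defined cut-points bounded by two thresholds.

    A cut-point is declared when the rolling hash of the last `window`
    bytes satisfies h % divisor == divisor - 1 and the chunk has at least
    t_min bytes; a cut is forced once the chunk reaches t_max bytes.
    """
    top = pow(BASE, window - 1, MOD)  # weight of the oldest byte in the window
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        length = i - start + 1
        if length > window:
            # Slide the window: drop the contribution of the oldest byte.
            h = (h - data[i - window] * top) % MOD
        h = (h * BASE + byte) % MOD
        at_cut = length >= t_min and h % divisor == divisor - 1
        if at_cut or length >= t_max:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0  # restart the hash for the next chunk
    if start < len(data):
        chunks.append(data[start:])  # trailing chunk, may be shorter than t_min
    return chunks


class BucketIndex:
    """Fingerprint index partitioned into buckets by the first digest byte."""

    def __init__(self, n_buckets=256):
        self.n_buckets = n_buckets
        self.buckets = [set() for _ in range(n_buckets)]

    def add_if_new(self, fingerprint):
        """Return True if the fingerprint was unseen; only one bucket is searched."""
        bucket = self.buckets[fingerprint[0] % self.n_buckets]
        if fingerprint in bucket:
            return False
        bucket.add(fingerprint)
        return True


def deduplicate(data, index):
    """Chunk the data and return (total_chunks, unique_chunks) against the index."""
    total = unique = 0
    for part in chunk(data):
        total += 1
        if index.add_if_new(hashlib.sha1(part).digest()):
            unique += 1
    return total, unique
```

Calling deduplicate twice on the same input with the same BucketIndex reports zero unique chunks on the second pass, since every fingerprint is already present in its bucket. In a distributed setting such as HDFS, each bucket could be held by a different node so that a lookup touches only one partition; that mapping is an assumption of this sketch rather than a detail stated in the abstract.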

Keywords

Big data, data deduplication, content defined chunking, Differential Evolution, TTTD, HDFS

