Handling Imbalance Data in Reduce task of MapReduce in Cloud Environment

Chetana Tukkoji, Seetharam K

Abstract


There is a growing need for ad hoc analysis of extremely large data sets, especially at web-based companies where innovation critically depends on being able to analyze terabytes of data collected every day. Parallel database products offer a solution, but are usually prohibitively expensive at this scale. Moreover, most of the people who analyze this data are procedural programmers. The success of the more procedural MapReduce programming model, and of its associated scalable implementations on low-cost commodity hardware, is evidence of this. However, the MapReduce paradigm is too low-level and rigid, and leads to a great deal of custom user code that is hard to maintain and reuse. Even so, MapReduce is an effective tool for parallel data processing. One significant issue in practical MapReduce applications is data skew: an imbalance in the amount of data assigned to each task causes some tasks to take much longer to finish than others. We propose a framework to solve the data skew problem for reduce-side applications in MapReduce. It uses an innovative sampling method that achieves an accurate approximation of the distribution of the intermediate data by sampling only a small fraction of that data. It also does not prevent the overlap between the map and reduce stages.
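
The idea of sampling intermediate data to balance reduce tasks can be illustrated with a minimal Python sketch. This is not the framework proposed in the paper; it only assumes that a small random sample of intermediate keys is enough to estimate each key's record count, and that keys are then assigned greedily (largest first) to the least-loaded reducer. The function names, the 5% sampling fraction, the synthetic Zipf-like data, and the modulo fallback for unsampled keys are all hypothetical choices made for illustration.

```python
import heapq
import random
from collections import Counter


def sample_key_counts(pairs, fraction, rng):
    """Estimate per-key record counts from a random sample of intermediate pairs."""
    sampled = Counter(key for key, _ in pairs if rng.random() < fraction)
    # Scale sampled counts back up to approximate the full distribution.
    return {key: count / fraction for key, count in sampled.items()}


def build_partition(estimated_counts, num_reducers):
    """Assign keys to reducers, largest estimated key first, least-loaded reducer first."""
    heap = [(0.0, reducer) for reducer in range(num_reducers)]
    heapq.heapify(heap)
    assignment = {}
    ordered = sorted(estimated_counts.items(), key=lambda kv: (-kv[1], kv[0]))
    for key, estimate in ordered:
        load, reducer = heapq.heappop(heap)
        assignment[key] = reducer
        heapq.heappush(heap, (load + estimate, reducer))
    return assignment


def reducer_loads(pairs, num_reducers, partition_fn):
    """Count how many intermediate records each reducer would receive."""
    loads = [0] * num_reducers
    for key, _ in pairs:
        loads[partition_fn(key)] += 1
    return loads


# Synthetic skewed intermediate data (assumption): key k occurs about 10000/(k+1) times.
NUM_REDUCERS = 4
pairs = [(k, 1) for k in range(50) for _ in range(10000 // (k + 1))]

estimates = sample_key_counts(pairs, fraction=0.05, rng=random.Random(0))
assignment = build_partition(estimates, NUM_REDUCERS)

# Baseline: plain modulo (hash-style) partitioning of integer keys.
hashed = reducer_loads(pairs, NUM_REDUCERS, lambda k: k % NUM_REDUCERS)
# Sample-driven partitioning; keys missed by the sample fall back to modulo.
balanced = reducer_loads(
    pairs, NUM_REDUCERS, lambda k: assignment.get(k, k % NUM_REDUCERS)
)

print("hash partition loads:    ", hashed)
print("sampled partition loads: ", balanced)
```

On this synthetic input the modulo baseline places the heaviest key together with many others on one reducer, while the sample-driven assignment spreads the remaining keys across the other reducers, so the most heavily loaded reducer receives fewer records.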


DOI: https://doi.org/10.23956/ijarcsse.v7i11.498


© International Journals of Advanced Research in Computer Science and Software Engineering (IJARCSSE)| All Rights Reserved | Powered by Advance Academic Publisher.