Multiple query optimization approach based on Hive+

Abstract

To improve the efficiency of batch query processing on MapReduce, a multiple query optimization approach based on Hive+ is proposed. The approach reduces the number of MapReduce tasks needed for multiple queries, decreases MapReduce task start-up time and fault-tolerance overhead, and improves query efficiency. The TPC-H benchmark is used as the test set for experiments on Hive-0.12. The experiments show that the processing efficiency of batch queries is effectively improved.
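
As a rough illustration of why merging queries can reduce the number of MapReduce tasks, the sketch below groups a batch of queries by the table they scan, so that each group could in principle be served by one shared scan job. This is not the algorithm from the paper, which the abstract does not describe. The `Query` type, the `group_by_shared_scan` and `jobs_saved` helpers, and the sample predicates are all hypothetical, and the one-job-per-query baseline is a simplifying assumption.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple


class Query(NamedTuple):
    """Hypothetical, simplified view of a query: one input table, one filter."""
    name: str
    table: str      # input table scanned by the query
    predicate: str  # filter applied during the scan


def group_by_shared_scan(queries: List[Query]) -> Dict[str, List[Query]]:
    """Group queries that scan the same table (assumed mergeable into one job)."""
    groups: Dict[str, List[Query]] = defaultdict(list)
    for q in queries:
        groups[q.table].append(q)
    return dict(groups)


def jobs_saved(queries: List[Query]) -> int:
    """Scan jobs avoided, assuming one job per query before merging
    and one job per table group after merging."""
    return len(queries) - len(group_by_shared_scan(queries))


# Sample batch over two TPC-H tables; the predicates are illustrative only.
batch = [
    Query("q1", "lineitem", "l_shipdate <= '1998-09-02'"),
    Query("q2", "lineitem", "l_quantity < 24"),
    Query("q3", "orders", "o_orderdate < '1995-03-15'"),
]

groups = group_by_shared_scan(batch)
saved = jobs_saved(batch)
```

In this toy batch, `q1` and `q2` both read `lineitem`, so they fall into the same group. Each job that is avoided also avoids its own start-up and fault-tolerance cost, which is the kind of saving the abstract refers to.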

Keywords

Hive+, multiple query optimization, inter-query flow analysis, MapReduce

