Supporting Scalable and Distributed Data Subsetting and Aggregation in Large-Scale Seismic Data Analysis

Abstract

The ability to query and process very large, terabyte-scale datasets has become a key step in many scientific and engineering applications. In this paper, we describe how two middleware frameworks can be applied in an integrated fashion to provide a scalable and efficient system for executing seismic data analysis on large datasets in a distributed environment. We investigate different strategies for efficient querying of large datasets and parallel implementations of a seismic image reconstruction algorithm. Our results on a state-of-the-art mass storage system coupled with a high-end compute cluster show that our implementation is scalable and can achieve a data processing rate of about 2.9 GB/s, about 70% of the storage platform's maximum application-level raw I/O bandwidth of 4.2 GB/s.

Keywords

Seismic Data Analysis; Data-Driven Applications

