MapReduce is a new parallel programming model initially developed for large-scale web content processing. Data analysis faces the problem of how to compute over extremely large datasets, and the arrival of MapReduce makes it possible to use commodity hardware for massively parallel data analysis applications. The translation and optimization of relational algebra operators into MapReduce programs is still an open and active research field. In this paper, we focus on a special type of data analysis query, namely the multiple group-by query. We first study the communication cost of the MapReduce model, then give an initial implementation of the multiple group-by query. We then propose an optimized version that addresses and reduces the communication cost. Our optimized version shows better speed-up and better scalability than the initial version.
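
To make the idea of a multiple group-by query and its communication cost concrete, here is a minimal single-process sketch in Python. It is not the paper's implementation: the function names, the `use_combiner` flag, and the sample columns (`region`, `product`, `sales`) are all assumptions made for illustration, and local pre-aggregation with a combiner is only one common way to reduce the number of pairs sent from mappers to reducers.

```python
from collections import defaultdict


def map_phase(records, group_by_columns, measure):
    """Emit one ((column, value), measure) pair per grouping column per record."""
    for record in records:
        for column in group_by_columns:
            yield (column, record[column]), record[measure]


def combine(pairs):
    """Sum pairs that share a key locally, so fewer pairs leave the mapper."""
    partial = defaultdict(int)
    for key, value in pairs:
        partial[key] += value
    return list(partial.items())


def reduce_phase(pairs):
    """Sum all values per (column, value) key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def multiple_group_by(partitions, group_by_columns, measure, use_combiner=True):
    """Answer several group-by queries in one pass.

    Returns the aggregates and the number of pairs that would be shuffled,
    which stands in here for the communication cost (an illustrative proxy).
    """
    shuffled = []
    for partition in partitions:  # each partition plays the role of one mapper
        pairs = list(map_phase(partition, group_by_columns, measure))
        if use_combiner:
            pairs = combine(pairs)
        shuffled.extend(pairs)
    return reduce_phase(shuffled), len(shuffled)


# Hypothetical sample data split across two mappers.
partitions = [
    [
        {"region": "EU", "product": "A", "sales": 10},
        {"region": "EU", "product": "B", "sales": 5},
    ],
    [
        {"region": "US", "product": "A", "sales": 7},
        {"region": "EU", "product": "A", "sales": 3},
    ],
]

naive_totals, naive_cost = multiple_group_by(
    partitions, ["region", "product"], "sales", use_combiner=False
)
combined_totals, combined_cost = multiple_group_by(
    partitions, ["region", "product"], "sales", use_combiner=True
)
```

Both runs produce the same aggregates; the run with the combiner simply shuffles fewer pairs, which is the kind of saving a communication-cost optimization aims for.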

References in zbMATH (referenced in 266 articles, 1 standard article)

Showing results 1 to 20 of 266.

  1. Nanongkai, Danupon; Scquizzato, Michele: Equivalence classes and conditional hardness in massively parallel computations (2022)
  2. Apishev, M. A.: Effective implementations of topic modeling algorithms (2021)
  3. Becker, Florent; Montealegre, Pedro; Rapaport, Ivan; Todinca, Ioan: The role of randomness in the broadcast congested clique model (2021)
  4. Berthold, Michael R.; Fillbrunn, Alexander; Siebes, Arno: Widening: using parallel resources to improve model quality (2021)
  5. Brandt, Sebastian; Fischer, Manuela; Uitto, Jara: Breaking the linear-memory barrier in \(\mathsf{MPC}\): fast \(\mathsf{MIS}\) on trees with strongly sublinear memory (2021)
  6. Burkhardt, Paul: Graph connectivity in log steps using label propagation (2021)
  7. Czumaj, Artur; Davies, Peter; Parter, Merav: Simple, deterministic, constant-round coloring in congested clique and MPC (2021)
  8. Dobriban, Edgar; Sheng, Yue: Distributed linear regression by averaging (2021)
  9. Hao, Rong-Xia; Tian, Zengxian: The vertex-pancyclicity of data center networks (2021)
  10. Harchol-Balter, Mor: Open problems in queueing theory inspired by datacenter computing (2021)
  11. Kwon, Joon; Lecué, Guillaume; Lerasle, Matthieu: A MOM-based ensemble method for robustness, subsampling and hyperparameter tuning (2021)
  12. Ramon-Cortes, Cristian; Alvarez, Pol; Lordan, Francesc; Alvarez, Javier; Ejarque, Jorge; Badia, Rosa M.: A survey on the distributed computing stack (2021)
  13. Rendell, Lewis J.; Johansen, Adam M.; Lee, Anthony; Whiteley, Nick: Global consensus Monte Carlo (2021)
  14. Zhang, Tonglin; Yang, Baijian: Accounting for factor variables in big data regression (2021)
  15. Ahmadi, Saba; Khuller, Samir; Purohit, Manish; Yang, Sheng: On scheduling coflows (2020)
  16. Audrito, Giorgio; Beal, Jacob; Damiani, Ferruccio; Pianini, Danilo; Viroli, Mirko: Field-based coordination with the share operator (2020)
  17. Czumaj, Artur; Łącki, Jakub; Mądry, Aleksander; Mitrović, Slobodan; Onak, Krzysztof; Sankowski, Piotr: Round compression for parallel matching algorithms (2020)
  18. Fotakis, Dimitris; Milis, Ioannis; Papadigenopoulos, Orestis; Vassalos, Vasilis; Zois, Georgios: Scheduling MapReduce jobs on identical and unrelated processors (2020)
  19. Genuzio, Marco; Ottaviano, Giuseppe; Vigna, Sebastiano: Fast scalable construction of ([compressed] static | minimal perfect hash) functions (2020)
  20. Ketsman, Bas; Albarghouthi, Aws; Koutris, Paraschos: Distribution policies for Datalog (2020)