Spark: cluster computing with working sets.

MapReduce and its variants have been highly successful in implementing large-scale data-intensive applications on commodity clusters. However, most of these systems are built around an acyclic data flow model that is not suitable for other popular applications. This paper focuses on one such class of applications: those that reuse a working set of data across multiple parallel operations. This includes many iterative machine learning algorithms, as well as interactive data analysis tools. We propose a new framework called Spark that supports these applications while retaining the scalability and fault tolerance of MapReduce. To achieve these goals, Spark introduces an abstraction called resilient distributed datasets (RDDs). An RDD is a read-only collection of objects partitioned across a set of machines that can be rebuilt if a partition is lost. Spark can outperform Hadoop by 10x in iterative machine learning jobs, and can be used to interactively query a 39 GB dataset with sub-second response time.
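
The RDD idea described above (a read-only, partitioned collection whose lost partitions can be rebuilt) can be illustrated with a small single-process toy. This is only a sketch under stated assumptions, not Spark's API or implementation: the class `ToyRDD` and its methods `from_list`, `partition`, and `lose_partition` are hypothetical names invented for this example, and an in-memory list stands in for both the cluster and stable storage.

```python
from typing import Callable, List, Optional


class ToyRDD:
    """Hypothetical toy: a read-only, partitioned collection that
    remembers how each partition is derived (its lineage)."""

    def __init__(self, num_partitions: int, compute: Callable[[int], List]):
        self.num_partitions = num_partitions
        self._compute = compute  # lineage: how to derive partition i
        self._cache: List[Optional[List]] = [None] * num_partitions

    @classmethod
    def from_list(cls, data: List, num_partitions: int) -> "ToyRDD":
        # Assumption: a copied list stands in for stable storage, so the
        # slice i::n always yields the same contents for partition i.
        snapshot = list(data)
        return cls(num_partitions, lambda i: snapshot[i::num_partitions])

    def map(self, fn: Callable) -> "ToyRDD":
        # A transformation returns a new dataset; the parent is not mutated.
        return ToyRDD(self.num_partitions,
                      lambda i: [fn(x) for x in self.partition(i)])

    def partition(self, i: int) -> List:
        # Serve from the cached working set, or rebuild from lineage.
        if self._cache[i] is None:
            self._cache[i] = self._compute(i)
        return self._cache[i]

    def lose_partition(self, i: int) -> None:
        # Simulate a machine failure that discards one cached partition.
        self._cache[i] = None

    def collect(self) -> List:
        return [x for i in range(self.num_partitions)
                for x in self.partition(i)]


base = ToyRDD.from_list(list(range(10)), num_partitions=3)
squares = base.map(lambda x: x * x)

before = squares.collect()   # first pass computes and caches every partition
squares.lose_partition(1)    # drop one cached partition
after = squares.collect()    # only the lost partition is recomputed
```

In this toy, repeated calls to `collect` reuse the cached partitions, which mirrors the working-set reuse the abstract motivates, while a dropped partition is recomputed from its parent rather than restored from a replica.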

References in zbMATH (referenced in 41 articles)

Showing results 1 to 20 of 41.

  1. Apishev, M. A.: Effective implementations of topic modeling algorithms (2021)
  2. Becker, Florent; Montealegre, Pedro; Rapaport, Ivan; Todinca, Ioan: The role of randomness in the broadcast congested clique model (2021)
  3. Czumaj, Artur; Davies, Peter; Parter, Merav: Simple, deterministic, constant-round coloring in congested clique and MPC (2021)
  4. Dobriban, Edgar; Sheng, Yue: Distributed linear regression by averaging (2021)
  5. Luthra, Manisha; Koldehofe, Boris; Danger, Niels; Weisenberger, Pascal; Salvaneschi, Guido; Stavrakakis, Ioannis: TCEP: transitions in operator placement to adapt to dynamic network environments (2021)
  6. Zhang, Tonglin; Yang, Baijian: Accounting for factor variables in big data regression (2021)
  7. Ahmadi, Saba; Khuller, Samir; Purohit, Manish; Yang, Sheng: On scheduling coflows (2020)
  8. Czumaj, Artur; Łącki, Jakub; Mądry, Aleksander; Mitrović, Slobodan; Onak, Krzysztof; Sankowski, Piotr: Round compression for parallel matching algorithms (2020)
  9. Dahl, David B.: Integration of R and Scala using rscala (2020) (not in zbMATH)
  10. Fotakis, Dimitris; Milis, Ioannis; Papadigenopoulos, Orestis; Vassalos, Vasilis; Zois, Georgios: Scheduling MapReduce jobs on identical and unrelated processors (2020)
  11. Li, Qi; Zhong, Jiang; Cao, Zehong; Li, Xue: Optimizing streaming graph partitioning via a heuristic greedy method and caching strategy (2020)
  12. Montealegre, P.; Perez-Salazar, S.; Rapaport, I.; Todinca, I.: Graph reconstruction in the congested clique (2020)
  13. Tang, Lu; Zhou, Ling; Song, Peter X.-K.: Distributed simultaneous inference in generalized linear models via confidence distribution (2020)
  14. Zhang, Longxin; Zhou, Liqian; Salah, Ahmad: Efficient scientific workflow scheduling for deadline-constrained parallel tasks in cloud computing environments (2020)
  15. Sánchez, César; Schneider, Gerardo; Ahrendt, Wolfgang; Bartocci, Ezio; Bianculli, Domenico; Colombo, Christian; Falcone, Yliès; Francalanza, Adrian; Krstić, Srđan; Lourenço, João M.; Nickovic, Dejan; Pace, Gordon J.; Rufino, Jose; Signoles, Julien; Traytel, Dmitriy; Weiss, Alexander: A survey of challenges for runtime verification from advanced application domains (beyond software) (2019)
  16. Terenin, Alexander; Dong, Shawfeng; Draper, David: GPU-accelerated Gibbs sampling: a case study of the horseshoe probit model (2019)
  17. Tsamardinos, Ioannis; Borboudakis, Giorgos; Katsogridakis, Pavlos; Pratikakis, Polyvios; Christophides, Vassilis: A greedy feature selection algorithm for big data of high dimensionality (2019)
  18. Yu, Hong; Chen, Yun; Lingras, Pawan; Wang, Guoyin: A three-way cluster ensemble approach for large-scale data (2019)
  19. Bateni, Mohammadhossein; Behnezhad, Soheil; Derakhshan, Mahsa; Hajiaghayi, Mohammadtaghi; Mirrokni, Vahab: Brief announcement: MapReduce algorithms for massive trees (2018)
  20. Condie, Tyson; Das, Ariyam; Interlandi, Matteo; Shkapsky, Alexander; Yang, Mohan; Zaniolo, Carlo: Scaling-up reasoning and advanced analytics on BigData (2018)
