Spark: cluster computing with working sets.

MapReduce and its variants have been highly successful in implementing large-scale data-intensive applications on commodity clusters. However, most of these systems are built around an acyclic data flow model that is not suitable for other popular applications. This paper focuses on one such class of applications: those that reuse a working set of data across multiple parallel operations. This includes many iterative machine learning algorithms, as well as interactive data analysis tools. We propose a new framework called Spark that supports these applications while retaining the scalability and fault tolerance of MapReduce. To achieve these goals, Spark introduces an abstraction called resilient distributed datasets (RDDs). An RDD is a read-only collection of objects partitioned across a set of machines that can be rebuilt if a partition is lost. Spark can outperform Hadoop by 10x in iterative machine learning jobs, and can be used to interactively query a 39 GB dataset with sub-second response time.
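
The RDD idea described in the abstract (a read-only, partitioned collection whose lost partitions can be rebuilt) can be illustrated with a toy, single-process sketch. This is not Spark's API: the class `ToyRDD` and its methods `from_list`, `map`, `partition`, `lose_partition`, and `reduce` are hypothetical names chosen for illustration, and "machines" are simulated by in-memory cache slots. The sketch assumes each dataset remembers how it was derived, so that a missing partition can be recomputed on demand.

```python
from typing import Any, Callable, Iterable, List, Optional, Tuple


class ToyRDD:
    """Toy model of a read-only partitioned dataset (hypothetical, not Spark's API)."""

    def __init__(self, num_partitions: int, compute: Callable[[int], Tuple[Any, ...]]):
        self.num_partitions = num_partitions
        # Recipe for rebuilding partition i from its source or parent dataset.
        self._compute = compute
        # Simulated per-machine storage; None means the partition is not held.
        self._cache: List[Optional[Tuple[Any, ...]]] = [None] * num_partitions

    @classmethod
    def from_list(cls, data: Iterable[Any], num_partitions: int) -> "ToyRDD":
        snapshot = tuple(data)  # immutable copy of the source data
        # Partition i holds every num_partitions-th element, starting at index i.
        return cls(num_partitions, lambda i: snapshot[i::num_partitions])

    def map(self, fn: Callable[[Any], Any]) -> "ToyRDD":
        parent = self
        # The new dataset stores only how to derive each partition from its parent.
        return ToyRDD(
            self.num_partitions,
            lambda i: tuple(fn(x) for x in parent.partition(i)),
        )

    def partition(self, i: int) -> Tuple[Any, ...]:
        # Rebuild the partition from its recipe if it is missing.
        if self._cache[i] is None:
            self._cache[i] = self._compute(i)
        return self._cache[i]

    def lose_partition(self, i: int) -> None:
        # Simulate a machine failure by dropping the stored partition.
        self._cache[i] = None

    def reduce(self, fn: Callable[[Any, Any], Any], initial: Any) -> Any:
        acc = initial
        for i in range(self.num_partitions):
            for x in self.partition(i):
                acc = fn(acc, x)
        return acc


squares = ToyRDD.from_list(range(10), num_partitions=3).map(lambda x: x * x)
total_before = squares.reduce(lambda a, b: a + b, 0)
squares.lose_partition(1)  # drop one partition, then reuse the same dataset
total_after = squares.reduce(lambda a, b: a + b, 0)
```

In this sketch the second `reduce` reuses the two partitions that are still held and recomputes only the dropped one, which mirrors the working-set reuse and rebuild-on-loss behaviour the abstract describes, in a much simplified form.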

References in zbMATH (referenced in 31 articles)

Showing results 1 to 20 of 31.
Sorted by year (citations)


  1. David B. Dahl: Integration of R and Scala Using rscala (2020) not zbMATH
  2. Fotakis, Dimitris; Milis, Ioannis; Papadigenopoulos, Orestis; Vassalos, Vasilis; Zois, Georgios: Scheduling MapReduce jobs on identical and unrelated processors (2020)
  3. Montealegre, P.; Perez-Salazar, S.; Rapaport, I.; Todinca, I.: Graph reconstruction in the congested clique (2020)
  4. Tang, Lu; Zhou, Ling; Song, Peter X.-K.: Distributed simultaneous inference in generalized linear models via confidence distribution (2020)
  5. Sánchez, César; Schneider, Gerardo; Ahrendt, Wolfgang; Bartocci, Ezio; Bianculli, Domenico; Colombo, Christian; Falcone, Yliès; Francalanza, Adrian; Krstić, Srđan; Lourenço, João M.; Nickovic, Dejan; Pace, Gordon J.; Rufino, Jose; Signoles, Julien; Traytel, Dmitriy; Weiss, Alexander: A survey of challenges for runtime verification from advanced application domains (beyond software) (2019)
  6. Terenin, Alexander; Dong, Shawfeng; Draper, David: GPU-accelerated Gibbs sampling: a case study of the horseshoe probit model (2019)
  7. Tsamardinos, Ioannis; Borboudakis, Giorgos; Katsogridakis, Pavlos; Pratikakis, Polyvios; Christophides, Vassilis: A greedy feature selection algorithm for big data of high dimensionality (2019)
  8. Yu, Hong; Chen, Yun; Lingras, Pawan; Wang, Guoyin: A three-way cluster ensemble approach for large-scale data (2019)
  9. Condie, Tyson; Das, Ariyam; Interlandi, Matteo; Shkapsky, Alexander; Yang, Mohan; Zaniolo, Carlo: Scaling-up reasoning and advanced analytics on bigdata (2018)
  10. Haller, Philipp; Miller, Heather; Müller, Normen: A programming model and foundation for lineage-based distributed computation (2018)
  11. Karim, Md. Rezaul; Cochez, Michael; Beyan, Oya Deniz; Ahmed, Chowdhury Farhan; Decker, Stefan: Mining maximal frequent patterns in transactional databases and dynamic data streams: a Spark-based approach (2018)
  12. Law, Jonathan; Wilkinson, Darren J.: Composable models for online Bayesian analysis of streaming data (2018)
  13. Nghiem, Peter P.: Best trade-off point method for efficient resource provisioning in spark (2018)
  14. Pelucchi, Mauro; Psaila, Giuseppe; Toccu, Maurizio: Hadoop vs. Spark: impact on performance of the Hammer query engine for open data corpora (2018)
  15. Wang, Shusen; Gittens, Alex; Mahoney, Michael W.: Sketched ridge regression: optimization perspective, statistical perspective, and model averaging (2018)
  16. Zheng, Wenjie; Bellet, Aurélien; Gallinari, Patrick: A distributed Frank-Wolfe framework for learning low-rank matrices with the trace norm (2018)
  17. Brandt, Jörgen; Reisig, Wolfgang; Leser, Ulf: Computation semantics of the functional scientific workflow language Cuneiform (2017)
  18. Coelho, L.P.: Jug: Software for Parallel Reproducible Computation in Python (2017) not zbMATH
  19. Ferraro Petrillo, Umberto; Guerra, Concettina; Pizzi, Cinzia: A new distributed alignment-free approach to compare whole proteomes (2017)
  20. García, José; Pope, Christopher; Altimiras, Francisco: A distributed K-means segmentation algorithm applied to Lobesia botrana recognition (2017)
