This is a shared repository for [Learning Apache Spark Notes](https://github.com/runawayhorse001/LearningApacheSpark). The PDF version can be downloaded from [HERE](pyspark.pdf). The first version was posted on GitHub at [ChenFeng](https://mingchen0919.github.io/learning-apache-spark/index.html) ([[Feng2017]](reference.html#feng2017)). This shared repository mainly contains the self-learning and self-teaching notes written by Wenqiang during his [IMA Data Science Fellowship](https://www.ima.umn.edu/2016-2017/SW1.23-3.10.17#). For more details about the `dataset` and the `.ipynb` files, see the repository at [https://github.com/runawayhorse001/LearningApacheSpark](https://github.com/runawayhorse001/LearningApacheSpark).
In this repository, I try to use detailed demo code and examples to show how to use each of the main functions. If you find that your work is not cited in these notes, please feel free to let me know.
Although I am by no means a data mining or Big Data expert, I decided that it would be useful to share what I learned about PySpark programming in the form of easy tutorials with detailed examples. I hope these tutorials will be a valuable tool for your studies.
The tutorials assume that the reader has a preliminary knowledge of programming and Linux. The English document is generated automatically using [sphinx](http://sphinx.pocoo.org).
Wenqiang Feng is a Data Scientist within DST’s Applied Analytics Group. Dr. Feng’s responsibilities include providing DST clients with access to cutting-edge skills and technologies, including Big Data analytic solutions, advanced analytic and data enhancement techniques, and modeling.
Dr. Feng has deep analytic expertise in data mining, analytic systems, machine learning algorithms, business intelligence, and applying Big Data tools to strategically solve industry problems in a cross-functional business. Before joining DST, Dr. Feng was an IMA Data Science Fellow at The Institute for Mathematics and its Applications (IMA) at the University of Minnesota. While there, he helped startup companies make marketing decisions based on deep predictive analytics.
Dr. Feng graduated from the University of Tennessee, Knoxville, with a Ph.D. in Computational Mathematics and a Master’s degree in Statistics. He also holds a Master’s degree in Computational Mathematics from Missouri University of Science and Technology (MST) and a Master’s degree in Applied Mathematics from the University of Science and Technology of China (USTC).
The work of Wenqiang Feng was supported by the IMA while he was working there. However, any opinions, findings, conclusions, or recommendations expressed in this material are those of the author and do not necessarily reflect the views of the IMA, UTK, or DST.
## 1.2\. Motivation for this tutorial
I was motivated to learn PySpark by the [IMA Data Science Fellowship](https://www.ima.umn.edu/2016-2017/SW1.23-3.10.17#) project. After that, I was deeply impressed by PySpark, and I found that:
> 1. It is no exaggeration to say that Spark is the most powerful Big Data tool.
> 2. However, I still found that learning Spark was a difficult process. I had to search Google and work out which of the answers were correct. It was also hard to find detailed examples from which I could easily learn the full process in one file.
> 3. Good sources are expensive for a graduate student.
This [Learning Apache Spark with Python](pyspark.pdf) PDF file is meant to be a free and living document, which is why its source is available online at [https://runawayhorse001.github.io/LearningApacheSpark/pyspark.pdf](https://runawayhorse001.github.io/LearningApacheSpark/pyspark.pdf). This document is licensed under both the [MIT License](https://github.com/runawayhorse001/LearningApacheSpark/blob/master/LICENSE) and the [Creative Commons Attribution-NonCommercial 2.0 Generic (CC BY-NC 2.0) License](https://creativecommons.org/licenses/by-nc/2.0/legalcode).
The code in this document is released under the [MIT License](https://github.com/runawayhorse001/LearningApacheSpark/blob/master/LICENSE), and the text under the [CC BY-NC-SA 4.0 License](https://creativecommons.org/licenses/by-nc-sa/4.0/legalcode).
**When you plan to use, copy, modify, merge, publish, distribute, or sublicense this material, please see the terms of those licenses for more details and give the corresponding credit to the author**.
Here, I would like to thank Ming Chen, Jian Sun, and Zhongbo Li at the University of Tennessee at Knoxville for the valuable discussions, and to thank the generous anonymous authors who provided detailed solutions and source code on the internet. Without their help, this repository would not have been possible. Wenqiang would also like to thank the [Institute for Mathematics and Its Applications (IMA)](https://www.ima.umn.edu/) at the [University of Minnesota, Twin Cities](https://twin-cities.umn.edu/) for support during his IMA Data Science Fellow visit.
A special thank you goes to [Dr. Haiping Lu](http://staffwww.dcs.shef.ac.uk/people/H.Lu/), Lecturer in Machine Learning at the Department of Computer Science, University of Sheffield, for recommending and heavily using my tutorial in his teaching and for his valuable suggestions.
## 1.5\. Feedback and suggestions
Your comments and suggestions are highly appreciated. I am more than happy to receive corrections, suggestions, or feedback for improvements by email ([von198@gmail.com](mailto:von198%40gmail.com)).
Write applications quickly in Java, Scala, Python, and R.
Spark offers over 80 high-level operators that make it easy to build parallel apps. You can also use it interactively from the Scala, Python, and R shells.
2.Generality
Combine SQL, streaming, and complex analytics.
Spark powers a stack of libraries including SQL and DataFrames, MLlib for machine learning, GraphX, and Spark Streaming. You can combine these libraries seamlessly in the same application.