# 1\. Preface
## 1.1\. About
### 1.1.1\. About this note
This is a shared repository for [Learning Apache Spark Notes](https://github.com/runawayhorse001/LearningApacheSpark). The PDF version can be downloaded from [HERE](pyspark.pdf). The first version was posted on Github in [ChenFeng](https://mingchen0919.github.io/learning-apache-spark/index.html) ([[Feng2017]](reference.html#feng2017)). This shared repository mainly contains the self-learning and self-teaching notes from Wenqiang during his [IMA Data Science Fellowship](https://www.ima.umn.edu/2016-2017/SW1.23-3.10.17#). The reader is referred to the repository [https://github.com/runawayhorse001/LearningApacheSpark](https://github.com/runawayhorse001/LearningApacheSpark) for more details about the `dataset` and the `.ipynb` files.
In this repository, I try to use detailed demo code and examples to show how to use each main function. If you find your work wasn't cited in this note, please feel free to let me know.
Although I am by no means a data mining programming and Big Data expert, I decided that it would be useful for me to share what I learned about PySpark programming in the form of easy tutorials with detailed examples. I hope those tutorials will be a valuable tool for your studies.
The tutorials assume that the reader has preliminary knowledge of programming and Linux. This document is generated automatically using [sphinx](http://sphinx.pocoo.org).
### 1.1.2\. About the authors
* **Wenqiang Feng**
* Data Scientist and PhD in Mathematics
* University of Tennessee at Knoxville
* Email: [von198@gmail.com](mailto:von198%40gmail.com)
* **Biography**
Wenqiang Feng is a Data Scientist within DST's Applied Analytics Group. Dr. Feng's responsibilities include providing DST clients with access to cutting-edge skills and technologies, including Big Data analytic solutions, advanced analytic and data enhancement techniques, and modeling.
Dr. Feng has deep analytic expertise in data mining, analytic systems, machine learning algorithms, business intelligence, and applying Big Data tools to strategically solve industry problems in a cross-functional business. Before joining DST, Dr. Feng was an IMA Data Science Fellow at The Institute for Mathematics and its Applications (IMA) at the University of Minnesota. While there, he helped startup companies make marketing decisions based on deep predictive analytics.
Dr. Feng graduated from the University of Tennessee, Knoxville, with a Ph.D. in Computational Mathematics and a Master's degree in Statistics. He also holds a Master's degree in Computational Mathematics from Missouri University of Science and Technology (MST) and a Master's degree in Applied Mathematics from the University of Science and Technology of China (USTC).
* **Declaration**
The work of Wenqiang Feng was supported by the IMA while he was working there. However, any opinions, findings, and conclusions or recommendations expressed in this material are those of the author and do not necessarily reflect the views of the IMA, UTK, or DST.
## 1.2\. Motivation for this tutorial
I was motivated by the [IMA Data Science Fellowship](https://www.ima.umn.edu/2016-2017/SW1.23-3.10.17#) project to learn PySpark. After that, I was impressed and attracted by PySpark, and I found that:
> 1. It is no exaggeration to say that Spark is the most powerful Big Data tool.
> 2. However, I still found that learning Spark was a difficult process. I had to Google answers and identify which ones were correct, and it was hard to find detailed examples from which I could easily learn the full process in one file.
> 3. Good sources are expensive for a graduate student.
## 1.3\. Copyright notice and license info
This [Learning Apache Spark with Python](pyspark.pdf) PDF file is supposed to be a free and living document, which is why its source is available online at [https://runawayhorse001.github.io/LearningApacheSpark/pyspark.pdf](https://runawayhorse001.github.io/LearningApacheSpark/pyspark.pdf). But this document is licensed according to both [MIT License](https://github.com/runawayhorse001/LearningApacheSpark/blob/master/LICENSE) and [Creative Commons Attribution-NonCommercial 2.0 Generic (CC BY-NC 2.0) License](https://creativecommons.org/licenses/by-nc/2.0/legalcode).
**When you plan to use, copy, modify, merge, publish, distribute, or sublicense, please see the terms of those licenses for more details and give the corresponding credits to the author**.
## 1.4\. Acknowledgement
Here, I would like to thank Ming Chen, Jian Sun, and Zhongbo Li at the University of Tennessee at Knoxville for the valuable discussions, and thank the generous anonymous authors for providing the detailed solutions and source code on the internet. Without that help, this repository would not have been possible. Wenqiang also would like to thank the [Institute for Mathematics and Its Applications (IMA)](https://www.ima.umn.edu/) at the [University of Minnesota, Twin Cities](https://twin-cities.umn.edu/) for support during his IMA Data Science Fellowship visit.
A special thank you goes to [Dr. Haiping Lu](http://staffwww.dcs.shef.ac.uk/people/H.Lu/), Lecturer in Machine Learning at the Department of Computer Science, University of Sheffield, for recommending and heavily using my tutorial in his teaching classes and for his valuable suggestions.
## 1.5\. Feedback and suggestions
Your comments and suggestions are highly appreciated. I am more than happy to receive corrections, suggestions, or feedback through email ([von198@gmail.com](mailto:von198%40gmail.com)) for improvements.
# 2\. Why Spark with Python?
Chinese proverb
> **磨刀不误砍柴工。Sharpening the knife will not delay the work of chopping the firewood.** – old Chinese proverb
I want to answer this question from the following two parts:
## 2.1\. Why Spark?
I think the following four main reasons from the [Apache Spark™](http://spark.apache.org/) official website are good enough to convince you to use Spark.
1. Speed
Run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk.
Apache Spark has an advanced DAG execution engine that supports acyclic data flow and in-memory computing.
> ![https://runawayhorse001.github.io/LearningApacheSpark/_images/logistic-regression.png](img/72748fa31cb48a5062a2fc7949bd0b45.jpg)
>
> Logistic regression in Hadoop and Spark
2. Ease of Use
Write applications quickly in Java, Scala, Python, R.
Spark offers over 80 high-level operators that make it easy to build parallel apps. And you can use it interactively from the Scala, Python and R shells.
3. Generality
Combine SQL, streaming, and complex analytics.
Spark powers a stack of libraries including SQL and DataFrames, MLlib for machine learning, GraphX, and Spark Streaming. You can combine these libraries seamlessly in the same application.
> [![https://runawayhorse001.github.io/LearningApacheSpark/_images/stack.png](img/d3b112475692c0421480c01cd029cf09.jpg)](https://runawayhorse001.github.io/LearningApacheSpark/_images/stack.png)
>
> The Spark stack
4. Runs Everywhere
Spark runs on Hadoop, Mesos, standalone, or in the cloud. It can access diverse data sources including HDFS, Cassandra, HBase, and S3.
> [![https://runawayhorse001.github.io/LearningApacheSpark/_images/spark-runs-everywhere.png](img/b9eb842264e6a48a42ecf5f142e32414.jpg)](https://runawayhorse001.github.io/LearningApacheSpark/_images/spark-runs-everywhere.png)
>
> The Spark platform
## 2.2\. Why Spark with Python (PySpark)?
Whether you like it or not, Python has become one of the most popular programming languages.
> ![https://runawayhorse001.github.io/LearningApacheSpark/_images/languages.jpg](img/348c0d7bc8db0d630042e5faffd2d647.jpg)
>
> KDnuggets Analytics/Data Science 2017 Software Poll from [kdnuggets](http://www.kdnuggets.com/2017/05/poll-analytics-data-science-machine-learning-software-leaders.html).