**Good tools are prerequisite to the successful execution of a job.** – old Chinese proverb
A good programming platform can save you a lot of trouble and time. Here I will only present how to install my favorite programming platform, and only show the easiest way I know to set it up on a Linux system. If you want to install it on another operating system, you can search online for instructions. In this section, you will learn how to set up PySpark on the corresponding programming platforms and packages.
If you don’t have any experience with Linux or Unix operating systems, I recommend using Spark on Databricks Community Cloud: you do not need to set up Spark yourself, and the Community Edition is totally **free**. Please follow the steps listed below.
> You need to save the path that appears after `Uploaded to DBFS`: `/FileStore/tables/05rmhuqv1489687378010/`, since we will use this path to load the dataset.
After finishing the above five steps, you are ready to run your Spark code on Databricks Community Cloud. I will run all the following demos on Databricks Community Cloud. Hopefully, when you run the demo code, you will get the following results:
I strongly recommend that you install [Anaconda](https://www.anaconda.com/download/), since it contains most of the prerequisites and supports multiple operating systems.
1. **Install Python**
Go to the Ubuntu Software Center and follow the steps below:
Java is used by many other software packages, so it is quite possible that you have already installed it. You can check by using the following command in the Command Prompt:
```bash
java -version
```
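If you prefer to check the version programmatically (for example, from a setup script), the version number can be parsed out of the `java -version` output. A small sketch, assuming the usual quoting format of that output (`java_major_version` is a hypothetical helper, not part of any Spark tooling):

```python
import re

def java_major_version(version_output):
    """Extract the Java major version from `java -version` output.

    Java 8 and earlier report versions as "1.x.y"; Java 9+ report
    the major version directly, e.g. "11.0.2".
    """
    m = re.search(r'version "(\d+)\.(\d+)', version_output)
    if m is None:
        return None
    major, minor = int(m.group(1)), int(m.group(2))
    return minor if major == 1 else major

print(java_major_version('java version "1.8.0_181"'))          # → 8
print(java_major_version('java version "11.0.2" 2019-01-15'))  # → 11
```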
Otherwise, you can follow the steps in [How do I install Java for my Mac?](https://java.com/en/download/help/mac_install.xml) to install Java on a Mac, and use the following commands in the terminal to install it on Ubuntu:
```bash
sudo apt-add-repository ppa:webupd8team/java
sudo apt-get update
sudo apt-get install oracle-java8-installer
```
### 3.2.3\. Install Java SE Runtime Environment
I installed ORACLE [Java JDK](http://www.oracle.com/technetwork/java/javase/downloads/index-jsp-138363.html).
```
Python 2.7.13 |Anaconda 4.4.0 (x86_64)| (default, Dec 20 2016, 23:05:08)
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
Anaconda is brought to you by Continuum Analytics.
Please check out: http://continuum.io/thanks and https://anaconda.org
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
17/08/30 13:30:12 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/08/30 13:30:17 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.1.1
      /_/

Using Python version 2.7.13 (default, Dec 20 2016 23:05:08)
SparkSession available as 'spark'.
```

Installing open source software on Windows is always a nightmare for me. Thanks to Deelesh Mandloi, you can follow the detailed procedures in the blog post [Getting Started with PySpark on Windows](http://deelesh.github.io/pyspark-windows.html) to install Apache Spark™ on your Windows operating system.
## 3.4\. PySpark With Text Editor or IDE
### 3.4.1\. PySpark With Jupyter Notebook
After finishing the setup steps in [Configure Spark on Mac and Ubuntu](#set-up-ubuntu), you should be ready to write and run your PySpark code in a Jupyter notebook.
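One common way to do this is to tell PySpark to use Jupyter as its driver Python before launching. A sketch for your `bashrc` or `bash_profile` (these are the standard PySpark environment variables, but verify them against your Spark version's documentation):

```bash
# Make the pyspark launcher start a Jupyter notebook instead of the plain shell.
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS=notebook

# Now launching pyspark opens Jupyter with `spark` predefined.
pyspark
```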
After finishing the setup steps in [Configure Spark on Mac and Ubuntu](#set-up-ubuntu), you should be ready to write and run your PySpark code in Apache Zeppelin.
After finishing the setup steps in [Configure Spark on Mac and Ubuntu](#set-up-ubuntu), you should be ready to use Sublime Text to write your PySpark code and to run it as a normal Python script in the terminal.
```bash
python test_pyspark.py
```
Then you should get the output in your terminal.
If you have set up PySpark correctly, you will get the following results:
```
Using Spark defined in the SPARK_HOME=/Users/dt216661/spark environmental property
Python 3.7.1 (default, Dec 14 2018, 13:28:58)
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
2019-02-15 14:08:30 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
2019-02-15 14:08:31 WARN Utils:66 - Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
2019-02-15 14:08:31 WARN Utils:66 - Service 'SparkUI' could not bind on port 4041. Attempting port 4042.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.4.0
      /_/

Using Python version 3.7.1 (default, Dec 14 2018 13:28:58)
SparkSession available as 'spark'.
```
1. Set up `pysparkling` with the Jupyter notebook

Add the following alias to your `bashrc` (Linux systems) or `bash_profile` (Mac systems):

```bash
alias sparkling='PYSPARK_DRIVER_PYTHON="ipython" PYSPARK_DRIVER_PYTHON_OPTS="notebook" ~/sparkling-water-2.4.5/bin/pysparkling'
```
1. Open `pysparkling` in the terminal

```bash
sparkling
```
## 3.6\. Set up Spark on Cloud
Following the setup steps in [Configure Spark on Mac and Ubuntu](#set-up-ubuntu), you can set up your own cluster on the cloud, for example on AWS or Google Cloud. Actually, those clouds have their own Big Data tools, which you can run directly without any setup, just like Databricks Community Cloud. If you want more details, please feel free to contact me.
The code for this section can be downloaded from [test_pyspark](static/test_pyspark.py), and the Jupyter notebook can be downloaded from [test_pyspark_ipynb](static/test_pyspark.ipynb).