# 3\. Configure Running Platform
> **工欲善其事,必先利其器。** – old Chinese proverb
>
> **Good tools are prerequisite to the successful execution of a job.**

A good programming platform can save you a lot of trouble and time. Here I will only present how to install my favorite programming platform, and I will only show the easiest way I know to set it up on a Linux system. If you want to install it on another operating system, you can search for instructions online. In this section, you will learn how to set up PySpark on the corresponding programming platform and packages.
## 3.1\. Run on Databricks Community Cloud

If you don’t have any experience with a Linux or Unix operating system, I would recommend using Spark on Databricks Community Cloud, since you do not need to set up Spark yourself and it is completely **free** for the Community Edition. Please follow the steps listed below.
1. Sign up for an account at [https://community.cloud.databricks.com/login.html](https://community.cloud.databricks.com/login.html)

   ![https://runawayhorse001.github.io/LearningApacheSpark/_images/login.png](img/7166a4887b7f211527c9e45a072e23d2.jpg)

2. Sign in with your account; then you can create your cluster (machine), table (dataset) and notebook (code).

   ![https://runawayhorse001.github.io/LearningApacheSpark/_images/workspace.png](img/c9c3087ea25e6c3f848030b33b06de8f.jpg)

3. Create the cluster where your code will run.

   ![https://runawayhorse001.github.io/LearningApacheSpark/_images/cluster.png](img/fdfe96b0b4fdfbfd862a698dc64ce34a.jpg)

4. Import your dataset.

   ![https://runawayhorse001.github.io/LearningApacheSpark/_images/table.png](img/b7721ad6f461509452813013157c7a5e.jpg) ![https://runawayhorse001.github.io/LearningApacheSpark/_images/dataset1.png](img/b8c9ccb17235ad37b2b0fee18853efe6.jpg)

   > Note
   >
   > You need to save the path that appears under `Uploaded to DBFS`: `/FileStore/tables/05rmhuqv1489687378010/`, since we will use this path to load the dataset (see the sketch right after this list).

5. Create your notebook.

   ![https://runawayhorse001.github.io/LearningApacheSpark/_images/notebook.png](img/edb67528127916e7e274addf9ad96029.jpg) ![https://runawayhorse001.github.io/LearningApacheSpark/_images/codenotebook.png](img/8973b73843e90120de5f556d5084eb49.jpg)
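For reference, here is a minimal sketch of a notebook cell that reads the uploaded file. The file name `Advertising.csv` and the full DBFS path are assumptions for illustration; use the `Uploaded to DBFS` path recorded for your own upload (in Databricks notebooks the `spark` session is already defined for you).

```py
# Minimal sketch of a Databricks notebook cell that reads the uploaded dataset.
# The folder and file name below are assumptions -- replace them with the
# "Uploaded to DBFS" path shown for your own upload.
df = spark.read.csv(
    "/FileStore/tables/05rmhuqv1489687378010/Advertising.csv",  # hypothetical path
    header=True,
    inferSchema=True,
)
df.show(5)
df.printSchema()
```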
After finishing the above 5 steps, you are ready to run your Spark code on Databricks Community Cloud. I will run all the following demos on Databricks Community Cloud. Hopefully, when you run the demo code, you will get the following results:
> ```
> +---+-----+-----+---------+-----+
> |_c0|   TV|Radio|Newspaper|Sales|
> +---+-----+-----+---------+-----+
> |  1|230.1| 37.8|     69.2| 22.1|
> |  2| 44.5| 39.3|     45.1| 10.4|
> |  3| 17.2| 45.9|     69.3|  9.3|
> |  4|151.5| 41.3|     58.5| 18.5|
> |  5|180.8| 10.8|     58.4| 12.9|
> +---+-----+-----+---------+-----+
> only showing top 5 rows
>
> root
>  |-- _c0: integer (nullable = true)
>  |-- TV: double (nullable = true)
>  |-- Radio: double (nullable = true)
>  |-- Newspaper: double (nullable = true)
>  |-- Sales: double (nullable = true)
>
> ```
## 3.2\. Configure Spark on Mac and Ubuntu
### 3.2.1\. Installing Prerequisites
I strongly recommend that you install [Anaconda](https://www.anaconda.com/download/), since it contains most of the prerequisites and supports multiple operating systems.
1. **Install Python**
Go to the Ubuntu Software Center and follow these steps:
> 1. Open Ubuntu Software Center
> 2. Search for python
> 3. And click Install
Or open your terminal and use the following commands:
```bash
sudo apt-get install build-essential checkinstall
sudo apt-get install libreadline-gplv2-dev libncursesw5-dev libssl-dev \
    libsqlite3-dev tk-dev libgdbm-dev libc6-dev libbz2-dev
# ...
sudo pip install ipython
```
### 3.2.2\. Install Java
Java is used by many other software packages, so it is quite possible that you have already installed it. You can check by using the following command in a terminal:
```bash
java -version
```
Otherwise, you can follow the steps in [How do I install Java for my Mac?](https://java.com/en/download/help/mac_install.xml) to install Java on a Mac, and use the following commands in a terminal to install it on Ubuntu:
```bash
sudo apt-add-repository ppa:webupd8team/java
sudo apt-get update
sudo apt-get install oracle-java8-installer
```
### 3.2.3\. Install Java SE Runtime Environment
I installed the Oracle [Java JDK](http://www.oracle.com/technetwork/java/javase/downloads/index-jsp-138363.html).
> Warning
>
> **The Java and Java SE Runtime Environment installation steps are very important, since Spark runs on the Java Virtual Machine (JVM).**
You can check whether Java is available and find its version by using the following command in a terminal:
```bash
java -version
```
If Java is installed successfully, you will get results similar to the following:
```
java version "1.8.0_131"
Java(TM) SE Runtime Environment (build 1.8.0_131-b11)
Java HotSpot(TM) 64-Bit Server VM (build 25.131-b11, mixed mode)
```
### 3.2.4\. Install Apache Spark
Actually, the pre-built version doesn’t need installation; you can use it as soon as you unpack it.
> 1. Download: You can get the Pre-built Apache Spark™ from [Download Apache Spark™](http://spark.apache.org/downloads.html).
> 2. Unpack: Unpack Apache Spark™ to the path where you want to install Spark.
> 3. Test: Test the prerequisites: change directory to `spark-#.#.#-bin-hadoop#.#/bin` and run
>
> ```
> ./pyspark
>
> ```
>
> ```
> Python 2.7.13 |Anaconda 4.4.0 (x86_64)| (default, Dec 20 2016, 23:05:08)
> [GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)] on darwin
> Type "help", "copyright", "credits" or "license" for more information.
> Anaconda is brought to you by Continuum Analytics.
> Please check out: http://continuum.io/thanks and https://anaconda.org
> Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel). For SparkR,
> use setLogLevel(newLevel).
> 17/08/30 13:30:12 WARN NativeCodeLoader: Unable to load native-hadoop
> library for your platform... using builtin-java classes where applicable
> 17/08/30 13:30:17 WARN ObjectStore: Failed to get database global_temp,
> returning NoSuchObjectException
> Welcome to
>       ____              __
>      / __/__  ___ _____/ /__
>     _\ \/ _ \/ _ `/ __/ '_/
>    /__ / .__/\_,_/_/ /_/\_\   version 2.1.1
>       /_/
>
> Using Python version 2.7.13 (default, Dec 20 2016 23:05:08)
> SparkSession available as 'spark'.
>
> ```
### 3.2.5\. Configure the Spark
> 1. **Mac operating system:** open your `bash_profile` in Terminal
>
> ```
> vim ~/.bash_profile
>
> ```
>
> And add the following lines to your `bash_profile` (remember to change the path)
>
> ```
> # add for spark
> export SPARK_HOME=your_spark_installation_path
> export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
> export PYSPARK_DRIVER_PYTHON="jupyter"
> export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
>
> ```
>
> At last, remember to source your `bash_profile`
>
> ```
> source ~/.bash_profile
>
> ```
>
> 2. **Ubuntu operating system:** open your `bashrc` in Terminal
>
> ```
> vim ~/.bashrc
>
> ```
>
> And add the following lines to your `bashrc` (remember to change the path)
>
> ```
> # add for spark
> export SPARK_HOME=your_spark_installation_path
> export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
> export PYSPARK_DRIVER_PYTHON="jupyter"
> export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
>
> ```
>
> At last, remember to source your `bashrc`
>
> ```
> source ~/.bashrc
>
> ```
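With the variables above in place, running `pyspark` should launch a Jupyter notebook backed by Spark. As a quick sanity check, the sketch below assumes only the `spark` and `sc` objects that the `pyspark` shell creates by default:

```py
# Inside the pyspark shell / notebook: the shell already provides a
# SparkSession named `spark` and a SparkContext named `sc`.
print(spark.version)   # the Spark version you unpacked
print(sc.master)       # e.g. local[*] for a plain local setup

# Build a tiny DataFrame to confirm the session works end to end.
df = spark.createDataFrame([(1, "spark"), (2, "pyspark")], ["id", "name"])
df.show()
```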
## 3.3\. Configure Spark on Windows
Installing open source software on Windows has always been a nightmare for me. Thanks to Deelesh Mandloi, you can follow the detailed procedure in the blog post [Getting Started with PySpark on Windows](http://deelesh.github.io/pyspark-windows.html) to install Apache Spark™ on your Windows operating system.
## 3.4\. PySpark With Text Editor or IDE
### 3.4.1\. PySpark With Jupyter Notebook
After you finish the setup steps in [Configure Spark on Mac and Ubuntu](#set-up-ubuntu), you should be ready to write and run your PySpark code in a Jupyter notebook.
> ![https://runawayhorse001.github.io/LearningApacheSpark/_images/jupyterWithPySpark.png](img/90a1240e7489f989b9a4e5739b1efbd5.jpg)
### 3.4.2\. PySpark With Apache Zeppelin
After you finish the setup steps in [Configure Spark on Mac and Ubuntu](#set-up-ubuntu), you should be ready to write and run your PySpark code in Apache Zeppelin.
> ![https://runawayhorse001.github.io/LearningApacheSpark/_images/zeppelin.png](img/067197a5eeb69cc2f3d828a92ebcf52e.jpg)
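For reference, a Zeppelin paragraph bound to the PySpark interpreter could look like the sketch below. The `%pyspark` directive and the injected `spark` and `z` (ZeppelinContext) objects are assumptions based on a standard Zeppelin-with-Spark configuration:

```py
# %pyspark   <- the first line of the Zeppelin paragraph selects the PySpark interpreter.
# Zeppelin injects a SparkSession (`spark`) and the Zeppelin context (`z`)
# into paragraphs bound to that interpreter.
df = spark.createDataFrame([(1, "zeppelin"), (2, "pyspark")], ["id", "name"])
z.show(df)  # render the DataFrame with Zeppelin's built-in table display
```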
### 3.4.3\. PySpark With Sublime Text
After you finish the setup steps in [Configure Spark on Mac and Ubuntu](#set-up-ubuntu), you should be able to use Sublime Text to write your PySpark code and run it as a normal Python script in the terminal.
> ```
> python test_pyspark.py
>
> ```
Then you should get the output results in your terminal.
> ![https://runawayhorse001.github.io/LearningApacheSpark/_images/sublimeWithPySpark.png](img/c51fb942d508d4161e72d0075a5284e7.jpg)
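If you want a small stand-alone script to try this with (the real `test_pyspark.py` is linked in section 3.7), the sketch below is one possibility. It assumes the `pyspark` package is importable from your Python environment, for example via `pip install pyspark` or by adding `$SPARK_HOME/python` and its bundled `py4j` zip to `PYTHONPATH`:

```py
# test_pyspark-style sketch: run with `python test_pyspark.py`.
# Unlike the pyspark shell, a plain Python script must create its own SparkSession.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("test_pyspark").getOrCreate()
spark.range(5).show()  # tiny sanity check: prints ids 0..4
spark.stop()
```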
### 3.4.4\. PySpark With Eclipse
If you want to run PySpark code in Eclipse, you need to add the paths of the **External Libraries** for your **Current Project** as follows:
> 1. Open the properties of your project
>
> > ![https://runawayhorse001.github.io/LearningApacheSpark/_images/PyDevProperties.png](img/f18ecec7a6c176301d7370e41a0a60dd.jpg)
>
> 2. Add the paths for the **External Libraries**
>
> > ![https://runawayhorse001.github.io/LearningApacheSpark/_images/pydevPath.png](img/197517339d2ce744dd0a46c607e84534.jpg)
Then you should be ready to run your code in Eclipse with PyDev.
> ![https://runawayhorse001.github.io/LearningApacheSpark/_images/pysparkWithEclipse.png](img/6f2adb68d3f0a7f1f3af2ef044441071.jpg)
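If you prefer not to configure the project properties by hand, an alternative sketch is to extend `sys.path` at the top of your script so that PyDev can resolve the PySpark modules; the `SPARK_HOME` fallback and the `py4j` zip name below are assumptions about a typical Spark layout:

```py
# Make the PySpark and Py4J modules shipped with Spark importable from an IDE.
import glob
import os
import sys

spark_home = os.environ.get("SPARK_HOME", "/path/to/your/spark")  # hypothetical default
sys.path.insert(0, os.path.join(spark_home, "python"))
py4j_zips = glob.glob(os.path.join(spark_home, "python", "lib", "py4j-*-src.zip"))
if py4j_zips:
    sys.path.insert(0, py4j_zips[0])

from pyspark.sql import SparkSession  # resolves once the paths above are visible
```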
## 3.5\. PySparkling Water: Spark + H2O

1. Download `Sparkling Water` from: [https://s3.amazonaws.com/h2o-release/sparkling-water/rel-2.4/5/index.html](https://s3.amazonaws.com/h2o-release/sparkling-water/rel-2.4/5/index.html)

2. Test PySparkling

```bash
unzip sparkling-water-2.4.5.zip
cd ~/sparkling-water-2.4.5/bin
./pysparkling
```

If your PySpark setup is correct, you will get the following results:

```
Using Spark defined in the SPARK_HOME=/Users/dt216661/spark environmental property
Python 3.7.1 (default, Dec 14 2018, 13:28:58)
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
2019-02-15 14:08:30 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
2019-02-15 14:08:31 WARN Utils:66 - Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
2019-02-15 14:08:31 WARN Utils:66 - Service 'SparkUI' could not bind on port 4041. Attempting port 4042.
17/08/30 13:30:12 WARN NativeCodeLoader: Unable to load native-hadoop
library for your platform... using builtin-java classes where applicable
17/08/30 13:30:17 WARN ObjectStore: Failed to get database global_temp,
returning NoSuchObjectException
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.4.0
      /_/

Using Python version 3.7.1 (default, Dec 14 2018 13:28:58)
SparkSession available as 'spark'.
```
3. Set up `pysparkling` with Jupyter notebook

Add the following alias to your `bashrc` (Linux systems) or `bash_profile` (Mac systems):

```bash
alias sparkling='PYSPARK_DRIVER_PYTHON="ipython" PYSPARK_DRIVER_PYTHON_OPTS="notebook" ~/sparkling-water-2.4.5/bin/pysparkling'
```
4. Open `pysparkling` in terminal
```bash
sparkling
```
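Inside the `pysparkling` shell (or the notebook opened through the alias above) you can then start H2O on top of the Spark session. The call below is only a sketch based on the Sparkling Water 2.4.x line; `H2OContext.getOrCreate(spark)` is assumed to be the entry point exposed by the bundled `pysparkling` package:

```py
# Sketch: start an H2O cluster on top of the running SparkSession.
# `spark` is predefined by the pysparkling shell; the API is assumed from
# the Sparkling Water 2.4.x documentation.
from pysparkling import H2OContext

hc = H2OContext.getOrCreate(spark)
print(hc)  # prints the H2O cluster / Flow UI connection details
```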
## 3.6\. Set up Spark on Cloud
Following the setup steps in [Configure Spark on Mac and Ubuntu](#set-up-ubuntu), you can set up your own cluster on the cloud, for example on AWS or Google Cloud. Actually, these clouds have their own big data tools; you can run them directly without any setup, just like Databricks Community Cloud. If you want more details, please feel free to contact me.
## 3.7\. Demo Code in this Section
The code for this section is available for download as [test_pyspark](static/test_pyspark.py), and the Jupyter notebook can be downloaded from [test_pyspark_ipynb](static/test_pyspark.ipynb).
* Python Source code
> ```
> ## set up SparkSession
> from pyspark.sql import SparkSession
>
> spark = SparkSession \
> .builder \
> .appName("Python Spark SQL basic example") \
> .config("spark.some.config.option", "some-value") \
> .getOrCreate()
>
> df = spark.read.format('com.databricks.spark.csv').\
> options(header='true', \
> inferschema='true').\
> load("/home/feng/Spark/Code/data/Advertising.csv",header=True)
>
> df.show(5)
> df.printSchema()
>
> ```