# 4\. An Introduction to Apache Spark
Chinese proverb
**Know yourself and know your enemy, and you will never be defeated.** – Sunzi's *Art of War*
## 4.1\. Core Concepts
Most of the following content comes from [[Kirillov2016]](reference.html#kirillov2016), so the copyright belongs to **Anton Kirillov**. For more details, see [Apache Spark core concepts, architecture and internals](http://datastrophic.io/core-concepts-architecture-and-internals-of-apache-spark/).
Before diving deep into how Apache Spark works, let's go over the jargon of Apache Spark (a short PySpark sketch follows the list):
* Job: a piece of code that reads some input from HDFS or the local filesystem, performs some computation on the data, and writes some output data.
* Stage: jobs are divided into stages. Stages are classified as map or reduce stages (this is easier to understand if you have worked on Hadoop and want to correlate). Stages are divided based on computational boundaries; not all computations (operators) can happen in a single stage, so a job is split across many stages.
* Task: each stage has some tasks, one task per partition. One task is executed on one partition of data on one executor (machine).
* DAG: DAG stands for Directed Acyclic Graph; in the present context it is a DAG of operators.
* Executor: the process responsible for executing a task.
* Master: the machine on which the driver program runs.
* Slave: the machine on which the executor program runs.
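As a rough illustration of these terms (not part of the original text; `data.txt` is a placeholder path), the following PySpark sketch triggers one job that runs as two stages:

```python
# A minimal sketch, assuming a local text file "data.txt" (placeholder path).
from pyspark import SparkContext

sc = SparkContext("local[2]", "jargon-demo")

lines = sc.textFile("data.txt", 4)                # 4 partitions -> 4 tasks per stage
pairs = lines.flatMap(lambda l: l.split()) \
             .map(lambda w: (w, 1))               # narrow ops stay in the first stage
counts = pairs.reduceByKey(lambda a, b: a + b)    # shuffle -> second stage

result = counts.collect()                         # the action submits one job
sc.stop()
```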
## 4.2\. Spark Components

![https://runawayhorse001.github.io/LearningApacheSpark/_images/spark-components.png](img/f4e95f92187a42f257864cd22193c8ad.jpg)

1. Spark Driver

    * a separate process that executes the user application
    * creates a `SparkContext` to schedule job execution and negotiate with the cluster manager

2. Executors

    * run tasks scheduled by the driver
    * store computation results in memory, on disk, or off-heap
    * interact with storage systems

3. Cluster Manager (chosen via the master URL; see the sketch after this list)

    * Mesos
    * YARN
    * Spark Standalone
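The cluster manager is selected when the `SparkContext` is created, via the master URL. Here is a minimal sketch (not from the original text; the host names and ports are placeholders):

```python
# A minimal sketch; exactly one setMaster line should be active, and the
# host names and ports below are placeholders.
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("cluster-manager-demo")

conf.setMaster("local[*]")                     # no cluster manager: run locally
# conf.setMaster("spark://master-host:7077")   # Spark Standalone
# conf.setMaster("yarn")                       # YARN (reads HADOOP_CONF_DIR)
# conf.setMaster("mesos://mesos-host:5050")    # Mesos

sc = SparkContext(conf=conf)
print(sc.master)
sc.stop()
```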
The Spark Driver contains more components, responsible for translating user code into actual jobs executed on the cluster:
![https://runawayhorse001.github.io/LearningApacheSpark/_images/spark-components1.png](img/0ff87df50cf4610da54dd94b51c6d809.jpg)

* `SparkContext`

    * represents the connection to a Spark cluster, and can be used to create RDDs, accumulators, and broadcast variables on that cluster (see the sketch after this list)

* `DAGScheduler`

    * computes a DAG of stages for each job and submits them to the `TaskScheduler`; determines preferred locations for tasks (based on cache status or shuffle file locations) and finds a minimum schedule to run the jobs

* `TaskScheduler`

    * responsible for sending tasks to the cluster, running them, retrying if there are failures, and mitigating stragglers

* `SchedulerBackend`

    * backend interface for scheduling systems that allows plugging in different implementations (Mesos, YARN, Standalone, local)

* `BlockManager`

    * provides interfaces for putting and retrieving blocks both locally and remotely into various stores (memory, disk, and off-heap)
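To make the `SparkContext` bullet concrete, here is a small PySpark sketch (the data and names are arbitrary demo values, not from the original text) that creates an RDD, an accumulator, and a broadcast variable:

```python
# A minimal sketch with arbitrary demo data.
from pyspark import SparkContext

sc = SparkContext("local[2]", "driver-demo")

rdd = sc.parallelize(range(10), 2)              # an RDD with 2 partitions
negatives = sc.accumulator(0)                   # write-only shared counter
parity = sc.broadcast({0: "even", 1: "odd"})    # read-only shared lookup table

def tag(n):
    if n < 0:
        negatives.add(1)                        # executors add; only the driver reads
    return (n, parity.value[n % 2])             # every task reads the broadcast value

print(rdd.map(tag).collect())
print("negatives seen:", negatives.value)       # read back on the driver
sc.stop()
```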
## 4.3\. Architecture
## 4.4\. How Spark Works
Spark has a small code base, and the system is divided into various layers. Each layer has some responsibilities. The layers are independent of each other.
The first layer is the interpreter; Spark uses a Scala interpreter with some modifications. As you enter your code in the Spark console (creating RDDs and applying operators), Spark creates an operator graph. When the user runs an action (like collect), the graph is submitted to a DAG scheduler. The DAG scheduler divides the operator graph into (map and reduce) stages. A stage consists of tasks based on partitions of the input data. The DAG scheduler pipelines operators together to optimize the graph. For example, many map operators can be scheduled in a single stage. This optimization is key to Spark's performance. The final result of the DAG scheduler is a set of stages. The stages are passed on to the task scheduler. The task scheduler launches tasks via the cluster manager (Spark Standalone/YARN/Mesos). The task scheduler doesn't know about dependencies among stages.
> ![https://runawayhorse001.github.io/LearningApacheSpark/_images/work_flow.png](img/0ebcb4677d2131e71e039be8ea955cff.jpg)
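One rough way to observe this pipelining (not from the original text) is `RDD.toDebugString`, which prints the lineage the DAG scheduler works from; a change of indentation in its output marks a shuffle, i.e. a stage boundary:

```python
# A minimal sketch; toDebugString prints the RDD lineage (it returns bytes
# in recent PySpark versions, hence the decode).
from pyspark import SparkContext

sc = SparkContext("local[2]", "dag-demo")

rdd = (sc.parallelize(["a b", "b c", "a c"], 3)   # 3 partitions -> 3 tasks per stage
         .flatMap(lambda l: l.split())            # pipelined together ...
         .map(lambda w: (w, 1))                   # ... into a single map stage
         .reduceByKey(lambda a, b: a + b))        # shuffle starts a new stage

print(rdd.toDebugString().decode())               # indentation marks the stage boundary
sc.stop()
```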