Unverified · Commit 390928dd authored by ImPerat0R_, committed by GitHub

Merge pull request #62 from zhongjiajie/remove_en_folder

Remove en folder
......@@ -62,8 +62,6 @@
+ Terminology usage
+ Code formatting
The original text is at `http://airflow.apachecn.org/en/{name}.html`, with the same file name.
### 3. Submitting
+ `fork` the GitHub project
......
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head><title></title>
<link href="../style/ebook.css" type="text/css" rel="stylesheet">
</head>
<body>
<h1>Apache Airflow (incubating) Documentation</h1>
<p>From: <a href="https://airflow.apache.org/">https://airflow.apache.org/</a></p>
</body>
</html>
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head><title></title>
<link href="../style/ebook.css" type="text/css" rel="stylesheet">
</head>
<body>
<h1>Project</h1>
<div class="section" id="history">
<h2 class="sigil_not_in_toc">History</h2>
<p>Airflow was started in October 2014 by Maxime Beauchemin at Airbnb.
It was open source from the very first commit and officially brought under
the Airbnb Github and announced in June 2015.</p>
<p>The project joined the Apache Software Foundation&#x2019;s incubation program in March 2016.</p>
</div>
<div class="section" id="committers">
<h2 class="sigil_not_in_toc">Committers</h2>
<ul class="simple">
<li>@mistercrunch (Maxime &#x201C;Max&#x201D; Beauchemin)</li>
<li>@r39132 (Siddharth &#x201C;Sid&#x201D; Anand)</li>
<li>@criccomini (Chris Riccomini)</li>
<li>@bolkedebruin (Bolke de Bruin)</li>
<li>@artwr (Arthur Wiedmer)</li>
<li>@jlowin (Jeremiah Lowin)</li>
<li>@patrickleotardif (Patrick Leo Tardif)</li>
<li>@aoen (Dan Davydov)</li>
<li>@syvineckruyk (Steven Yvinec-Kruyk)</li>
<li>@msumit (Sumit Maheshwari)</li>
<li>@alexvanboxel (Alex Van Boxel)</li>
<li>@saguziel (Alex Guziel)</li>
<li>@joygao (Joy Gao)</li>
<li>@fokko (Fokko Driesprong)</li>
<li>@ash (Ash Berlin-Taylor)</li>
<li>@kaxilnaik (Kaxil Naik)</li>
<li>@feng-tao (Tao Feng)</li>
</ul>
<p>For the full list of contributors, take a look at <a class="reference external" href="https://github.com/apache/incubator-airflow/graphs/contributors">Airflow&#x2019;s Github
Contributor page:</a></p>
</div>
<div class="section" id="resources-links">
<h2 class="sigil_not_in_toc">Resources &amp; links</h2>
<ul class="simple">
<li><a class="reference external" href="http://airflow.apache.org/">Airflow&#x2019;s official documentation</a></li>
<li>Mailing list (send emails to
<code class="docutils literal notranslate"><span class="pre">dev-subscribe@airflow.incubator.apache.org</span></code> and/or
<code class="docutils literal notranslate"><span class="pre">commits-subscribe@airflow.incubator.apache.org</span></code>
to subscribe to each)</li>
<li><a class="reference external" href="https://issues.apache.org/jira/browse/AIRFLOW">Issues on Apache&#x2019;s Jira</a></li>
<li><a class="reference external" href="https://gitter.im/airbnb/airflow">Gitter (chat) Channel</a></li>
<li><a class="reference external" href="https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+Links">More resources and links to Airflow related content on the Wiki</a></li>
</ul>
</div>
<div class="section" id="roadmap">
<h2 class="sigil_not_in_toc">Roadmap</h2>
<p>Please refer to the Roadmap on <a class="reference external" href="https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+Home">the wiki</a></p>
</div>
</body>
</html>
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head><title></title>
<link href="../style/ebook.css" type="text/css" rel="stylesheet">
</head>
<body>
<h1>Managing Connections</h1>
<p>Airflow needs to know how to connect to your environment. Information
such as hostname, port, login and passwords to other systems and services is
handled in the <code class="docutils literal notranslate"><span class="pre">Admin-&gt;Connection</span></code> section of the UI. The pipeline code you
will author will reference the &#x2018;conn_id&#x2019; of the Connection objects.</p>
<img alt="https://airflow.apache.org/_images/connections.png" src="../img/b1caba93dd8fce8b3c81bfb0d58cbf95.jpg">
<p>Connections can be created and managed using either the UI or environment
variables.</p>
<p>See the <a class="reference internal" href="../concepts.html#concepts-connections"><span class="std std-ref">Connections Concepts</span></a> documentation for
more information.</p>
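<p>As a quick, hedged illustration (not part of the original page), pipeline code typically names only the <code class="docutils literal notranslate"><span class="pre">conn_id</span></code> and lets a hook resolve the rest. The connection name <code class="docutils literal notranslate"><span class="pre">postgres_master</span></code> and the table below are assumed placeholders, and the import path is the 1.x location.</p>
<div class="highlight-python notranslate"><div class="highlight"><pre># Minimal sketch: reference a connection purely by its conn_id.
# Assumes a Postgres connection 'postgres_master' was created beforehand.
from airflow.hooks.postgres_hook import PostgresHook

def fetch_row_count():
    hook = PostgresHook(postgres_conn_id='postgres_master')  # conn_id, not a URI
    return hook.get_first('SELECT COUNT(*) FROM my_table')[0]
</pre>
</div>
</div>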
<div class="section" id="creating-a-connection-with-the-ui">
<h2 class="sigil_not_in_toc">Creating a Connection with the UI</h2>
<p>Open the <code class="docutils literal notranslate"><span class="pre">Admin-&gt;Connection</span></code> section of the UI. Click the <code class="docutils literal notranslate"><span class="pre">Create</span></code> link
to create a new connection.</p>
<img alt="https://airflow.apache.org/_images/connection_create.png" src="../img/635aacab53c55192ad3e31c28e65eb43.jpg">
<ol class="arabic simple">
<li>Fill in the <code class="docutils literal notranslate"><span class="pre">Conn</span> <span class="pre">Id</span></code> field with the desired connection ID. It is
recommended that you use lower-case characters and separate words with
underscores.</li>
<li>Choose the connection type with the <code class="docutils literal notranslate"><span class="pre">Conn</span> <span class="pre">Type</span></code> field.</li>
<li>Fill in the remaining fields. See
<a class="reference internal" href="#manage-connections-connection-types"><span class="std std-ref">Connection Types</span></a> for a description of the fields
belonging to the different connection types.</li>
<li>Click the <code class="docutils literal notranslate"><span class="pre">Save</span></code> button to create the connection.</li>
</ol>
</div>
<div class="section" id="editing-a-connection-with-the-ui">
<h2 class="sigil_not_in_toc">Editing a Connection with the UI</h2>
<p>Open the <code class="docutils literal notranslate"><span class="pre">Admin-&gt;Connection</span></code> section of the UI. Click the pencil icon next
to the connection you wish to edit in the connection list.</p>
<img alt="https://airflow.apache.org/_images/connection_edit.png" src="../img/08e0f3fedf871b535c850d202dda1422.jpg">
<p>Modify the connection properties and click the <code class="docutils literal notranslate"><span class="pre">Save</span></code> button to save your
changes.</p>
</div>
<div class="section" id="creating-a-connection-with-environment-variables">
<h2 class="sigil_not_in_toc">Creating a Connection with Environment Variables</h2>
<p>Connections in Airflow pipelines can be created using environment variables.
The environment variable needs to have a prefix of <code class="docutils literal notranslate"><span class="pre">AIRFLOW_CONN_</span></code>, and
its value must be in a URI format for Airflow to use the connection properly.</p>
<p>When referencing the connection in the Airflow pipeline, the <code class="docutils literal notranslate"><span class="pre">conn_id</span></code>
should be the name of the variable without the prefix. For example, if the
<code class="docutils literal notranslate"><span class="pre">conn_id</span></code> is named <code class="docutils literal notranslate"><span class="pre">postgres_master</span></code> the environment variable should be
named <code class="docutils literal notranslate"><span class="pre">AIRFLOW_CONN_POSTGRES_MASTER</span></code> (note that the environment variable
must be all uppercase). Airflow assumes the value returned from the
environment variable to be in a URI format (e.g.
<code class="docutils literal notranslate"><span class="pre">postgres://user:password@localhost:5432/master</span></code> or
<code class="docutils literal notranslate"><span class="pre">s3://accesskey:secretkey@S3</span></code>).</p>
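<p>A hedged sketch of the round trip follows. In practice the variable is exported in the shell or service environment; <code class="docutils literal notranslate"><span class="pre">os.environ</span></code> is used here only to keep the example self-contained, and the credentials are placeholders.</p>
<div class="highlight-python notranslate"><div class="highlight"><pre># The conn_id used in code is the variable name without the AIRFLOW_CONN_ prefix.
import os

os.environ['AIRFLOW_CONN_POSTGRES_MASTER'] = (
    'postgres://user:password@localhost:5432/master'
)

from airflow.hooks.postgres_hook import PostgresHook
hook = PostgresHook(postgres_conn_id='postgres_master')  # resolves via the env var
</pre>
</div>
</div>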
</div>
<div class="section" id="connection-types">
<span id="manage-connections-connection-types"></span><h2 class="sigil_not_in_toc">Connection Types</h2>
<div class="section" id="google-cloud-platform">
<span id="connection-type-gcp"></span><h3 class="sigil_not_in_toc">Google Cloud Platform</h3>
<p>The Google Cloud Platform connection type enables the <a class="reference internal" href="../integration.html#gcp"><span class="std std-ref">GCP Integrations</span></a>.</p>
<div class="section" id="authenticating-to-gcp">
<h4 class="sigil_not_in_toc">Authenticating to GCP</h4>
<p>There are two ways to connect to GCP using Airflow.</p>
<ol class="arabic simple">
<li>Use <a class="reference external" href="https://google-auth.readthedocs.io/en/latest/reference/google.auth.html#google.auth.default">Application Default Credentials</a>,
such as via the metadata server when running on Google Compute Engine.</li>
<li>Use a <a class="reference external" href="https://cloud.google.com/docs/authentication/#service_accounts">service account</a> key
file (JSON format) on disk.</li>
</ol>
</div>
<div class="section" id="default-connection-ids">
<h4 class="sigil_not_in_toc">Default Connection IDs</h4>
<p>The following connection IDs are used by default.</p>
<pre>bigquery_default</pre>
Used by the <a class="reference internal" href="../integration.html#airflow.contrib.hooks.bigquery_hook.BigQueryHook" title="airflow.contrib.hooks.bigquery_hook.BigQueryHook"><code class="xref py py-class docutils literal notranslate"><span class="pre">BigQueryHook</span></code></a>
hook.
<pre>google_cloud_datastore_default</pre>
Used by the <a class="reference internal" href="../integration.html#airflow.contrib.hooks.datastore_hook.DatastoreHook" title="airflow.contrib.hooks.datastore_hook.DatastoreHook"><code class="xref py py-class docutils literal notranslate"><span class="pre">DatastoreHook</span></code></a>
hook.
<pre>google_cloud_default</pre>
Used by the
<a class="reference internal" href="../code.html#airflow.contrib.hooks.gcp_api_base_hook.GoogleCloudBaseHook" title="airflow.contrib.hooks.gcp_api_base_hook.GoogleCloudBaseHook"><code class="xref py py-class docutils literal notranslate"><span class="pre">GoogleCloudBaseHook</span></code></a>,
<a class="reference internal" href="../integration.html#airflow.contrib.hooks.gcp_dataflow_hook.DataFlowHook" title="airflow.contrib.hooks.gcp_dataflow_hook.DataFlowHook"><code class="xref py py-class docutils literal notranslate"><span class="pre">DataFlowHook</span></code></a>,
<a class="reference internal" href="../code.html#airflow.contrib.hooks.gcp_dataproc_hook.DataProcHook" title="airflow.contrib.hooks.gcp_dataproc_hook.DataProcHook"><code class="xref py py-class docutils literal notranslate"><span class="pre">DataProcHook</span></code></a>,
<a class="reference internal" href="../integration.html#airflow.contrib.hooks.gcp_mlengine_hook.MLEngineHook" title="airflow.contrib.hooks.gcp_mlengine_hook.MLEngineHook"><code class="xref py py-class docutils literal notranslate"><span class="pre">MLEngineHook</span></code></a>, and
<a class="reference internal" href="../integration.html#airflow.contrib.hooks.gcs_hook.GoogleCloudStorageHook" title="airflow.contrib.hooks.gcs_hook.GoogleCloudStorageHook"><code class="xref py py-class docutils literal notranslate"><span class="pre">GoogleCloudStorageHook</span></code></a> hooks.
</div>
<div class="section" id="configuring-the-connection">
<h4 class="sigil_not_in_toc">Configuring the Connection</h4>
<pre>Project Id (required)</pre>
The Google Cloud project ID to connect to.
<pre>Keyfile Path</pre>
<p class="first">Path to a <a class="reference external" href="https://cloud.google.com/docs/authentication/#service_accounts">service account</a> key
file (JSON format) on disk.</p>
<p class="last">Not required if using application default credentials.</p>
<pre>Keyfile JSON</pre>
<p class="first">Contents of a <a class="reference external" href="https://cloud.google.com/docs/authentication/#service_accounts">service account</a> key
file (JSON format). It is recommended to <a class="reference internal" href="secure-connections.html"><span class="doc">Secure your connections</span></a> if using this method to authenticate.</p>
<p class="last">Not required if using application default credentials.</p>
<pre>Scopes (comma separated)</pre>
<p class="first">A list of comma-separated <a class="reference external" href="https://developers.google.com/identity/protocols/googlescopes">Google Cloud scopes</a> to
authenticate with.</p>
<div class="last admonition note">
<p class="first admonition-title">Note</p>
<p class="last">Scopes are ignored when using application default credentials. See
issue <a class="reference external" href="https://issues.apache.org/jira/browse/AIRFLOW-2522">AIRFLOW-2522</a>.</p>
</div>
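<p>For completeness, here is a hedged sketch of registering such a connection from code instead of the UI. The <code class="docutils literal notranslate"><span class="pre">extra__google_cloud_platform__*</span></code> field names mirror the form fields above but should be treated as assumptions, and the project and keyfile path are placeholders.</p>
<div class="highlight-python notranslate"><div class="highlight"><pre># Illustrative only: create a GCP connection programmatically.
import json
from airflow import settings
from airflow.models import Connection

conn = Connection(
    conn_id='google_cloud_default',
    conn_type='google_cloud_platform',
    extra=json.dumps({
        'extra__google_cloud_platform__project': 'my-gcp-project',         # assumed field name
        'extra__google_cloud_platform__key_path': '/path/to/keyfile.json',  # assumed field name
    }),
)
session = settings.Session()
session.add(conn)
session.commit()
</pre>
</div>
</div>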
</div>
</div>
</div>
</body>
</html>
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head><title></title>
<link href="../style/ebook.css" type="text/css" rel="stylesheet">
</head>
<body>
<h1>Securing Connections</h1>
<p>By default, Airflow will save the passwords for the connection in plain text
within the metadata database. The <code class="docutils literal notranslate"><span class="pre">crypto</span></code> package is highly recommended
during installation. The <code class="docutils literal notranslate"><span class="pre">crypto</span></code> package does require that your operating
system have libffi-dev installed.</p>
<p>If the <code class="docutils literal notranslate"><span class="pre">crypto</span></code> package was not installed initially, you can still enable encryption for
connections by following steps below:</p>
<ol class="arabic simple">
<li>Install crypto package <code class="docutils literal notranslate"><span class="pre">pip</span> <span class="pre">install</span> <span class="pre">apache-airflow[crypto]</span></code></li>
<li>Generate fernet_key, using this code snippet below. fernet_key must be a base64-encoded 32-byte key.</li>
</ol>
<div class="code python highlight-default notranslate"><div class="highlight"><pre><span></span><span class="kn">from</span> <span class="nn">cryptography.fernet</span> <span class="k">import</span> <span class="n">Fernet</span>
<span class="n">fernet_key</span><span class="o">=</span> <span class="n">Fernet</span><span class="o">.</span><span class="n">generate_key</span><span class="p">()</span>
<span class="nb">print</span><span class="p">(</span><span class="n">fernet_key</span><span class="o">.</span><span class="n">decode</span><span class="p">())</span> <span class="c1"># your fernet_key, keep it in secured place!</span>
</pre>
</div>
</div>
<p>3. Replace the <code class="docutils literal notranslate"><span class="pre">airflow.cfg</span></code> fernet_key value with the one from step 2.
Alternatively, you can store your fernet_key in an OS environment variable. In that case you
do not need to change <code class="docutils literal notranslate"><span class="pre">airflow.cfg</span></code>, as Airflow will prefer the environment
variable over the value in <code class="docutils literal notranslate"><span class="pre">airflow.cfg</span></code>:</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span><span class="c1"># Note the double underscores</span>
<span class="nb">export</span> <span class="nv">AIRFLOW__CORE__FERNET_KEY</span><span class="o">=</span>your_fernet_key
</pre>
</div>
</div>
<ol class="arabic simple" start="4">
<li>Restart Airflow webserver.</li>
<li>For existing connections (the ones that you had defined before installing <code class="docutils literal notranslate"><span class="pre">airflow[crypto]</span></code> and creating a Fernet key), you need to open each connection in the connection admin UI, re-type the password, and save it.</li>
</ol>
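<p>If you want to confirm that a re-saved connection is now encrypted, a small hedged check against the metadata database can help; the connection id below is just an example.</p>
<div class="highlight-python notranslate"><div class="highlight"><pre># Sketch: inspect the is_encrypted flag on a connection after re-saving it.
from airflow import settings
from airflow.models import Connection

session = settings.Session()
conn = (session.query(Connection)
        .filter(Connection.conn_id == 'postgres_master')
        .first())
print(conn.is_encrypted)  # expected to be True once the password was re-typed and saved
</pre>
</div>
</div>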
</body>
</html>
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head><title></title>
<link href="../style/ebook.css" type="text/css" rel="stylesheet">
</head>
<body>
<h1>Writing Logs</h1>
<div class="section" id="writing-logs-locally">
<h2 class="sigil_not_in_toc">Writing Logs Locally</h2>
<p>Users can specify a logs folder in <code class="docutils literal notranslate"><span class="pre">airflow.cfg</span></code> using the
<code class="docutils literal notranslate"><span class="pre">base_log_folder</span></code> setting. By default, it is in the <code class="docutils literal notranslate"><span class="pre">AIRFLOW_HOME</span></code>
directory.</p>
<p>In addition, users can supply a remote location for storing logs and log
backups in cloud storage.</p>
<p>In the Airflow Web UI, local logs take precedence over remote logs. If local logs
can not be found or accessed, the remote logs will be displayed. Note that logs
are only sent to remote storage once a task completes (including failure). In other
words, remote logs for running tasks are unavailable. Logs are stored in the log
folder as <code class="docutils literal notranslate"><span class="pre">{dag_id}/{task_id}/{execution_date}/{try_number}.log</span></code>.</p>
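<p>Purely as an illustration of the layout above (all values below are made up), a task log path can be assembled like this:</p>
<div class="highlight-python notranslate"><div class="highlight"><pre># Sketch: build a local log path following the default template.
import os

base_log_folder = os.path.expanduser('~/airflow/logs')  # base_log_folder from airflow.cfg
log_path = os.path.join(
    base_log_folder,
    '{dag_id}/{task_id}/{execution_date}/{try_number}.log'.format(
        dag_id='example_bash_operator',
        task_id='run_this_last',
        execution_date='2017-10-03T00:00:00',
        try_number=1,
    ),
)
print(log_path)
</pre>
</div>
</div>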
</div>
<div class="section" id="writing-logs-to-amazon-s3">
<span id="write-logs-amazon"></span><h2 class="sigil_not_in_toc">Writing Logs to Amazon S3</h2>
<div class="section" id="before-you-begin">
<h3 class="sigil_not_in_toc">Before you begin</h3>
<p>Remote logging uses an existing Airflow connection to read/write logs. If you
don&#x2019;t have a connection properly setup, this will fail.</p>
</div>
<div class="section" id="enabling-remote-logging">
<h3 class="sigil_not_in_toc">Enabling remote logging</h3>
<p>To enable this feature, <code class="docutils literal notranslate"><span class="pre">airflow.cfg</span></code> must be configured as in this
example:</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span><span class="o">[</span>core<span class="o">]</span>
<span class="c1"># Airflow can store logs remotely in AWS S3. Users must supply a remote</span>
<span class="c1"># location URL (starting with &apos;s3://...&apos;) and an Airflow connection</span>
<span class="c1"># id that provides access to the storage location.</span>
<span class="nv">remote_base_log_folder</span> <span class="o">=</span> s3://my-bucket/path/to/logs
<span class="nv">remote_log_conn_id</span> <span class="o">=</span> MyS3Conn
<span class="c1"># Use server-side encryption for logs stored in S3</span>
<span class="nv">encrypt_s3_logs</span> <span class="o">=</span> False
</pre>
</div>
</div>
<p>In the above example, Airflow will try to use <code class="docutils literal notranslate"><span class="pre">S3Hook(&apos;MyS3Conn&apos;)</span></code>.</p>
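<p>Before relying on remote logging, it can help to confirm that the connection actually reaches the bucket. A hedged sanity check, reusing the connection and bucket names from the example above (the import path is the 1.x location):</p>
<div class="highlight-python notranslate"><div class="highlight"><pre># Sketch: verify the S3 connection used for remote logging.
from airflow.hooks.S3_hook import S3Hook

hook = S3Hook('MyS3Conn')
print(hook.check_for_bucket('my-bucket'))  # True if the credentials can see the bucket
</pre>
</div>
</div>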
</div>
</div>
<div class="section" id="writing-logs-to-azure-blob-storage">
<span id="write-logs-azure"></span><h2 class="sigil_not_in_toc">Writing Logs to Azure Blob Storage</h2>
<p>Airflow can be configured to read and write task logs in Azure Blob Storage.
Follow the steps below to enable Azure Blob Storage logging.</p>
<ol class="arabic">
<li><p class="first">Airflow&#x2019;s logging system requires a custom .py file to be located in the <code class="docutils literal notranslate"><span class="pre">PYTHONPATH</span></code>, so that it&#x2019;s importable from Airflow. Start by creating a directory to store the config file. <code class="docutils literal notranslate"><span class="pre">$AIRFLOW_HOME/config</span></code> is recommended.</p>
</li>
<li><p class="first">Create empty files called <code class="docutils literal notranslate"><span class="pre">$AIRFLOW_HOME/config/log_config.py</span></code> and <code class="docutils literal notranslate"><span class="pre">$AIRFLOW_HOME/config/__init__.py</span></code>.</p>
</li>
<li><p class="first">Copy the contents of <code class="docutils literal notranslate"><span class="pre">airflow/config_templates/airflow_local_settings.py</span></code> into the <code class="docutils literal notranslate"><span class="pre">log_config.py</span></code> file that was just created in the step above.</p>
</li>
<li><p class="first">Customize the following portions of the template:</p>
<blockquote>
<div><div class="highlight-bash notranslate"><div class="highlight"><pre><span></span><span class="c1"># wasb buckets should start with &quot;wasb&quot; just to help Airflow select correct handler</span>
<span class="nv">REMOTE_BASE_LOG_FOLDER</span> <span class="o">=</span> <span class="s1">&apos;wasb-&lt;whatever you want here&gt;&apos;</span>
<span class="c1"># Rename DEFAULT_LOGGING_CONFIG to LOGGING_CONFIG</span>
<span class="nv">LOGGING_CONFIG</span> <span class="o">=</span> ...
</pre>
</div>
</div>
</div>
</blockquote>
</li>
<li><p class="first">Make sure an Azure Blob Storage (Wasb) connection hook has been defined in Airflow. The hook should have read and write access to the Azure Blob Storage bucket defined above in <code class="docutils literal notranslate"><span class="pre">REMOTE_BASE_LOG_FOLDER</span></code>.</p>
</li>
<li><p class="first">Update <code class="docutils literal notranslate"><span class="pre">$AIRFLOW_HOME/airflow.cfg</span></code> to contain:</p>
<blockquote>
<div><div class="highlight-bash notranslate"><div class="highlight"><pre><span></span><span class="nv">remote_logging</span> <span class="o">=</span> True
<span class="nv">logging_config_class</span> <span class="o">=</span> log_config.LOGGING_CONFIG
<span class="nv">remote_log_conn_id</span> <span class="o">=</span> &lt;name of the Azure Blob Storage connection&gt;
</pre>
</div>
</div>
</div>
</blockquote>
</li>
<li><p class="first">Restart the Airflow webserver and scheduler, and trigger (or wait for) a new task execution.</p>
</li>
<li><p class="first">Verify that logs are showing up for newly executed tasks in the bucket you&#x2019;ve defined.</p>
</li>
</ol>
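<p>As a hedged aid for the verification step, the Wasb hook can also be exercised directly; the connection name, container and blob path below are assumptions rather than values from the original instructions, and the contrib import path is the 1.x location.</p>
<div class="highlight-python notranslate"><div class="highlight"><pre># Sketch: confirm the Azure Blob Storage connection can reach the log container.
from airflow.contrib.hooks.wasb_hook import WasbHook

hook = WasbHook(wasb_conn_id='wasb_logs')
print(hook.check_for_blob('airflow-logs', 'example_dag/example_task/2018-01-01T00:00:00/1.log'))
</pre>
</div>
</div>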
</div>
<div class="section" id="writing-logs-to-google-cloud-storage">
<span id="write-logs-gcp"></span><h2 class="sigil_not_in_toc">Writing Logs to Google Cloud Storage</h2>
<p>Follow the steps below to enable Google Cloud Storage logging.</p>
<ol class="arabic">
<li><p class="first">Airflow&#x2019;s logging system requires a custom .py file to be located in the <code class="docutils literal notranslate"><span class="pre">PYTHONPATH</span></code>, so that it&#x2019;s importable from Airflow. Start by creating a directory to store the config file. <code class="docutils literal notranslate"><span class="pre">$AIRFLOW_HOME/config</span></code> is recommended.</p>
</li>
<li><p class="first">Create empty files called <code class="docutils literal notranslate"><span class="pre">$AIRFLOW_HOME/config/log_config.py</span></code> and <code class="docutils literal notranslate"><span class="pre">$AIRFLOW_HOME/config/__init__.py</span></code>.</p>
</li>
<li><p class="first">Copy the contents of <code class="docutils literal notranslate"><span class="pre">airflow/config_templates/airflow_local_settings.py</span></code> into the <code class="docutils literal notranslate"><span class="pre">log_config.py</span></code> file that was just created in the step above.</p>
</li>
<li><p class="first">Customize the following portions of the template:</p>
<blockquote>
<div><div class="highlight-bash notranslate"><div class="highlight"><pre><span></span><span class="c1"># Add this variable to the top of the file. Note the trailing slash.</span>
<span class="nv">GCS_LOG_FOLDER</span> <span class="o">=</span> <span class="s1">&apos;gs://&lt;bucket where logs should be persisted&gt;/&apos;</span>
<span class="c1"># Rename DEFAULT_LOGGING_CONFIG to LOGGING_CONFIG</span>
<span class="nv">LOGGING_CONFIG</span> <span class="o">=</span> ...
<span class="c1"># Add a GCSTaskHandler to the &apos;handlers&apos; block of the LOGGING_CONFIG variable</span>
<span class="s1">&apos;gcs.task&apos;</span>: <span class="o">{</span>
<span class="s1">&apos;class&apos;</span>: <span class="s1">&apos;airflow.utils.log.gcs_task_handler.GCSTaskHandler&apos;</span>,
<span class="s1">&apos;formatter&apos;</span>: <span class="s1">&apos;airflow.task&apos;</span>,
<span class="s1">&apos;base_log_folder&apos;</span>: os.path.expanduser<span class="o">(</span>BASE_LOG_FOLDER<span class="o">)</span>,
<span class="s1">&apos;gcs_log_folder&apos;</span>: GCS_LOG_FOLDER,
<span class="s1">&apos;filename_template&apos;</span>: FILENAME_TEMPLATE,
<span class="o">}</span>,
<span class="c1"># Update the airflow.task and airflow.task_runner blocks to be &apos;gcs.task&apos; instead of &apos;file.task&apos;.</span>
<span class="s1">&apos;loggers&apos;</span>: <span class="o">{</span>
<span class="s1">&apos;airflow.task&apos;</span>: <span class="o">{</span>
<span class="s1">&apos;handlers&apos;</span>: <span class="o">[</span><span class="s1">&apos;gcs.task&apos;</span><span class="o">]</span>,
...
<span class="o">}</span>,
<span class="s1">&apos;airflow.task_runner&apos;</span>: <span class="o">{</span>
<span class="s1">&apos;handlers&apos;</span>: <span class="o">[</span><span class="s1">&apos;gcs.task&apos;</span><span class="o">]</span>,
...
<span class="o">}</span>,
<span class="s1">&apos;airflow&apos;</span>: <span class="o">{</span>
<span class="s1">&apos;handlers&apos;</span>: <span class="o">[</span><span class="s1">&apos;console&apos;</span><span class="o">]</span>,
...
<span class="o">}</span>,
<span class="o">}</span>
</pre>
</div>
</div>
</div>
</blockquote>
</li>
<li><p class="first">Make sure a Google Cloud Platform connection hook has been defined in Airflow. The hook should have read and write access to the Google Cloud Storage bucket defined above in <code class="docutils literal notranslate"><span class="pre">GCS_LOG_FOLDER</span></code>.</p>
</li>
<li><p class="first">Update <code class="docutils literal notranslate"><span class="pre">$AIRFLOW_HOME/airflow.cfg</span></code> to contain:</p>
<blockquote>
<div><div class="highlight-bash notranslate"><div class="highlight"><pre><span></span><span class="nv">task_log_reader</span> <span class="o">=</span> gcs.task
<span class="nv">logging_config_class</span> <span class="o">=</span> log_config.LOGGING_CONFIG
<span class="nv">remote_log_conn_id</span> <span class="o">=</span> &lt;name of the Google cloud platform hook&gt;
</pre>
</div>
</div>
</div>
</blockquote>
</li>
<li><p class="first">Restart the Airflow webserver and scheduler, and trigger (or wait for) a new task execution.</p>
</li>
<li><p class="first">Verify that logs are showing up for newly executed tasks in the bucket you&#x2019;ve defined.</p>
</li>
<li><p class="first">Verify that the Google Cloud Storage viewer is working in the UI. Pull up a newly executed task, and verify that you see something like:</p>
<blockquote>
<div><div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>*** Reading remote log from gs://&lt;bucket where logs should be persisted&gt;/example_bash_operator/run_this_last/2017-10-03T00:00:00/16.log.
<span class="o">[</span><span class="m">2017</span>-10-03 <span class="m">21</span>:57:50,056<span class="o">]</span> <span class="o">{</span>cli.py:377<span class="o">}</span> INFO - Running on host chrisr-00532
<span class="o">[</span><span class="m">2017</span>-10-03 <span class="m">21</span>:57:50,093<span class="o">]</span> <span class="o">{</span>base_task_runner.py:115<span class="o">}</span> INFO - Running: <span class="o">[</span><span class="s1">&apos;bash&apos;</span>, <span class="s1">&apos;-c&apos;</span>, u<span class="s1">&apos;airflow run example_bash_operator run_this_last 2017-10-03T00:00:00 --job_id 47 --raw -sd DAGS_FOLDER/example_dags/example_bash_operator.py&apos;</span><span class="o">]</span>
<span class="o">[</span><span class="m">2017</span>-10-03 <span class="m">21</span>:57:51,264<span class="o">]</span> <span class="o">{</span>base_task_runner.py:98<span class="o">}</span> INFO - Subtask: <span class="o">[</span><span class="m">2017</span>-10-03 <span class="m">21</span>:57:51,263<span class="o">]</span> <span class="o">{</span>__init__.py:45<span class="o">}</span> INFO - Using executor SequentialExecutor
<span class="o">[</span><span class="m">2017</span>-10-03 <span class="m">21</span>:57:51,306<span class="o">]</span> <span class="o">{</span>base_task_runner.py:98<span class="o">}</span> INFO - Subtask: <span class="o">[</span><span class="m">2017</span>-10-03 <span class="m">21</span>:57:51,306<span class="o">]</span> <span class="o">{</span>models.py:186<span class="o">}</span> INFO - Filling up the DagBag from /airflow/dags/example_dags/example_bash_operator.py
</pre>
</div>
</div>
</div>
</blockquote>
</li>
</ol>
<p>Note the top line that says it&#x2019;s reading from the remote log file.</p>
<p>Please be aware that if you were persisting logs to Google Cloud Storage
using the old-style airflow.cfg configuration method, the old logs will no
longer be visible in the Airflow UI, though they&#x2019;ll still exist in Google
Cloud Storage. This is a backwards-incompatible change. If you are unhappy
with it, you can change the <code class="docutils literal notranslate"><span class="pre">FILENAME_TEMPLATE</span></code> to reflect the old-style
log filename format.</p>
</div>
</body>
</html>
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head><title></title>
<link href="../style/ebook.css" type="text/css" rel="stylesheet">
</head>
<body>
<h1>Scaling Out with Celery</h1>
<p><code class="docutils literal notranslate"><span class="pre">CeleryExecutor</span></code> is one of the ways you can scale out the number of workers. For this
to work, you need to set up a Celery backend (<strong>RabbitMQ</strong>, <strong>Redis</strong>, &#x2026;) and
change your <code class="docutils literal notranslate"><span class="pre">airflow.cfg</span></code> to point the executor parameter to
<code class="docutils literal notranslate"><span class="pre">CeleryExecutor</span></code> and provide the related Celery settings.</p>
<p>For more information about setting up a Celery broker, refer to the
exhaustive <a class="reference external" href="http://docs.celeryproject.org/en/latest/getting-started/brokers/index.html">Celery documentation on the topic</a>.</p>
<p>Here are a few imperative requirements for your workers:</p>
<ul class="simple">
<li><code class="docutils literal notranslate"><span class="pre">airflow</span></code> needs to be installed, and the CLI needs to be in the path</li>
<li>Airflow configuration settings should be homogeneous across the cluster</li>
<li>Operators that are executed on the worker need to have their dependencies
met in that context. For example, if you use the <code class="docutils literal notranslate"><span class="pre">HiveOperator</span></code>,
the hive CLI needs to be installed on that box, or if you use the
<code class="docutils literal notranslate"><span class="pre">MySqlOperator</span></code>, the required Python library needs to be available in
the <code class="docutils literal notranslate"><span class="pre">PYTHONPATH</span></code> somehow</li>
<li>The worker needs to have access to its <code class="docutils literal notranslate"><span class="pre">DAGS_FOLDER</span></code>, and you need to
synchronize the filesystems by your own means. A common setup would be to
store your DAGS_FOLDER in a Git repository and sync it across machines using
Chef, Puppet, Ansible, or whatever you use to configure machines in your
environment. If all your boxes have a common mount point, having your
pipeline files shared there should work as well</li>
</ul>
<p>To kick off a worker, you need to set up Airflow and start the worker
subcommand:</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>airflow worker
</pre>
</div>
</div>
<p>Your worker should start picking up tasks as soon as they get fired in
its direction.</p>
<p>Note that you can also run &#x201C;Celery Flower&#x201D;, a web UI built on top of Celery,
to monitor your workers. You can use the shortcut command <code class="docutils literal notranslate"><span class="pre">airflow</span> <span class="pre">flower</span></code>
to start a Flower web server.</p>
<p>Some caveats:</p>
<ul class="simple">
<li>Make sure to use a database backed result backend</li>
<li>Make sure to set a visibility timeout in [celery_broker_transport_options] that exceeds the ETA of your longest running task</li>
<li>Tasks can consume resources; make sure your worker has enough resources to run <cite>worker_concurrency</cite> tasks</li>
</ul>
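<p>The same settings can also be supplied as environment variables instead of editing <code class="docutils literal notranslate"><span class="pre">airflow.cfg</span></code>. The sketch below is illustrative only: exact option names differ between Airflow versions, and the broker and result-backend URLs are placeholders.</p>
<div class="highlight-python notranslate"><div class="highlight"><pre># Illustrative overrides for a CeleryExecutor deployment (option names assumed).
import os

os.environ['AIRFLOW__CORE__EXECUTOR'] = 'CeleryExecutor'
os.environ['AIRFLOW__CELERY__BROKER_URL'] = 'redis://localhost:6379/0'
# Use a database-backed result backend, as recommended in the caveats above.
os.environ['AIRFLOW__CELERY__RESULT_BACKEND'] = 'db+postgresql://airflow:airflow@localhost/airflow'
</pre>
</div>
</div>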
</body>
</html>
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head><title></title>
<link href="../style/ebook.css" type="text/css" rel="stylesheet">
</head>
<body>
<h1>Scaling Out with Dask</h1>
<p><code class="docutils literal notranslate"><span class="pre">DaskExecutor</span></code> allows you to run Airflow tasks in a Dask Distributed cluster.</p>
<p>Dask clusters can be run on a single machine or on remote networks. For complete
details, consult the <a class="reference external" href="https://distributed.readthedocs.io/">Distributed documentation</a>.</p>
<p>To create a cluster, first start a Scheduler:</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span><span class="c1"># default settings for a local cluster</span>
<span class="nv">DASK_HOST</span><span class="o">=</span><span class="m">127</span>.0.0.1
<span class="nv">DASK_PORT</span><span class="o">=</span><span class="m">8786</span>
dask-scheduler --host <span class="nv">$DASK_HOST</span> --port <span class="nv">$DASK_PORT</span>
</pre>
</div>
</div>
<p>Next start at least one Worker on any machine that can connect to the host:</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>dask-worker <span class="nv">$DASK_HOST</span>:<span class="nv">$DASK_PORT</span>
</pre>
</div>
</div>
<p>Edit your <code class="docutils literal notranslate"><span class="pre">airflow.cfg</span></code> to set your executor to <code class="docutils literal notranslate"><span class="pre">DaskExecutor</span></code> and provide
the Dask Scheduler address in the <code class="docutils literal notranslate"><span class="pre">[dask]</span></code> section.</p>
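<p>A hedged sketch of those two settings expressed as environment variable overrides; the <code class="docutils literal notranslate"><span class="pre">cluster_address</span></code> option name is taken from the 1.x configuration template, and the address matches the local cluster started above.</p>
<div class="highlight-python notranslate"><div class="highlight"><pre># Illustrative overrides for a DaskExecutor deployment.
import os

os.environ['AIRFLOW__CORE__EXECUTOR'] = 'DaskExecutor'
os.environ['AIRFLOW__DASK__CLUSTER_ADDRESS'] = '127.0.0.1:8786'
</pre>
</div>
</div>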
<p>Please note:</p>
<ul class="simple">
<li>Each Dask worker must be able to import Airflow and any dependencies you
require.</li>
<li>Dask does not support queues. If an Airflow task was created with a queue, a
warning will be raised but the task will be submitted to the cluster.</li>
</ul>
</body>
</html>
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head><title></title>
<link href="../style/ebook.css" type="text/css" rel="stylesheet">
</head>
<body>
<h1>Scaling Out with Mesos (community contributed)</h1>
<p>There are two ways you can run airflow as a mesos framework:</p>
<ol class="arabic simple">
<li>Running airflow tasks directly on mesos slaves, requiring each mesos slave to have airflow installed and configured.</li>
<li>Running airflow tasks inside a docker container that has airflow installed, which is run on a mesos slave.</li>
</ol>
<div class="section" id="tasks-executed-directly-on-mesos-slaves">
<h2 class="sigil_not_in_toc">Tasks executed directly on mesos slaves</h2>
<p><code class="docutils literal notranslate"><span class="pre">MesosExecutor</span></code> allows you to schedule airflow tasks on a Mesos cluster.
For this to work, you need a running mesos cluster and you must perform the following
steps -</p>
<ol class="arabic simple">
<li>Install airflow on a mesos slave where web server and scheduler will run,
let&#x2019;s refer to this as the &#x201C;Airflow server&#x201D;.</li>
<li>On the Airflow server, install mesos python eggs from <a class="reference external" href="http://open.mesosphere.com/downloads/mesos/">mesos downloads</a>.</li>
<li>On the Airflow server, use a database (such as mysql) which can be accessed from all mesos
slaves and add configuration in <code class="docutils literal notranslate"><span class="pre">airflow.cfg</span></code>.</li>
<li>Change your <code class="docutils literal notranslate"><span class="pre">airflow.cfg</span></code> to point executor parameter to
<cite>MesosExecutor</cite> and provide related Mesos settings.</li>
<li>On all mesos slaves, install airflow. Copy the <code class="docutils literal notranslate"><span class="pre">airflow.cfg</span></code> from
Airflow server (so that it uses the same SQLAlchemy connection).</li>
<li>On all mesos slaves, run the following for serving logs:</li>
</ol>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>airflow serve_logs
</pre>
</div>
</div>
<ol class="arabic simple" start="7">
<li>On Airflow server, to start processing/scheduling DAGs on mesos, run:</li>
</ol>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>airflow scheduler -p
</pre>
</div>
</div>
<p>Note: the -p parameter is needed to pickle the DAGs.</p>
<p>You can now see the airflow framework and corresponding tasks in mesos UI.
The logs for airflow tasks can be seen in airflow UI as usual.</p>
<p>For more information about mesos, refer to <a class="reference external" href="http://mesos.apache.org/documentation/latest/">mesos documentation</a>.
For any queries/bugs on <cite>MesosExecutor</cite>, please contact <a class="reference external" href="https://github.com/kapil-malik">@kapil-malik</a>.</p>
</div>
<div class="section" id="tasks-executed-in-containers-on-mesos-slaves">
<h2 class="sigil_not_in_toc">Tasks executed in containers on mesos slaves</h2>
<p><a class="reference external" href="https://gist.github.com/sebradloff/f158874e615bda0005c6f4577b20036e">This gist</a> contains all files and configuration changes necessary to achieve the following:</p>
<ol class="arabic simple">
<li>Create a dockerized version of airflow with mesos python eggs installed.</li>
</ol>
<blockquote>
<div>We recommend taking advantage of docker&#x2019;s multi stage builds in order to achieve this. We have one Dockerfile that defines building a specific version of mesos from source (Dockerfile-mesos), in order to create the python eggs. In the airflow Dockerfile (Dockerfile-airflow) we copy the python eggs from the mesos image.</div>
</blockquote>
<ol class="arabic simple" start="2">
<li>Create a mesos configuration block within the <code class="docutils literal notranslate"><span class="pre">airflow.cfg</span></code>.</li>
</ol>
<blockquote>
<div>The configuration block remains the same as the default airflow configuration (default_airflow.cfg), but has the addition of an option <code class="docutils literal notranslate"><span class="pre">docker_image_slave</span></code>. This should be set to the name of the image you would like mesos to use when running airflow tasks. Make sure you have the proper configuration of the DNS record for your mesos master and any sort of authorization if any exists.</div>
</blockquote>
<ol class="arabic simple" start="3">
<li>Change your <code class="docutils literal notranslate"><span class="pre">airflow.cfg</span></code> to point the executor parameter to
<cite>MesosExecutor</cite> (<cite>executor = MesosExecutor</cite>).</li>
<li>Make sure your mesos slave has access to the docker repository you are using for your <code class="docutils literal notranslate"><span class="pre">docker_image_slave</span></code>.</li>
</ol>
<blockquote>
<div><a class="reference external" href="https://mesos.readthedocs.io/en/latest/docker-containerizer/#private-docker-repository">Instructions are available in the mesos docs.</a></div>
</blockquote>
<p>The rest is up to you and how you want to work with a dockerized airflow configuration.</p>
</div>
</body>
</html>
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head><title></title>
<link href="../style/ebook.css" type="text/css" rel="stylesheet">
</head>
<body>
<h1>Running Airflow with systemd</h1>
<p>Airflow can integrate with systemd based systems. This makes watching your
daemons easy as systemd can take care of restarting a daemon on failure.
In the <code class="docutils literal notranslate"><span class="pre">scripts/systemd</span></code> directory you can find unit files that
have been tested on Redhat based systems. You can copy those to
<code class="docutils literal notranslate"><span class="pre">/usr/lib/systemd/system</span></code>. It is assumed that Airflow will run under
<code class="docutils literal notranslate"><span class="pre">airflow:airflow</span></code>. If not (or if you are running on a non Redhat
based system) you probably need to adjust the unit files.</p>
<p>Environment configuration is picked up from <code class="docutils literal notranslate"><span class="pre">/etc/sysconfig/airflow</span></code>.
An example file is supplied. Make sure to specify the <code class="docutils literal notranslate"><span class="pre">SCHEDULER_RUNS</span></code>
variable in this file when you run the scheduler. You
can also define here, for example, <code class="docutils literal notranslate"><span class="pre">AIRFLOW_HOME</span></code> or <code class="docutils literal notranslate"><span class="pre">AIRFLOW_CONFIG</span></code>.</p>
</body>
</html>
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head><title></title>
<link href="../style/ebook.css" type="text/css" rel="stylesheet">
</head>
<body>
<h1>Running Airflow with upstart</h1>
<p>Airflow can integrate with upstart based systems. Upstart automatically starts all airflow services for which you
have a corresponding <code class="docutils literal notranslate"><span class="pre">*.conf</span></code> file in <code class="docutils literal notranslate"><span class="pre">/etc/init</span></code> upon system boot. On failure, upstart automatically restarts
the process (until it reaches re-spawn limit set in a <code class="docutils literal notranslate"><span class="pre">*.conf</span></code> file).</p>
<p>You can find sample upstart job files in the <code class="docutils literal notranslate"><span class="pre">scripts/upstart</span></code> directory. These files have been tested on
Ubuntu 14.04 LTS. You may have to adjust <code class="docutils literal notranslate"><span class="pre">start</span> <span class="pre">on</span></code> and <code class="docutils literal notranslate"><span class="pre">stop</span> <span class="pre">on</span></code> stanzas to make it work on other upstart
systems. Some of the possible options are listed in <code class="docutils literal notranslate"><span class="pre">scripts/upstart/README</span></code>.</p>
<p>Modify <code class="docutils literal notranslate"><span class="pre">*.conf</span></code> files as needed and copy to <code class="docutils literal notranslate"><span class="pre">/etc/init</span></code> directory. It is assumed that airflow will run
under <code class="docutils literal notranslate"><span class="pre">airflow:airflow</span></code>. Change <code class="docutils literal notranslate"><span class="pre">setuid</span></code> and <code class="docutils literal notranslate"><span class="pre">setgid</span></code> in <code class="docutils literal notranslate"><span class="pre">*.conf</span></code> files if you use a different user/group.</p>
<p>You can use <code class="docutils literal notranslate"><span class="pre">initctl</span></code> to manually start, stop, or view the status of the airflow process that has been
integrated with upstart:</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>initctl status airflow-webserver
</pre>
</div>
</div>
</body>
</html>
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head><title></title>
<link href="../style/ebook.css" type="text/css" rel="stylesheet">
</head>
<body>
<h1>Using the Test Mode Configuration</h1>
<p>Airflow has a fixed set of &#x201C;test mode&#x201D; configuration options. You can load these
at any time by calling <code class="docutils literal notranslate"><span class="pre">airflow.configuration.load_test_config()</span></code> (note this
operation is not reversible!). However, some options (like the DAG_FOLDER) are
loaded before you have a chance to call load_test_config(). In order to eagerly load
the test configuration, set unit_test_mode in airflow.cfg:</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span><span class="o">[</span>tests<span class="o">]</span>
<span class="nv">unit_test_mode</span> <span class="o">=</span> True
</pre>
</div>
</div>
<p>Due to Airflow&#x2019;s automatic environment variable expansion (see <a class="reference internal" href="set-config.html"><span class="doc">Setting Configuration Options</span></a>),
you can also set the env var <code class="docutils literal notranslate"><span class="pre">AIRFLOW__CORE__UNIT_TEST_MODE</span></code> to temporarily overwrite
airflow.cfg.</p>
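<p>A short, hedged recap of both routes described above:</p>
<div class="highlight-python notranslate"><div class="highlight"><pre>import os

# Environment variable route (note the double underscores); set it before
# Airflow parses its configuration.
os.environ['AIRFLOW__CORE__UNIT_TEST_MODE'] = 'True'

# In-process route; as noted above, this is not reversible.
from airflow import configuration
configuration.load_test_config()
</pre>
</div>
</div>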
</body>
</html>
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head><title></title>
<link href="../style/ebook.css" type="text/css" rel="stylesheet">
</head>
<body>
<h1>UI / Screenshots</h1>
<p>The Airflow UI makes it easy to monitor and troubleshoot your data pipelines.
Here&#x2019;s a quick overview of some of the features and visualizations you
can find in the Airflow UI.</p>
<div class="section" id="dags-view">
<h2 class="sigil_not_in_toc">DAGs View</h2>
<p>List of the DAGs in your environment, and a set of shortcuts to useful pages.
You can see exactly how many tasks succeeded, failed, or are currently
running at a glance.</p>
<hr class="docutils">
<img alt="https://airflow.apache.org/_images/dags.png" src="../img/31a64f6b60a7f97f88c4b557992d0f14.jpg">
</div>
<hr class="docutils">
<div class="section" id="tree-view">
<h2 class="sigil_not_in_toc">Tree View</h2>
<p>A tree representation of the DAG that spans across time. If a pipeline is
late, you can quickly see where the different steps are and identify
the blocking ones.</p>
<hr class="docutils">
<img alt="https://airflow.apache.org/_images/tree.png" src="../img/ad4ba22a6a3d5668fc19e0461f82e192.jpg">
</div>
<hr class="docutils">
<div class="section" id="graph-view">
<h2 class="sigil_not_in_toc">Graph View</h2>
<p>The graph view is perhaps the most comprehensive. Visualize your DAG&#x2019;s
dependencies and their current status for a specific run.</p>
<hr class="docutils">
<img alt="https://airflow.apache.org/_images/graph.png" src="../img/bc05701b0ed4f5347e26c06452e8fd76.jpg">
</div>
<hr class="docutils">
<div class="section" id="variable-view">
<h2 class="sigil_not_in_toc">Variable View</h2>
<p>The variable view allows you to list, create, edit or delete the key-value pairs
of variables used during jobs. The value of a variable will be hidden if the key contains
any of the words in (&#x2018;password&#x2019;, &#x2018;secret&#x2019;, &#x2018;passwd&#x2019;, &#x2018;authorization&#x2019;, &#x2018;api_key&#x2019;, &#x2018;apikey&#x2019;, &#x2018;access_token&#x2019;)
by default, but it can be configured to show in clear text.</p>
<hr class="docutils">
<img alt="https://airflow.apache.org/_images/variable_hidden.png" src="../img/9bf73cf3f89f830e70f800145ab51b10.jpg">
</div>
<hr class="docutils">
<div class="section" id="gantt-chart">
<h2 class="sigil_not_in_toc">Gantt Chart</h2>
<p>The Gantt chart lets you analyse task duration and overlap. You can quickly
identify bottlenecks and where the bulk of the time is spent for specific
DAG runs.</p>
<hr class="docutils">
<img alt="https://airflow.apache.org/_images/gantt.png" src="../img/cfaa010349b1e40164cabb36c3b7dc1b.jpg">
</div>
<hr class="docutils">
<div class="section" id="task-duration">
<h2 class="sigil_not_in_toc">Task Duration</h2>
<p>The duration of your different tasks over the past N runs. This view lets
you find outliers and quickly understand where the time is spent in your
DAG over many runs.</p>
<hr class="docutils">
<img alt="https://airflow.apache.org/_images/duration.png" src="../img/f0781c3598679db6605d7dfffc65c6a9.jpg">
</div>
<hr class="docutils">
<div class="section" id="code-view">
<h2 class="sigil_not_in_toc">Code View</h2>
<p>Transparency is everything. While the code for your pipeline is in source
control, this is a quick way to get to the code that generates the DAG and
provides yet more context.</p>
<hr class="docutils">
<img alt="https://airflow.apache.org/_images/code.png" src="../img/b732d0bdc5c1a35f3ef34cc2d14b5199.jpg">
</div>
<hr class="docutils">
<div class="section" id="task-instance-context-menu">
<h2 class="sigil_not_in_toc">Task Instance Context Menu</h2>
<p>From the pages seen above (tree view, graph view, gantt, &#x2026;), it is always
possible to click on a task instance, and get to this rich context menu
that can take you to more detailed metadata, and perform some actions.</p>
<hr class="docutils">
<img alt="https://airflow.apache.org/_images/context.png" src="../img/c6288f9767ec25b7660ae86679773f69.jpg">
</div>
</body>
</html>
This diff has been collapsed.
This diff has been collapsed.
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head><title></title>
<link href="../style/ebook.css" type="text/css" rel="stylesheet">
</head>
<body>
<h1>Data Profiling</h1>
<p>Part of being productive with data is having the right weapons to
profile the data you are working with. Airflow provides a simple query
interface to write SQL and get results quickly, and a charting application
letting you visualize data.</p>
<div class="section" id="adhoc-queries">
<h2 class="sigil_not_in_toc">Adhoc Queries</h2>
<p>The adhoc query UI allows for simple SQL interactions with the database
connections registered in Airflow.</p>
<img alt="https://airflow.apache.org/_images/adhoc.png" src="../img/bfbf60f9689630d6aa1f46aeab1e6cf0.jpg">
</div>
<div class="section" id="charts">
<h2 class="sigil_not_in_toc">Charts</h2>
<p>A simple UI built on top of flask-admin and highcharts allows building
data visualizations and charts easily. Fill in a form with a label, SQL,
chart type, pick a source database from your environment&#x2019;s connections,
select a few other options, and save it for later use.</p>
<p>You can even use the same templating and macros available when writing
airflow pipelines, parameterizing your queries and modifying parameters
directly in the URL.</p>
<p>These charts are basic, but they&#x2019;re easy to create, modify and share.</p>
<div class="section" id="chart-screenshot">
<h3 class="sigil_not_in_toc">Chart Screenshot</h3>
<img alt="https://airflow.apache.org/_images/chart.png" src="../img/a7247daabfaa0606cbb0d05e511194db.jpg">
</div>
<hr class="docutils">
<div class="section" id="chart-form-screenshot">
<h3 class="sigil_not_in_toc">Chart Form Screenshot</h3>
<img alt="https://airflow.apache.org/_images/chart_form.png" src="../img/a40de0ada10bc0250de4b6c082cb7660.jpg">
</div>
</div>
</body>
</html>
This diff has been collapsed.
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head><title></title>
<link href="../style/ebook.css" type="text/css" rel="stylesheet">
</head>
<body>
<h1>Scheduling &amp; Triggers</h1>
<p>The Airflow scheduler monitors all tasks and all DAGs, and triggers the
task instances whose dependencies have been met. Behind the scenes,
it monitors and stays in sync with a folder for all DAG objects it may contain,
and periodically (every minute or so) inspects active tasks to see whether
they can be triggered.</p>
<p>The Airflow scheduler is designed to run as a persistent service in an
Airflow production environment. To kick it off, all you need to do is
execute <code class="docutils literal notranslate"><span class="pre">airflow</span> <span class="pre">scheduler</span></code>. It will use the configuration specified in
<code class="docutils literal notranslate"><span class="pre">airflow.cfg</span></code>.</p>
<p>Note that if you run a DAG on a <code class="docutils literal notranslate"><span class="pre">schedule_interval</span></code> of one day,
the run stamped <code class="docutils literal notranslate"><span class="pre">2016-01-01</span></code> will be triggered soon after <code class="docutils literal notranslate"><span class="pre">2016-01-01T23:59</span></code>.
In other words, the job instance is started once the period it covers
has ended.</p>
<p><strong>Let&#x2019;s Repeat That</strong> The scheduler runs your job one <code class="docutils literal notranslate"><span class="pre">schedule_interval</span></code> AFTER the
start date, at the END of the period.</p>
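<p>A worked illustration of that rule (the dates are arbitrary): for a daily schedule, the run stamped <code class="docutils literal notranslate"><span class="pre">2016-01-01</span></code> can only start once its period is over.</p>
<div class="highlight-python notranslate"><div class="highlight"><pre>from datetime import datetime, timedelta

execution_date = datetime(2016, 1, 1)        # the date stamped on the run
schedule_interval = timedelta(days=1)
earliest_start = execution_date + schedule_interval
print(earliest_start)                        # 2016-01-02 00:00:00
</pre>
</div>
</div>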
<p>The scheduler starts an instance of the executor specified in your
<code class="docutils literal notranslate"><span class="pre">airflow.cfg</span></code>. If it happens to be the <code class="docutils literal notranslate"><span class="pre">LocalExecutor</span></code>, tasks will be
executed as subprocesses; in the case of <code class="docutils literal notranslate"><span class="pre">CeleryExecutor</span></code> and
<code class="docutils literal notranslate"><span class="pre">MesosExecutor</span></code>, tasks are executed remotely.</p>
<p>To start a scheduler, simply run the command:</p>
<div class="code bash highlight-default notranslate"><div class="highlight"><pre><span></span><span class="n">airflow</span> <span class="n">scheduler</span>
</pre>
</div>
</div>
<div class="section" id="dag-runs">
<h2 class="sigil_not_in_toc">DAG Runs</h2>
<p>A DAG Run is an object representing an instantiation of the DAG in time.</p>
<p>Each DAG may or may not have a schedule, which informs how <code class="docutils literal notranslate"><span class="pre">DAG</span> <span class="pre">Runs</span></code> are
created. <code class="docutils literal notranslate"><span class="pre">schedule_interval</span></code> is defined as a DAG argument, and preferably receives a
<a class="reference external" href="https://en.wikipedia.org/wiki/Cron#CRON_expression">cron expression</a> as
a <code class="docutils literal notranslate"><span class="pre">str</span></code>, or a <code class="docutils literal notranslate"><span class="pre">datetime.timedelta</span></code> object. Alternatively, you can also
use one of these cron &#x201C;presets&#x201D;:</p>
<table border="1" class="docutils">
<colgroup>
<col width="15%">
<col width="69%">
<col width="16%">
</colgroup>
<thead valign="bottom">
<tr class="row-odd"><th class="head">preset</th>
<th class="head">meaning</th>
<th class="head">cron</th>
</tr>
</thead>
<tbody valign="top">
<tr class="row-even"><td><code class="docutils literal notranslate"><span class="pre">None</span></code></td>
<td>Don&#x2019;t schedule, use for exclusively &#x201C;externally triggered&#x201D;
DAGs</td>
<td>&#xA0;</td>
</tr>
<tr class="row-odd"><td><code class="docutils literal notranslate"><span class="pre">@once</span></code></td>
<td>Schedule once and only once</td>
<td>&#xA0;</td>
</tr>
<tr class="row-even"><td><code class="docutils literal notranslate"><span class="pre">@hourly</span></code></td>
<td>Run once an hour at the beginning of the hour</td>
<td><code class="docutils literal notranslate"><span class="pre">0</span> <span class="pre">*</span> <span class="pre">*</span> <span class="pre">*</span> <span class="pre">*</span></code></td>
</tr>
<tr class="row-odd"><td><code class="docutils literal notranslate"><span class="pre">@daily</span></code></td>
<td>Run once a day at midnight</td>
<td><code class="docutils literal notranslate"><span class="pre">0</span> <span class="pre">0</span> <span class="pre">*</span> <span class="pre">*</span> <span class="pre">*</span></code></td>
</tr>
<tr class="row-even"><td><code class="docutils literal notranslate"><span class="pre">@weekly</span></code></td>
<td>Run once a week at midnight on Sunday morning</td>
<td><code class="docutils literal notranslate"><span class="pre">0</span> <span class="pre">0</span> <span class="pre">*</span> <span class="pre">*</span> <span class="pre">0</span></code></td>
</tr>
<tr class="row-odd"><td><code class="docutils literal notranslate"><span class="pre">@monthly</span></code></td>
<td>Run once a month at midnight of the first day of the month</td>
<td><code class="docutils literal notranslate"><span class="pre">0</span> <span class="pre">0</span> <span class="pre">1</span> <span class="pre">*</span> <span class="pre">*</span></code></td>
</tr>
<tr class="row-even"><td><code class="docutils literal notranslate"><span class="pre">@yearly</span></code></td>
<td>Run once a year at midnight of January 1</td>
<td><code class="docutils literal notranslate"><span class="pre">0</span> <span class="pre">0</span> <span class="pre">1</span> <span class="pre">1</span> <span class="pre">*</span></code></td>
</tr>
</tbody>
</table>
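<p>For illustration, here is a short sketch (the DAG ids are hypothetical) of the accepted forms of
<code class="docutils literal notranslate"><span class="pre">schedule_interval</span></code>: a cron expression, a cron preset from the table above, or a
<code class="docutils literal notranslate"><span class="pre">datetime.timedelta</span></code>:</p>
<div class="code python highlight-default notranslate"><div class="highlight"><pre>from datetime import datetime, timedelta

from airflow import DAG

start = datetime(2016, 1, 1)

# 1. A cron expression passed as a str
dag_cron = DAG('cron_example', start_date=start, schedule_interval='0 0 * * *')

# 2. A cron preset (see the table above)
dag_preset = DAG('preset_example', start_date=start, schedule_interval='@daily')

# 3. A datetime.timedelta object
dag_delta = DAG('timedelta_example', start_date=start, schedule_interval=timedelta(days=1))
</pre>
</div>
</div>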
<p>Your DAG will be instantiated once per schedule, and a <code class="docutils literal notranslate"><span class="pre">DAG</span> <span class="pre">Run</span></code> entry will be created for each schedule.</p>
<p>DAG runs have a state associated with them (running, failed, success) that
informs the scheduler which set of schedules should be evaluated for
task submissions. Without this metadata at the DAG run level, the Airflow
scheduler would have much more work to do to figure out which tasks
should be triggered, and would slow to a crawl. It might also cause undesired
processing when you change the shape of your DAG, by, say, adding new
tasks.</p>
</div>
<div class="section" id="backfill-and-catchup">
<h2 class="sigil_not_in_toc">Backfill and Catchup</h2>
<p>An Airflow DAG with a <code class="docutils literal notranslate"><span class="pre">start_date</span></code>, possibly an <code class="docutils literal notranslate"><span class="pre">end_date</span></code>, and a <code class="docutils literal notranslate"><span class="pre">schedule_interval</span></code> defines a
series of intervals which the scheduler turns into individual DAG Runs and executes. A key capability of
Airflow is that these DAG Runs are atomic, idempotent items, and the scheduler, by default, will examine
the lifetime of the DAG (from start to end/now, one interval at a time) and kick off a DAG Run for any
interval that has not been run (or has been cleared). This concept is called Catchup.</p>
<p>If your DAG is written to handle its own catchup (i.e. not limited to the interval, but instead to &#x201C;now&#x201D;,
for instance), then you will want to turn catchup off, either on the DAG itself with <code class="docutils literal notranslate"><span class="pre">dag.catchup</span> <span class="pre">=</span>
<span class="pre">False</span></code>, or by default at the configuration file level with <code class="docutils literal notranslate"><span class="pre">catchup_by_default</span> <span class="pre">=</span> <span class="pre">False</span></code>. This
instructs the scheduler to create a DAG Run only for the most current instance of the DAG
interval series.</p>
<div class="code python highlight-default notranslate"><div class="highlight"><pre><span></span><span class="sd">&quot;&quot;&quot;</span>
<span class="sd">Code that goes along with the Airflow tutorial located at:</span>
<span class="sd">https://github.com/airbnb/airflow/blob/master/airflow/example_dags/tutorial.py</span>
<span class="sd">&quot;&quot;&quot;</span>
<span class="kn">from</span> <span class="nn">airflow</span> <span class="k">import</span> <span class="n">DAG</span>
<span class="kn">from</span> <span class="nn">airflow.operators.bash_operator</span> <span class="k">import</span> <span class="n">BashOperator</span>
<span class="kn">from</span> <span class="nn">datetime</span> <span class="k">import</span> <span class="n">datetime</span><span class="p">,</span> <span class="n">timedelta</span>
<span class="n">default_args</span> <span class="o">=</span> <span class="p">{</span>
<span class="s1">&apos;owner&apos;</span><span class="p">:</span> <span class="s1">&apos;airflow&apos;</span><span class="p">,</span>
<span class="s1">&apos;depends_on_past&apos;</span><span class="p">:</span> <span class="kc">False</span><span class="p">,</span>
<span class="s1">&apos;start_date&apos;</span><span class="p">:</span> <span class="n">datetime</span><span class="p">(</span><span class="mi">2015</span><span class="p">,</span> <span class="mi">12</span><span class="p">,</span> <span class="mi">1</span><span class="p">),</span>
<span class="s1">&apos;email&apos;</span><span class="p">:</span> <span class="p">[</span><span class="s1">&apos;airflow@example.com&apos;</span><span class="p">],</span>
<span class="s1">&apos;email_on_failure&apos;</span><span class="p">:</span> <span class="kc">False</span><span class="p">,</span>
<span class="s1">&apos;email_on_retry&apos;</span><span class="p">:</span> <span class="kc">False</span><span class="p">,</span>
<span class="s1">&apos;retries&apos;</span><span class="p">:</span> <span class="mi">1</span><span class="p">,</span>
<span class="s1">&apos;retry_delay&apos;</span><span class="p">:</span> <span class="n">timedelta</span><span class="p">(</span><span class="n">minutes</span><span class="o">=</span><span class="mi">5</span><span class="p">),</span>
<span class="s1">&apos;schedule_interval&apos;</span><span class="p">:</span> <span class="s1">&apos;@hourly&apos;</span><span class="p">,</span>
<span class="p">}</span>
<span class="n">dag</span> <span class="o">=</span> <span class="n">DAG</span><span class="p">(</span><span class="s1">&apos;tutorial&apos;</span><span class="p">,</span> <span class="n">catchup</span><span class="o">=</span><span class="kc">False</span><span class="p">,</span> <span class="n">default_args</span><span class="o">=</span><span class="n">default_args</span><span class="p">)</span>
</pre>
</div>
</div>
<p>In the example above, if the DAG is picked up by the scheduler daemon on 2016-01-02 at 6 AM (or from the
command line), a single DAG Run will be created, with an <code class="docutils literal notranslate"><span class="pre">execution_date</span></code> of 2016-01-01, and the next
one will be created just after midnight on the morning of 2016-01-03 with an execution date of 2016-01-02.</p>
<p>If the <code class="docutils literal notranslate"><span class="pre">dag.catchup</span></code> value had been True instead, the scheduler would have created a DAG Run for each
completed interval between 2015-12-01 and 2016-01-02 (but not yet one for 2016-01-02, as that interval
hasn&#x2019;t completed) and would have executed them sequentially. This behavior is great for atomic
datasets that can easily be split into periods. Turning catchup off is great if your DAG Runs perform
backfill internally.</p>
</div>
<div class="section" id="external-triggers">
<h2 class="sigil_not_in_toc">External Triggers</h2>
<p>Note that <code class="docutils literal notranslate"><span class="pre">DAG</span> <span class="pre">Runs</span></code> can also be created manually through the CLI by
running an <code class="docutils literal notranslate"><span class="pre">airflow</span> <span class="pre">trigger_dag</span></code> command, where you can define a
specific <code class="docutils literal notranslate"><span class="pre">run_id</span></code>. The <code class="docutils literal notranslate"><span class="pre">DAG</span> <span class="pre">Runs</span></code> created externally to the
scheduler get associated with the trigger&#x2019;s timestamp, and will be displayed
in the UI alongside scheduled <code class="docutils literal notranslate"><span class="pre">DAG</span> <span class="pre">runs</span></code>.</p>
<p>In addition, you can also manually trigger a <code class="docutils literal notranslate"><span class="pre">DAG</span> <span class="pre">Run</span></code> using the web UI (tab &#x201C;DAGs&#x201D; -&gt; column &#x201C;Links&#x201D; -&gt; button &#x201C;Trigger Dag&#x201D;).</p>
</div>
<div class="section" id="to-keep-in-mind">
<h2 class="sigil_not_in_toc">To Keep in Mind</h2>
<ul class="simple">
<li>The first <code class="docutils literal notranslate"><span class="pre">DAG</span> <span class="pre">Run</span></code> is created based on the minimum <code class="docutils literal notranslate"><span class="pre">start_date</span></code> for the
tasks in your DAG.</li>
<li>Subsequent <code class="docutils literal notranslate"><span class="pre">DAG</span> <span class="pre">Runs</span></code> are created by the scheduler process, based on
your DAG&#x2019;s <code class="docutils literal notranslate"><span class="pre">schedule_interval</span></code>, sequentially.</li>
<li>When clearing a set of tasks&#x2019; state in hope of getting them to re-run,
it is important to keep in mind the <code class="docutils literal notranslate"><span class="pre">DAG</span> <span class="pre">Run</span></code>&#x2019;s state too as it defines
whether the scheduler should look into triggering tasks for that run.</li>
</ul>
<p>Here are some of the ways you can <strong>unblock tasks</strong>:</p>
<ul class="simple">
<li>From the UI, you can <strong>clear</strong> (as in delete the status of) individual task instances
from the task instances dialog, while defining whether you want to include the past/future
and the upstream/downstream dependencies. Note that a confirmation window comes next and
allows you to see the set you are about to clear. You can also clear all task instances
associated with the DAG.</li>
<li>The CLI command <code class="docutils literal notranslate"><span class="pre">airflow</span> <span class="pre">clear</span> <span class="pre">-h</span></code> has lots of options when it comes to clearing task instance
states, including specifying date ranges, targeting task_ids by specifying a regular expression,
flags for including upstream and downstream relatives, and targeting task instances in specific
states (<code class="docutils literal notranslate"><span class="pre">failed</span></code> or <code class="docutils literal notranslate"><span class="pre">success</span></code>).</li>
<li>Clearing a task instance will no longer delete the task instance record. Instead it updates
max_tries and sets the current task instance state to None.</li>
<li>Marking task instances as failed can be done through the UI. This can be used to stop running task instances.</li>
<li>Marking task instances as successful can be done through the UI. This is mostly to fix false negatives,
or for instance when the fix has been applied outside of Airflow.</li>
<li>The <code class="docutils literal notranslate"><span class="pre">airflow</span> <span class="pre">backfill</span></code> CLI subcommand has a <code class="docutils literal notranslate"><span class="pre">--mark_success</span></code> flag and allows selecting
subsections of the DAG as well as specifying date ranges.</li>
</ul>
</div>
</body>
</html>
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head><title></title>
<link href="../style/ebook.css" type="text/css" rel="stylesheet">
</head>
<body>
<h1>Plugins</h1>
<p>Airflow has a simple built-in plugin manager that can integrate external
features into its core by simply dropping files into your
<code class="docutils literal notranslate"><span class="pre">$AIRFLOW_HOME/plugins</span></code> folder.</p>
<p>The python modules in the <code class="docutils literal notranslate"><span class="pre">plugins</span></code> folder get imported,
and <strong>hooks</strong>, <strong>operators</strong>, <strong>sensors</strong>, <strong>macros</strong>, <strong>executors</strong> and web <strong>views</strong>
get integrated into Airflow&#x2019;s main collections and become available for use.</p>
<div class="section" id="what-for">
<h2 class="sigil_not_in_toc">What for?</h2>
<p>Airflow offers a generic toolbox for working with data. Different
organizations have different stacks and different needs. Using Airflow
plugins can be a way for companies to customize their Airflow installation
to reflect their ecosystem.</p>
<p>Plugins can be used as an easy way to write, share and activate new sets of
features.</p>
<p>There&#x2019;s also a need for a set of more complex applications to interact with
different flavors of data and metadata.</p>
<p>Examples:</p>
<ul class="simple">
<li>A set of tools to parse Hive logs and expose Hive metadata (CPU /IO / phases/ skew /&#x2026;)</li>
<li>An anomaly detection framework, allowing people to collect metrics, set thresholds and alerts</li>
<li>An auditing tool, helping understand who accesses what</li>
<li>A config-driven SLA monitoring tool, allowing you to set monitored tables and at what time
they should land, alert people, and expose visualizations of outages</li>
<li>&#x2026;</li>
</ul>
</div>
<div class="section" id="why-build-on-top-of-airflow">
<h2 class="sigil_not_in_toc">Why build on top of Airflow?</h2>
<p>Airflow has many components that can be reused when building an application:</p>
<ul class="simple">
<li>A web server you can use to render your views</li>
<li>A metadata database to store your models</li>
<li>Access to your databases, and knowledge of how to connect to them</li>
<li>An array of workers that your application can push workload to</li>
<li>Airflow is already deployed, so you can just piggyback on its deployment logistics</li>
<li>Basic charting capabilities, underlying libraries and abstractions</li>
</ul>
</div>
<div class="section" id="interface">
<h2 class="sigil_not_in_toc">Interface</h2>
<p>To create a plugin you will need to derive the
<code class="docutils literal notranslate"><span class="pre">airflow.plugins_manager.AirflowPlugin</span></code> class and reference the objects
you want to plug into Airflow. Here&#x2019;s what the class you need to derive
looks like:</p>
<div class="code python highlight-default notranslate"><div class="highlight"><pre><span></span><span class="k">class</span> <span class="nc">AirflowPlugin</span><span class="p">(</span><span class="nb">object</span><span class="p">):</span>
<span class="c1"># The name of your plugin (str)</span>
<span class="n">name</span> <span class="o">=</span> <span class="kc">None</span>
<span class="c1"># A list of class(es) derived from BaseOperator</span>
<span class="n">operators</span> <span class="o">=</span> <span class="p">[]</span>
<span class="c1"># A list of class(es) derived from BaseSensorOperator</span>
<span class="n">sensors</span> <span class="o">=</span> <span class="p">[]</span>
<span class="c1"># A list of class(es) derived from BaseHook</span>
<span class="n">hooks</span> <span class="o">=</span> <span class="p">[]</span>
<span class="c1"># A list of class(es) derived from BaseExecutor</span>
<span class="n">executors</span> <span class="o">=</span> <span class="p">[]</span>
<span class="c1"># A list of references to inject into the macros namespace</span>
<span class="n">macros</span> <span class="o">=</span> <span class="p">[]</span>
<span class="c1"># A list of objects created from a class derived</span>
<span class="c1"># from flask_admin.BaseView</span>
<span class="n">admin_views</span> <span class="o">=</span> <span class="p">[]</span>
<span class="c1"># A list of Blueprint object created from flask.Blueprint</span>
<span class="n">flask_blueprints</span> <span class="o">=</span> <span class="p">[]</span>
<span class="c1"># A list of menu links (flask_admin.base.MenuLink)</span>
<span class="n">menu_links</span> <span class="o">=</span> <span class="p">[]</span>
</pre>
</div>
</div>
<p>You can derive it by inheritance (please refer to the example below).
Please note that <code class="docutils literal notranslate"><span class="pre">name</span></code> inside this class must be specified.</p>
<p>After the plugin is imported into Airflow,
you can invoke it using a statement like</p>
<div class="code python highlight-default notranslate"><div class="highlight"><pre><span></span><span class="kn">from</span> <span class="nn">airflow.</span><span class="p">{</span><span class="nb">type</span><span class="p">,</span> <span class="n">like</span> <span class="s2">&quot;operators&quot;</span><span class="p">,</span> <span class="s2">&quot;sensors&quot;</span><span class="p">}</span><span class="o">.</span><span class="p">{</span><span class="n">name</span> <span class="n">specificed</span> <span class="n">inside</span> <span class="n">the</span> <span class="n">plugin</span> <span class="n">class</span><span class="p">}</span> <span class="kn">import</span> <span class="o">*</span>
</pre>
</div>
</div>
<p>When you write your own plugins, make sure you understand them well.
There are some essential properties for each type of plugin.
For example,</p>
<ul class="simple">
<li>For an <code class="docutils literal notranslate"><span class="pre">Operator</span></code> plugin, an <code class="docutils literal notranslate"><span class="pre">execute</span></code> method is compulsory.</li>
<li>For a <code class="docutils literal notranslate"><span class="pre">Sensor</span></code> plugin, a <code class="docutils literal notranslate"><span class="pre">poke</span></code> method returning a Boolean value is compulsory (see the sketch after this list).</li>
</ul>
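<p>As a minimal sketch (the class and plugin names below are hypothetical), an operator plugin implements
<code class="docutils literal notranslate"><span class="pre">execute</span></code> and a sensor plugin implements <code class="docutils literal notranslate"><span class="pre">poke</span></code>:</p>
<div class="code python highlight-default notranslate"><div class="highlight"><pre>from airflow.models import BaseOperator
from airflow.plugins_manager import AirflowPlugin
from airflow.sensors.base_sensor_operator import BaseSensorOperator


class HelloOperator(BaseOperator):
    # Operators must implement execute()
    def execute(self, context):
        self.log.info("hello from a plugin operator")


class AlwaysTrueSensor(BaseSensorOperator):
    # Sensors must implement poke(), returning a boolean
    def poke(self, context):
        return True  # succeed immediately in this sketch


class MinimalPlugin(AirflowPlugin):
    name = "minimal_plugin"  # required
    operators = [HelloOperator]
    sensors = [AlwaysTrueSensor]
</pre>
</div>
</div>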
</div>
<div class="section" id="example">
<h2 class="sigil_not_in_toc">Example</h2>
<p>The code below defines a plugin that injects a set of dummy object
definitions in Airflow.</p>
<div class="code python highlight-default notranslate"><div class="highlight"><pre><span></span><span class="c1"># This is the class you derive to create a plugin</span>
<span class="kn">from</span> <span class="nn">airflow.plugins_manager</span> <span class="k">import</span> <span class="n">AirflowPlugin</span>
<span class="kn">from</span> <span class="nn">flask</span> <span class="k">import</span> <span class="n">Blueprint</span>
<span class="kn">from</span> <span class="nn">flask_admin</span> <span class="k">import</span> <span class="n">BaseView</span><span class="p">,</span> <span class="n">expose</span>
<span class="kn">from</span> <span class="nn">flask_admin.base</span> <span class="k">import</span> <span class="n">MenuLink</span>
<span class="c1"># Importing base classes that we need to derive</span>
<span class="kn">from</span> <span class="nn">airflow.hooks.base_hook</span> <span class="k">import</span> <span class="n">BaseHook</span>
<span class="kn">from</span> <span class="nn">airflow.models</span> <span class="k">import</span> <span class="n">BaseOperator</span>
<span class="kn">from</span> <span class="nn">airflow.sensors.base_sensor_operator</span> <span class="k">import</span> <span class="n">BaseSensorOperator</span>
<span class="kn">from</span> <span class="nn">airflow.executors.base_executor</span> <span class="k">import</span> <span class="n">BaseExecutor</span>
<span class="c1"># Will show up under airflow.hooks.test_plugin.PluginHook</span>
<span class="k">class</span> <span class="nc">PluginHook</span><span class="p">(</span><span class="n">BaseHook</span><span class="p">):</span>
<span class="k">pass</span>
<span class="c1"># Will show up under airflow.operators.test_plugin.PluginOperator</span>
<span class="k">class</span> <span class="nc">PluginOperator</span><span class="p">(</span><span class="n">BaseOperator</span><span class="p">):</span>
<span class="k">pass</span>
<span class="c1"># Will show up under airflow.sensors.test_plugin.PluginSensorOperator</span>
<span class="k">class</span> <span class="nc">PluginSensorOperator</span><span class="p">(</span><span class="n">BaseSensorOperator</span><span class="p">):</span>
<span class="k">pass</span>
<span class="c1"># Will show up under airflow.executors.test_plugin.PluginExecutor</span>
<span class="k">class</span> <span class="nc">PluginExecutor</span><span class="p">(</span><span class="n">BaseExecutor</span><span class="p">):</span>
<span class="k">pass</span>
<span class="c1"># Will show up under airflow.macros.test_plugin.plugin_macro</span>
<span class="k">def</span> <span class="nf">plugin_macro</span><span class="p">():</span>
<span class="k">pass</span>
<span class="c1"># Creating a flask admin BaseView</span>
<span class="k">class</span> <span class="nc">TestView</span><span class="p">(</span><span class="n">BaseView</span><span class="p">):</span>
<span class="nd">@expose</span><span class="p">(</span><span class="s1">&apos;/&apos;</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">test</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="c1"># in this example, put your test_plugin/test.html template at airflow/plugins/templates/test_plugin/test.html</span>
<span class="k">return</span> <span class="bp">self</span><span class="o">.</span><span class="n">render</span><span class="p">(</span><span class="s2">&quot;test_plugin/test.html&quot;</span><span class="p">,</span> <span class="n">content</span><span class="o">=</span><span class="s2">&quot;Hello galaxy!&quot;</span><span class="p">)</span>
<span class="n">v</span> <span class="o">=</span> <span class="n">TestView</span><span class="p">(</span><span class="n">category</span><span class="o">=</span><span class="s2">&quot;Test Plugin&quot;</span><span class="p">,</span> <span class="n">name</span><span class="o">=</span><span class="s2">&quot;Test View&quot;</span><span class="p">)</span>
<span class="c1"># Creating a flask blueprint to integrate the templates and static folder</span>
<span class="n">bp</span> <span class="o">=</span> <span class="n">Blueprint</span><span class="p">(</span>
<span class="s2">&quot;test_plugin&quot;</span><span class="p">,</span> <span class="vm">__name__</span><span class="p">,</span>
<span class="n">template_folder</span><span class="o">=</span><span class="s1">&apos;templates&apos;</span><span class="p">,</span> <span class="c1"># registers airflow/plugins/templates as a Jinja template folder</span>
<span class="n">static_folder</span><span class="o">=</span><span class="s1">&apos;static&apos;</span><span class="p">,</span>
<span class="n">static_url_path</span><span class="o">=</span><span class="s1">&apos;/static/test_plugin&apos;</span><span class="p">)</span>
<span class="n">ml</span> <span class="o">=</span> <span class="n">MenuLink</span><span class="p">(</span>
<span class="n">category</span><span class="o">=</span><span class="s1">&apos;Test Plugin&apos;</span><span class="p">,</span>
<span class="n">name</span><span class="o">=</span><span class="s1">&apos;Test Menu Link&apos;</span><span class="p">,</span>
<span class="n">url</span><span class="o">=</span><span class="s1">&apos;https://airflow.incubator.apache.org/&apos;</span><span class="p">)</span>
<span class="c1"># Defining the plugin class</span>
<span class="k">class</span> <span class="nc">AirflowTestPlugin</span><span class="p">(</span><span class="n">AirflowPlugin</span><span class="p">):</span>
<span class="n">name</span> <span class="o">=</span> <span class="s2">&quot;test_plugin&quot;</span>
<span class="n">operators</span> <span class="o">=</span> <span class="p">[</span><span class="n">PluginOperator</span><span class="p">]</span>
<span class="n">sensors</span> <span class="o">=</span> <span class="p">[</span><span class="n">PluginSensorOperator</span><span class="p">]</span>
<span class="n">hooks</span> <span class="o">=</span> <span class="p">[</span><span class="n">PluginHook</span><span class="p">]</span>
<span class="n">executors</span> <span class="o">=</span> <span class="p">[</span><span class="n">PluginExecutor</span><span class="p">]</span>
<span class="n">macros</span> <span class="o">=</span> <span class="p">[</span><span class="n">plugin_macro</span><span class="p">]</span>
<span class="n">admin_views</span> <span class="o">=</span> <span class="p">[</span><span class="n">v</span><span class="p">]</span>
<span class="n">flask_blueprints</span> <span class="o">=</span> <span class="p">[</span><span class="n">bp</span><span class="p">]</span>
<span class="n">menu_links</span> <span class="o">=</span> <span class="p">[</span><span class="n">ml</span><span class="p">]</span>
</pre>
</div>
</div>
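<p>Assuming the <code class="docutils literal notranslate"><span class="pre">test_plugin</span></code> example above has been dropped into
<code class="docutils literal notranslate"><span class="pre">$AIRFLOW_HOME/plugins</span></code>, the injected objects can then be imported as indicated by the
comments in the example:</p>
<div class="code python highlight-default notranslate"><div class="highlight"><pre># Usage sketch: import the objects injected by the test_plugin example above
from airflow.operators.test_plugin import PluginOperator
from airflow.sensors.test_plugin import PluginSensorOperator
from airflow.hooks.test_plugin import PluginHook
from airflow.macros.test_plugin import plugin_macro
</pre>
</div>
</div>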
</div>
</body>
</html>
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head><title></title>
<link href="../style/ebook.css" type="text/css" rel="stylesheet">
</head>
<body>
<h1>Time zones</h1>
<p>Support for time zones is enabled by default. Airflow stores datetime information in UTC internally and in the database.
It allows you to run your DAGs with time zone dependent schedules. At the moment Airflow does not convert them to the
end user&#x2019;s time zone in the user interface; there they will always be displayed in UTC. Templates used in Operators
are not converted either. Time zone information is exposed and it is up to the writer of the DAG to decide what to do with it.</p>
<p>This is handy if your users live in more than one time zone and you want to display datetime information according to
each user&#x2019;s wall clock.</p>
<p>Even if you are running Airflow in only one time zone, it is still good practice to store data in UTC in your database
(before Airflow became time zone aware this was also the recommended or even required setup). The main reason is
Daylight Saving Time (DST). Many countries have a system of DST, where clocks are moved forward in spring and backward
in autumn. If you&#x2019;re working in local time, you&#x2019;re likely to encounter errors twice a year, when the transitions
happen. (The pendulum and pytz documentation discusses these issues in greater detail.) This probably doesn&#x2019;t matter
for a simple DAG, but it&#x2019;s a problem if you are in, for example, financial services where you have end of day
deadlines to meet.</p>
<p>The time zone is set in <cite>airflow.cfg</cite>. By default it is set to utc, but you can change it to use the system&#x2019;s settings or
an arbitrary IANA time zone, e.g. <cite>Europe/Amsterdam</cite>. It is dependent on <cite>pendulum</cite>, which is more accurate than <cite>pytz</cite>.
Pendulum is installed when you install Airflow.</p>
<p>Please note that the Web UI currently only runs in UTC.</p>
<div class="section" id="concepts">
<h2 class="sigil_not_in_toc">Concepts</h2>
<div class="section" id="naive-and-aware-datetime-objects">
<h3 class="sigil_not_in_toc">Na&#xEF;ve and aware datetime objects</h3>
<p>Python&#x2019;s datetime.datetime objects have a tzinfo attribute that can be used to store time zone information,
represented as an instance of a subclass of datetime.tzinfo. When this attribute is set and describes an offset,
a datetime object is aware. Otherwise, it&#x2019;s naive.</p>
<p>You can use timezone.is_aware() and timezone.is_naive() to determine whether datetimes are aware or naive.</p>
<p>Because Airflow uses time-zone-aware datetime objects, if your code creates datetime objects they need to be aware too.</p>
<div class="code python highlight-default notranslate"><div class="highlight"><pre><span></span><span class="kn">from</span> <span class="nn">airflow.utils</span> <span class="k">import</span> <span class="n">timezone</span>
<span class="n">now</span> <span class="o">=</span> <span class="n">timezone</span><span class="o">.</span><span class="n">utcnow</span><span class="p">()</span>
<span class="n">a_date</span> <span class="o">=</span> <span class="n">timezone</span><span class="o">.</span><span class="n">datetime</span><span class="p">(</span><span class="mi">2017</span><span class="p">,</span><span class="mi">1</span><span class="p">,</span><span class="mi">1</span><span class="p">)</span>
</pre>
</div>
</div>
</div>
<div class="section" id="interpretation-of-naive-datetime-objects">
<h3 class="sigil_not_in_toc">Interpretation of naive datetime objects</h3>
<p>Although Airflow operates fully time zone aware, it still accepts naive date time objects for <cite>start_dates</cite>
and <cite>end_dates</cite> in your DAG definitions. This is mostly in order to preserve backwards compatibility. In
case a naive <cite>start_date</cite> or <cite>end_date</cite> is encountered the default time zone is applied. It is applied
in such a way that it is assumed that the naive date time is already in the default time zone. In other
words if you have a default time zone setting of <cite>Europe/Amsterdam</cite> and create a naive datetime <cite>start_date</cite> of
<cite>datetime(2017,1,1)</cite> it is assumed to be a <cite>start_date</cite> of Jan 1, 2017 Amsterdam time.</p>
<div class="code python highlight-default notranslate"><div class="highlight"><pre><span></span><span class="n">default_args</span><span class="o">=</span><span class="nb">dict</span><span class="p">(</span>
<span class="n">start_date</span><span class="o">=</span><span class="n">datetime</span><span class="p">(</span><span class="mi">2016</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">),</span>
<span class="n">owner</span><span class="o">=</span><span class="s1">&apos;Airflow&apos;</span>
<span class="p">)</span>
<span class="n">dag</span> <span class="o">=</span> <span class="n">DAG</span><span class="p">(</span><span class="s1">&apos;my_dag&apos;</span><span class="p">,</span> <span class="n">default_args</span><span class="o">=</span><span class="n">default_args</span><span class="p">)</span>
<span class="n">op</span> <span class="o">=</span> <span class="n">DummyOperator</span><span class="p">(</span><span class="n">task_id</span><span class="o">=</span><span class="s1">&apos;dummy&apos;</span><span class="p">,</span> <span class="n">dag</span><span class="o">=</span><span class="n">dag</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="n">op</span><span class="o">.</span><span class="n">owner</span><span class="p">)</span> <span class="c1"># Airflow</span>
</pre>
</div>
</div>
<p>Unfortunately, during DST transitions, some datetimes don&#x2019;t exist or are ambiguous.
In such situations, pendulum raises an exception. That&#x2019;s why you should always create aware
datetime objects when time zone support is enabled.</p>
<p>In practice, this is rarely an issue. Airflow gives you aware datetime objects in the models and DAGs, and most often,
new datetime objects are created from existing ones through timedelta arithmetic. The only datetime that&#x2019;s often
created in application code is the current time, and timezone.utcnow() automatically does the right thing.</p>
</div>
<div class="section" id="default-time-zone">
<h3 class="sigil_not_in_toc">Default time zone</h3>
<p>The default time zone is the time zone defined by the <cite>default_timezone</cite> setting under <cite>[core]</cite>. If
you just installed Airflow it will be set to <cite>utc</cite>, which is recommended. You can also set it to
<cite>system</cite> or an IANA time zone (e.g. <cite>Europe/Amsterdam</cite>). DAGs are also evaluated on Airflow workers,
so it is important to make sure this setting is the same on all Airflow nodes.</p>
<div class="code python highlight-default notranslate"><div class="highlight"><pre><span></span><span class="p">[</span><span class="n">core</span><span class="p">]</span>
<span class="n">default_timezone</span> <span class="o">=</span> <span class="n">utc</span>
</pre>
</div>
</div>
</div>
</div>
<div class="section" id="time-zone-aware-dags">
<h2 class="sigil_not_in_toc">Time zone aware DAGs</h2>
<p>Creating a time zone aware DAG is quite simple. Just make sure to supply a time zone aware <cite>start_date</cite>. It is
recommended to use <cite>pendulum</cite> for this, but <cite>pytz</cite> (to be installed manually) can also be used.</p>
<div class="code python highlight-default notranslate"><div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">pendulum</span>
<span class="n">local_tz</span> <span class="o">=</span> <span class="n">pendulum</span><span class="o">.</span><span class="n">timezone</span><span class="p">(</span><span class="s2">&quot;Europe/Amsterdam&quot;</span><span class="p">)</span>
<span class="n">default_args</span><span class="o">=</span><span class="nb">dict</span><span class="p">(</span>
<span class="n">start_date</span><span class="o">=</span><span class="n">datetime</span><span class="p">(</span><span class="mi">2016</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="n">tzinfo</span><span class="o">=</span><span class="n">local_tz</span><span class="p">),</span>
<span class="n">owner</span><span class="o">=</span><span class="s1">&apos;Airflow&apos;</span>
<span class="p">)</span>
<span class="n">dag</span> <span class="o">=</span> <span class="n">DAG</span><span class="p">(</span><span class="s1">&apos;my_tz_dag&apos;</span><span class="p">,</span> <span class="n">default_args</span><span class="o">=</span><span class="n">default_args</span><span class="p">)</span>
<span class="n">op</span> <span class="o">=</span> <span class="n">DummyOperator</span><span class="p">(</span><span class="n">task_id</span><span class="o">=</span><span class="s1">&apos;dummy&apos;</span><span class="p">,</span> <span class="n">dag</span><span class="o">=</span><span class="n">dag</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="n">dag</span><span class="o">.</span><span class="n">timezone</span><span class="p">)</span> <span class="c1"># &lt;Timezone [Europe/Amsterdam]&gt;</span>
</pre>
</div>
</div>
<div class="section" id="templates">
<h3 class="sigil_not_in_toc">Templates</h3>
<p>Airflow returns time zone aware datetimes in templates, but does not convert them to local time so they remain in UTC.
It is left up to the DAG to handle this.</p>
<div class="code python highlight-default notranslate"><div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">pendulum</span>
<span class="n">local_tz</span> <span class="o">=</span> <span class="n">pendulum</span><span class="o">.</span><span class="n">timezone</span><span class="p">(</span><span class="s2">&quot;Europe/Amsterdam&quot;</span><span class="p">)</span>
<span class="n">local_tz</span><span class="o">.</span><span class="n">convert</span><span class="p">(</span><span class="n">execution_date</span><span class="p">)</span>
</pre>
</div>
</div>
</div>
<div class="section" id="cron-schedules">
<h3 class="sigil_not_in_toc">Cron schedules</h3>
<p>In case you set a cron schedule, Airflow assumes you will always want to run at the exact same time. It will
then ignore daylight saving time. Thus, if you have a schedule that says
run at the end of the interval every day at 08:00 GMT+1, it will always run at the end of the interval at 08:00 GMT+1,
regardless of whether daylight saving time is in place.</p>
</div>
<div class="section" id="time-deltas">
<h3 class="sigil_not_in_toc">Time deltas</h3>
<p>For schedules with time deltas, Airflow assumes you always want to run with the specified interval. So if you
specify a timedelta(hours=2), the next run will always be two hours later. In this case daylight saving time will
be taken into account.</p>
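<p>As a sketch (the DAG ids are hypothetical) contrasting the cron and timedelta behaviors described in this and the
previous section for a time zone aware <cite>start_date</cite>:</p>
<div class="code python highlight-default notranslate"><div class="highlight"><pre>from datetime import datetime, timedelta

import pendulum
from airflow import DAG

local_tz = pendulum.timezone("Europe/Amsterdam")
start = datetime(2016, 1, 1, tzinfo=local_tz)

# Cron schedule: per the docs above, runs stay at the same fixed time and ignore DST.
cron_dag = DAG('cron_tz_example', start_date=start, schedule_interval='0 8 * * *')

# timedelta schedule: per the docs above, the fixed interval is respected,
# so daylight saving time is taken into account.
delta_dag = DAG('delta_tz_example', start_date=start, schedule_interval=timedelta(hours=24))
</pre>
</div>
</div>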
</div>
</div>
</body>
</html>
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head><title></title>
<link href="../style/ebook.css" type="text/css" rel="stylesheet">
</head>
<body>
<h1>Experimental Rest API</h1>
<p>Airflow exposes an experimental Rest API. It is available through the webserver. Endpoints are
available at /api/experimental/. Please note that we expect the endpoint definitions to change.</p>
<div class="section" id="endpoints">
<h2 class="sigil_not_in_toc">Endpoints</h2>
<p>This is a placeholder until the swagger definitions are active; a usage sketch follows the list below.</p>
<ul class="simple">
<li>/api/experimental/dags/&lt;DAG_ID&gt;/tasks/&lt;TASK_ID&gt; returns info for a task (GET).</li>
<li>/api/experimental/dags/&lt;DAG_ID&gt;/dag_runs creates a dag_run for a given dag id (POST).</li>
</ul>
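<p>As a sketch of calling these endpoints with the <cite>requests</cite> library, assuming a webserver running on
localhost:8080; the dag id, task id and JSON payload below are hypothetical:</p>
<div class="code python highlight-default notranslate"><div class="highlight"><pre>import requests

BASE = "http://localhost:8080/api/experimental"

# Task info (GET)
resp = requests.get(BASE + "/dags/example_bash_operator/tasks/runme_0")
print(resp.json())

# Create a dag_run for a given dag id (POST); the body shape is an assumption.
resp = requests.post(
    BASE + "/dags/example_bash_operator/dag_runs",
    json={"conf": {"key": "value"}},
)
print(resp.status_code)
</pre>
</div>
</div>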
</div>
<div class="section" id="cli">
<h2 class="sigil_not_in_toc">CLI</h2>
<p>For some functions the CLI can use the API. To configure the CLI to use the API when available,
configure it as follows:</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span><span class="o">[</span>cli<span class="o">]</span>
<span class="nv">api_client</span> <span class="o">=</span> airflow.api.client.json_client
<span class="nv">endpoint_url</span> <span class="o">=</span> http://&lt;WEBSERVER&gt;:&lt;PORT&gt;
</pre>
</div>
</div>
</div>
<div class="section" id="authentication">
<h2 class="sigil_not_in_toc">Authentication</h2>
<p>Authentication for the API is handled separately from the Web Authentication. The default is to not
require any authentication on the API &#x2013; i.e. wide open by default. This is not recommended if your
Airflow webserver is publicly accessible, and you should probably use the deny all backend:</p>
<div class="highlight-ini notranslate"><div class="highlight"><pre><span></span><span class="k">[api]</span>
<span class="na">auth_backend</span> <span class="o">=</span> <span class="s">airflow.api.auth.backend.deny_all</span>
</pre>
</div>
</div>
<p>Two &#x201C;real&#x201D; methods for authentication are currently supported for the API.</p>
<p>To enable Password authentication, set the following in the configuration:</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span><span class="o">[</span>api<span class="o">]</span>
<span class="nv">auth_backend</span> <span class="o">=</span> airflow.contrib.auth.backends.password_auth
</pre>
</div>
</div>
<p>Its usage is similar to the Password Authentication used for the Web interface.</p>
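<p>Assuming password authentication is enabled as above, and that the experimental API then accepts HTTP basic
credentials of an existing Airflow user, a call could look like the following sketch (the credentials are hypothetical):</p>
<div class="code python highlight-default notranslate"><div class="highlight"><pre>import requests
from requests.auth import HTTPBasicAuth

resp = requests.get(
    "http://localhost:8080/api/experimental/dags/example_bash_operator/tasks/runme_0",
    auth=HTTPBasicAuth("airflow_user", "airflow_password"),  # hypothetical credentials
)
print(resp.status_code)
</pre>
</div>
</div>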
<p>To enable Kerberos authentication, set the following in the configuration:</p>
<div class="highlight-ini notranslate"><div class="highlight"><pre><span></span><span class="k">[api]</span>
<span class="na">auth_backend</span> <span class="o">=</span> <span class="s">airflow.api.auth.backend.kerberos_auth</span>
<span class="k">[kerberos]</span>
<span class="na">keytab</span> <span class="o">=</span> <span class="s">&lt;KEYTAB&gt;</span>
</pre>
</div>
</div>
<p>The Kerberos service is configured as <code class="docutils literal notranslate"><span class="pre">airflow/fully.qualified.domainname@REALM</span></code>. Make sure this
principal exists in the keytab file.</p>
</div>
</body>
</html>
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head><title></title>
<link href="../style/ebook.css" type="text/css" rel="stylesheet">
</head>
<body>
<h1>Lineage</h1>
<div class="admonition note">
<p class="first admonition-title">Note</p>
<p class="last">Lineage support is very experimental and subject to change.</p>
</div>
<p>Airflow can help track origins of data, what happens to it and where it moves over time. This can aid with
audit trails and data governance, and also with debugging data flows.</p>
<p>Airflow tracks data by means of inlets and outlets of the tasks. Let&#x2019;s work from an example and see how it
works.</p>
<div class="code python highlight-default notranslate"><div class="highlight"><pre><span></span><span class="kn">from</span> <span class="nn">airflow.operators.bash_operator</span> <span class="k">import</span> <span class="n">BashOperator</span>
<span class="kn">from</span> <span class="nn">airflow.operators.dummy_operator</span> <span class="k">import</span> <span class="n">DummyOperator</span>
<span class="kn">from</span> <span class="nn">airflow.lineage.datasets</span> <span class="k">import</span> <span class="n">File</span>
<span class="kn">from</span> <span class="nn">airflow.models</span> <span class="k">import</span> <span class="n">DAG</span>
<span class="kn">from</span> <span class="nn">datetime</span> <span class="k">import</span> <span class="n">timedelta</span>
<span class="n">FILE_CATEGORIES</span> <span class="o">=</span> <span class="p">[</span><span class="s2">&quot;CAT1&quot;</span><span class="p">,</span> <span class="s2">&quot;CAT2&quot;</span><span class="p">,</span> <span class="s2">&quot;CAT3&quot;</span><span class="p">]</span>
<span class="n">args</span> <span class="o">=</span> <span class="p">{</span>
<span class="s1">&apos;owner&apos;</span><span class="p">:</span> <span class="s1">&apos;airflow&apos;</span><span class="p">,</span>
<span class="s1">&apos;start_date&apos;</span><span class="p">:</span> <span class="n">airflow</span><span class="o">.</span><span class="n">utils</span><span class="o">.</span><span class="n">dates</span><span class="o">.</span><span class="n">days_ago</span><span class="p">(</span><span class="mi">2</span><span class="p">)</span>
<span class="p">}</span>
<span class="n">dag</span> <span class="o">=</span> <span class="n">DAG</span><span class="p">(</span>
<span class="n">dag_id</span><span class="o">=</span><span class="s1">&apos;example_lineage&apos;</span><span class="p">,</span> <span class="n">default_args</span><span class="o">=</span><span class="n">args</span><span class="p">,</span>
<span class="n">schedule_interval</span><span class="o">=</span><span class="s1">&apos;0 0 * * *&apos;</span><span class="p">,</span>
<span class="n">dagrun_timeout</span><span class="o">=</span><span class="n">timedelta</span><span class="p">(</span><span class="n">minutes</span><span class="o">=</span><span class="mi">60</span><span class="p">))</span>
<span class="n">f_final</span> <span class="o">=</span> <span class="n">File</span><span class="p">(</span><span class="s2">&quot;/tmp/final&quot;</span><span class="p">)</span>
<span class="n">run_this_last</span> <span class="o">=</span> <span class="n">DummyOperator</span><span class="p">(</span><span class="n">task_id</span><span class="o">=</span><span class="s1">&apos;run_this_last&apos;</span><span class="p">,</span> <span class="n">dag</span><span class="o">=</span><span class="n">dag</span><span class="p">,</span>
<span class="n">inlets</span><span class="o">=</span><span class="p">{</span><span class="s2">&quot;auto&quot;</span><span class="p">:</span> <span class="kc">True</span><span class="p">},</span>
<span class="n">outlets</span><span class="o">=</span><span class="p">{</span><span class="s2">&quot;datasets&quot;</span><span class="p">:</span> <span class="p">[</span><span class="n">f_final</span><span class="p">,]})</span>
<span class="n">f_in</span> <span class="o">=</span> <span class="n">File</span><span class="p">(</span><span class="s2">&quot;/tmp/whole_directory/&quot;</span><span class="p">)</span>
<span class="n">outlets</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">file</span> <span class="ow">in</span> <span class="n">FILE_CATEGORIES</span><span class="p">:</span>
<span class="n">f_out</span> <span class="o">=</span> <span class="n">File</span><span class="p">(</span><span class="s2">&quot;/tmp/</span><span class="si">{}</span><span class="s2">/{{{{ execution_date }}}}&quot;</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">file</span><span class="p">))</span>
<span class="n">outlets</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">f_out</span><span class="p">)</span>
<span class="n">run_this</span> <span class="o">=</span> <span class="n">BashOperator</span><span class="p">(</span>
<span class="n">task_id</span><span class="o">=</span><span class="s1">&apos;run_me_first&apos;</span><span class="p">,</span> <span class="n">bash_command</span><span class="o">=</span><span class="s1">&apos;echo 1&apos;</span><span class="p">,</span> <span class="n">dag</span><span class="o">=</span><span class="n">dag</span><span class="p">,</span>
<span class="n">inlets</span><span class="o">=</span><span class="p">{</span><span class="s2">&quot;datasets&quot;</span><span class="p">:</span> <span class="p">[</span><span class="n">f_in</span><span class="p">,]},</span>
<span class="n">outlets</span><span class="o">=</span><span class="p">{</span><span class="s2">&quot;datasets&quot;</span><span class="p">:</span> <span class="n">outlets</span><span class="p">}</span>
<span class="p">)</span>
<span class="n">run_this</span><span class="o">.</span><span class="n">set_downstream</span><span class="p">(</span><span class="n">run_this_last</span><span class="p">)</span>
</pre>
</div>
</div>
<p>Tasks take the parameters <cite>inlets</cite> and <cite>outlets</cite>. Inlets can be manually defined by a list of datasets <cite>{&#x201C;datasets&#x201D;:
[dataset1, dataset2]}</cite>, can be configured to look for outlets from upstream tasks <cite>{&#x201C;task_ids&#x201D;: [&#x201C;task_id1&#x201D;, &#x201C;task_id2&#x201D;]}</cite>,
can be configured to pick up outlets from direct upstream tasks <cite>{&#x201C;auto&#x201D;: True}</cite>, or a combination of these. Outlets
are defined as a list of datasets <cite>{&#x201C;datasets&#x201D;: [dataset1, dataset2]}</cite>. Any fields for the dataset are templated with
the context when the task is being executed.</p>
<div class="admonition note">
<p class="first admonition-title">Note</p>
<p class="last">Operators can add inlets and outlets automatically if the operator supports it.</p>
</div>
<p>In the example DAG, the task <cite>run_me_first</cite> is a BashOperator that produces 3 outlets: <cite>CAT1</cite>, <cite>CAT2</cite>, <cite>CAT3</cite>, which are
generated from a list. Note that <cite>execution_date</cite> is a templated field and will be rendered when the task is running.</p>
<div class="admonition note">
<p class="first admonition-title">Note</p>
<p class="last">Behind the scenes Airflow prepares the lineage metadata as part of the <cite>pre_execute</cite> method of a task. When the task
has finished execution <cite>post_execute</cite> is called and lineage metadata is pushed into XCOM. Thus if you are creating
your own operators that override this method make sure to decorate your method with <cite>prepare_lineage</cite> and <cite>apply_lineage</cite>
respectively.</p>
</div>
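<p>As a minimal sketch of the decorators mentioned in the note (the operator name is hypothetical), an operator that
overrides <cite>pre_execute</cite> and <cite>post_execute</cite> would decorate them like this:</p>
<div class="code python highlight-default notranslate"><div class="highlight"><pre>from airflow.lineage import apply_lineage, prepare_lineage
from airflow.models import BaseOperator


class MyLineageAwareOperator(BaseOperator):

    @prepare_lineage
    def pre_execute(self, context):
        # custom pre-execution logic; the decorator keeps lineage preparation intact
        self.log.info("preparing to run")

    def execute(self, context):
        pass

    @apply_lineage
    def post_execute(self, context, result=None):
        # custom post-execution logic; the decorator keeps the lineage push to XCom intact
        self.log.info("finished running")
</pre>
</div>
</div>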
<div class="section" id="apache-atlas">
<h2 class="sigil_not_in_toc">Apache Atlas</h2>
<p>Airflow can send its lineage metadata to Apache Atlas. You need to enable the <cite>atlas</cite> backend and configure it
properly, e.g. in your <cite>airflow.cfg</cite>:</p>
<div class="code python highlight-default notranslate"><div class="highlight"><pre><span></span><span class="p">[</span><span class="n">lineage</span><span class="p">]</span>
<span class="n">backend</span> <span class="o">=</span> <span class="n">airflow</span><span class="o">.</span><span class="n">lineage</span><span class="o">.</span><span class="n">backend</span><span class="o">.</span><span class="n">atlas</span>
<span class="p">[</span><span class="n">atlas</span><span class="p">]</span>
<span class="n">username</span> <span class="o">=</span> <span class="n">my_username</span>
<span class="n">password</span> <span class="o">=</span> <span class="n">my_password</span>
<span class="n">host</span> <span class="o">=</span> <span class="n">host</span>
<span class="n">port</span> <span class="o">=</span> <span class="mi">21000</span>
</pre>
</div>
</div>
<p>Please make sure to have the <cite>atlasclient</cite> package installed.</p>
</div>
</body>
</html>
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head><title></title>
<link href="../style/ebook.css" type="text/css" rel="stylesheet">
</head>
<body>
<h1>Quick Start</h1>
<p>The installation is quick and straightforward.</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span><span class="c1"># airflow needs a home, ~/airflow is the default,</span>
<span class="c1"># but you can lay foundation somewhere else if you prefer</span>
<span class="c1"># (optional)</span>
<span class="nb">export</span> <span class="nv">AIRFLOW_HOME</span><span class="o">=</span>~/airflow
<span class="c1"># install from pypi using pip</span>
pip install apache-airflow
<span class="c1"># initialize the database</span>
airflow initdb
<span class="c1"># start the web server, default port is 8080</span>
airflow webserver -p <span class="m">8080</span>
<span class="c1"># start the scheduler</span>
airflow scheduler
<span class="c1"># visit localhost:8080 in the browser and enable the example dag in the home page</span>
</pre>
</div>
</div>
<p>Upon running these commands, Airflow will create the <code class="docutils literal notranslate"><span class="pre">$AIRFLOW_HOME</span></code> folder
and lay an &#x201C;airflow.cfg&#x201D; file with defaults that get you going fast. You can
inspect the file either in <code class="docutils literal notranslate"><span class="pre">$AIRFLOW_HOME/airflow.cfg</span></code>, or through the UI in
the <code class="docutils literal notranslate"><span class="pre">Admin-&gt;Configuration</span></code> menu. The PID file for the webserver will be stored
in <code class="docutils literal notranslate"><span class="pre">$AIRFLOW_HOME/airflow-webserver.pid</span></code> or in <code class="docutils literal notranslate"><span class="pre">/run/airflow/webserver.pid</span></code>
if started by systemd.</p>
<p>Out of the box, Airflow uses a sqlite database, which you should outgrow
fairly quickly since no parallelization is possible using this database
backend. It works in conjunction with the <code class="docutils literal notranslate"><span class="pre">SequentialExecutor</span></code> which will
only run task instances sequentially. While this is very limiting, it allows
you to get up and running quickly and take a tour of the UI and the
command line utilities.</p>
<p>Here are a few commands that will trigger a few task instances. You should
be able to see the status of the jobs change in the <code class="docutils literal notranslate"><span class="pre">example_bash_operator</span></code> DAG as you
run the commands below.</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span><span class="c1"># run your first task instance</span>
airflow run example_bash_operator runme_0 <span class="m">2015</span>-01-01
<span class="c1"># run a backfill over 2 days</span>
airflow backfill example_bash_operator -s <span class="m">2015</span>-01-01 -e <span class="m">2015</span>-01-02
</pre>
</div>
</div>
<div class="section" id="what-s-next">
<h2 class="sigil_not_in_toc">What&#x2019;s Next?</h2>
<p>From this point, you can head to the <a class="reference internal" href="tutorial.html"><span class="doc">Tutorial</span></a> section for further examples or the <a class="reference internal" href="howto/index.html"><span class="doc">How-to Guides</span></a> section if you&#x2019;re ready to get your hands dirty.</p>
</div>
</body>
</html>
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head><title></title>
<link href="../style/ebook.css" type="text/css" rel="stylesheet">
</head>
<body>
<h1>FAQ</h1>
<div class="section" id="why-isn-t-my-task-getting-scheduled">
<h2 class="sigil_not_in_toc">Why isn&#x2019;t my task getting scheduled?</h2>
<p>There are many reasons why your task might not be getting scheduled.
Here are some of the common causes:</p>
<ul class="simple">
<li>Does your script &#x201C;compile&#x201D;? Can the Airflow engine parse it and find your
DAG object? To test this, you can run <code class="docutils literal notranslate"><span class="pre">airflow</span> <span class="pre">list_dags</span></code> and
confirm that your DAG shows up in the list. You can also run
<code class="docutils literal notranslate"><span class="pre">airflow</span> <span class="pre">list_tasks</span> <span class="pre">foo_dag_id</span> <span class="pre">--tree</span></code> and confirm that your task
shows up in the list as expected. If you use the CeleryExecutor, you
may want to confirm that this works both where the scheduler runs as well
as where the worker runs.</li>
<li>Does the file containing your DAG contain the strings &#x201C;airflow&#x201D; and &#x201C;DAG&#x201D; somewhere
in the contents? When searching the DAG directory, Airflow ignores files not containing
&#x201C;airflow&#x201D; and &#x201C;DAG&#x201D; in order to prevent the DagBag parsing from importing all python
files collocated with the user&#x2019;s DAGs.</li>
<li>Is your <code class="docutils literal notranslate"><span class="pre">start_date</span></code> set properly? The Airflow scheduler triggers the
task soon after the <code class="docutils literal notranslate"><span class="pre">start_date</span> <span class="pre">+</span> <span class="pre">schedule_interval</span></code> is passed.</li>
<li>Is your <code class="docutils literal notranslate"><span class="pre">schedule_interval</span></code> set properly? The default <code class="docutils literal notranslate"><span class="pre">schedule_interval</span></code>
is one day (<code class="docutils literal notranslate"><span class="pre">datetime.timedelta(1)</span></code>). You must specify a different <code class="docutils literal notranslate"><span class="pre">schedule_interval</span></code>
directly to the DAG object you instantiate, not as a <code class="docutils literal notranslate"><span class="pre">default_param</span></code>, as task instances
do not override their parent DAG&#x2019;s <code class="docutils literal notranslate"><span class="pre">schedule_interval</span></code>.</li>
<li>Is your <code class="docutils literal notranslate"><span class="pre">start_date</span></code> beyond where you can see it in the UI? If you
set your <code class="docutils literal notranslate"><span class="pre">start_date</span></code> to some time, say, 3 months ago, you won&#x2019;t be able to see
it in the main view in the UI, but you should be able to see it in
<code class="docutils literal notranslate"><span class="pre">Menu</span> <span class="pre">-&gt;</span> <span class="pre">Browse</span> <span class="pre">-&gt;</span> <span class="pre">Task</span> <span class="pre">Instances</span></code>.</li>
<li>Are the dependencies for the task met? The task instances directly
upstream from the task need to be in a <code class="docutils literal notranslate"><span class="pre">success</span></code> state. Also,
if you have set <code class="docutils literal notranslate"><span class="pre">depends_on_past=True</span></code>, the previous task instance
needs to have succeeded (except if it is the first run for that task).
Also, if <code class="docutils literal notranslate"><span class="pre">wait_for_downstream=True</span></code>, make sure you understand
what it means.
You can view how these properties are set from the <code class="docutils literal notranslate"><span class="pre">Task</span> <span class="pre">Instance</span> <span class="pre">Details</span></code>
page for your task.</li>
<li>Are the DagRuns you need created and active? A DagRun represents a specific
execution of an entire DAG and has a state (running, success, failed, &#x2026;).
The scheduler creates new DagRuns as it moves forward, but never goes back
in time to create new ones. The scheduler only evaluates <code class="docutils literal notranslate"><span class="pre">running</span></code> DagRuns
to see what task instances it can trigger. Note that clearing task
instances (from the UI or CLI) does set the state of a DagRun back to
running. You can bulk view the list of DagRuns and alter states by clicking
on the schedule tag for a DAG.</li>
<li>Is the <code class="docutils literal notranslate"><span class="pre">concurrency</span></code> parameter of your DAG reached? <code class="docutils literal notranslate"><span class="pre">concurrency</span></code> defines
how many <code class="docutils literal notranslate"><span class="pre">running</span></code> task instances a DAG is allowed to have, beyond which
point things get queued.</li>
<li>Is the <code class="docutils literal notranslate"><span class="pre">max_active_runs</span></code> parameter of your DAG reached? <code class="docutils literal notranslate"><span class="pre">max_active_runs</span></code> defines
how many concurrent <code class="docutils literal notranslate"><span class="pre">running</span></code> instances of a DAG are allowed (see the sketch after this list).</li>
</ul>
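<p>As referenced in the <code class="docutils literal notranslate"><span class="pre">schedule_interval</span></code> point above, here is a minimal sketch of a DAG that sets
<code class="docutils literal notranslate"><span class="pre">start_date</span></code> and <code class="docutils literal notranslate"><span class="pre">schedule_interval</span></code> directly on the DAG object.
The DAG id, dates and task below are placeholders chosen for illustration only:</p>
<div class="code python highlight-default notranslate"><div class="highlight"><pre>
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

dag = DAG(
    dag_id='example_schedule',            # placeholder name
    start_date=datetime(2018, 1, 1),      # a fixed, rounded start_date
    schedule_interval='0 * * * *',        # hourly; set on the DAG, not in default_args
    default_args={'owner': 'airflow'},
)

# A trivial task so the DAG has something to schedule.
noop = DummyOperator(task_id='noop', dag=dag)
</pre>
</div>
</div>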
<p>You may also want to read the Scheduler section of the docs and make
sure you fully understand how it proceeds.</p>
</div>
<div class="section" id="how-do-i-trigger-tasks-based-on-another-task-s-failure">
<h2 class="sigil_not_in_toc">How do I trigger tasks based on another task&#x2019;s failure?</h2>
<p>Check out the <code class="docutils literal notranslate"><span class="pre">Trigger</span> <span class="pre">Rule</span></code> section in the Concepts section of the
documentation</p>
</div>
<div class="section" id="why-are-connection-passwords-still-not-encrypted-in-the-metadata-db-after-i-installed-airflow-crypto">
<h2 class="sigil_not_in_toc">Why are connection passwords still not encrypted in the metadata db after I installed airflow[crypto]?</h2>
<p>Check out the <code class="docutils literal notranslate"><span class="pre">Connections</span></code> section in the Configuration section of the
documentation</p>
</div>
<div class="section" id="what-s-the-deal-with-start-date">
<h2 class="sigil_not_in_toc">What&#x2019;s the deal with <code class="docutils literal notranslate"><span class="pre">start_date</span></code>?</h2>
<p><code class="docutils literal notranslate"><span class="pre">start_date</span></code> is partly legacy from the pre-DagRun era, but it is still
relevant in many ways. When creating a new DAG, you probably want to set
a global <code class="docutils literal notranslate"><span class="pre">start_date</span></code> for your tasks using <code class="docutils literal notranslate"><span class="pre">default_args</span></code>. The first
DagRun to be created will be based on the <code class="docutils literal notranslate"><span class="pre">min(start_date)</span></code> for all your
tasks. From that point on, the scheduler creates new DagRuns based on
your <code class="docutils literal notranslate"><span class="pre">schedule_interval</span></code> and the corresponding task instances run as your
dependencies are met. When introducing new tasks to your DAG, you need to
pay special attention to <code class="docutils literal notranslate"><span class="pre">start_date</span></code>, and may want to reactivate
inactive DagRuns to get the new task onboarded properly.</p>
<p>We recommend against using dynamic values as <code class="docutils literal notranslate"><span class="pre">start_date</span></code>, especially
<code class="docutils literal notranslate"><span class="pre">datetime.now()</span></code> as it can be quite confusing. The task is triggered
once the period closes, and in theory an <code class="docutils literal notranslate"><span class="pre">@hourly</span></code> DAG would never get to
an hour after now as <code class="docutils literal notranslate"><span class="pre">now()</span></code> moves along.</p>
<p>Previously we also recommended using rounded <code class="docutils literal notranslate"><span class="pre">start_date</span></code> in relation to your
<code class="docutils literal notranslate"><span class="pre">schedule_interval</span></code>. This meant an <code class="docutils literal notranslate"><span class="pre">@hourly</span></code> would be at <code class="docutils literal notranslate"><span class="pre">00:00</span></code>
minutes:seconds, a <code class="docutils literal notranslate"><span class="pre">@daily</span></code> job at midnight, a <code class="docutils literal notranslate"><span class="pre">@monthly</span></code> job on the
first of the month. This is no longer required. Airflow will now auto align
the <code class="docutils literal notranslate"><span class="pre">start_date</span></code> and the <code class="docutils literal notranslate"><span class="pre">schedule_interval</span></code>, by using the <code class="docutils literal notranslate"><span class="pre">start_date</span></code>
as the moment to start looking.</p>
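<p>For illustration, a minimal sketch of a static, rounded <code class="docutils literal notranslate"><span class="pre">start_date</span></code> passed through
<code class="docutils literal notranslate"><span class="pre">default_args</span></code>; the DAG id and date are placeholders, and the point is simply that the value does not move the way <code class="docutils literal notranslate"><span class="pre">datetime.now()</span></code> would:</p>
<div class="code python highlight-default notranslate"><div class="highlight"><pre>
from datetime import datetime

from airflow import DAG

default_args = {
    'owner': 'airflow',
    # A static date, aligned to midnight for an @daily schedule.
    # Avoid dynamic values such as datetime.now() here.
    'start_date': datetime(2018, 1, 1),
}

dag = DAG(
    dag_id='example_static_start_date',   # placeholder name
    default_args=default_args,
    schedule_interval='@daily',
)
</pre>
</div>
</div>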
<p>You can use any sensor or a <code class="docutils literal notranslate"><span class="pre">TimeDeltaSensor</span></code> to delay
the execution of tasks within the schedule interval.
While <code class="docutils literal notranslate"><span class="pre">schedule_interval</span></code> does allow specifying a <code class="docutils literal notranslate"><span class="pre">datetime.timedelta</span></code>
object, we recommend using the macros or cron expressions instead, as
they enforce this idea of rounded schedules.</p>
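<p>As a sketch of delaying work within the schedule interval with a
<code class="docutils literal notranslate"><span class="pre">TimeDeltaSensor</span></code>; the task ids and the two-hour delay are placeholders, and the sensor&#x2019;s import path differs between Airflow versions:</p>
<div class="code python highlight-default notranslate"><div class="highlight"><pre>
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
# On older releases the sensor lives in airflow.operators.sensors instead.
from airflow.sensors.time_delta_sensor import TimeDeltaSensor

dag = DAG(
    dag_id='example_delayed_work',        # placeholder name
    start_date=datetime(2018, 1, 1),
    schedule_interval='@daily',
)

# Wait two hours into the schedule interval before running the real task.
wait = TimeDeltaSensor(task_id='wait_two_hours', delta=timedelta(hours=2), dag=dag)
do_work = DummyOperator(task_id='do_work', dag=dag)
wait.set_downstream(do_work)
</pre>
</div>
</div>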
<p>When using <code class="docutils literal notranslate"><span class="pre">depends_on_past=True</span></code> it&#x2019;s important to pay special attention
to <code class="docutils literal notranslate"><span class="pre">start_date</span></code>, as the past dependency is waived only for the schedule matching the
<code class="docutils literal notranslate"><span class="pre">start_date</span></code> specified for the task (i.e. its first run). It&#x2019;s also
important to watch DagRun activity status over time when introducing
new <code class="docutils literal notranslate"><span class="pre">depends_on_past=True</span></code> tasks, unless you are planning on running a backfill
for the new task(s).</p>
<p>Also important to note is that a task&#x2019;s <code class="docutils literal notranslate"><span class="pre">start_date</span></code>, in the context of a
backfill CLI command, gets overridden by the backfill command&#x2019;s <code class="docutils literal notranslate"><span class="pre">start_date</span></code>.
This allows a backfill on tasks that have <code class="docutils literal notranslate"><span class="pre">depends_on_past=True</span></code> to
actually start; if that weren&#x2019;t the case, the backfill just wouldn&#x2019;t start.</p>
</div>
<div class="section" id="how-can-i-create-dags-dynamically">
<h2 class="sigil_not_in_toc">How can I create DAGs dynamically?</h2>
<p>Airflow looks in your <code class="docutils literal notranslate"><span class="pre">DAGS_FOLDER</span></code> for modules that contain <code class="docutils literal notranslate"><span class="pre">DAG</span></code> objects
in their global namespace, and adds the objects it finds to the
<code class="docutils literal notranslate"><span class="pre">DagBag</span></code>. Knowing this, all we need is a way to dynamically assign
variables in the global namespace, which is easily done in Python using the
<code class="docutils literal notranslate"><span class="pre">globals()</span></code> function from the standard library, which behaves like a
simple dictionary.</p>
<div class="code python highlight-default notranslate"><div class="highlight"><pre><span></span><span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">10</span><span class="p">):</span>
<span class="n">dag_id</span> <span class="o">=</span> <span class="s1">&apos;foo_</span><span class="si">{}</span><span class="s1">&apos;</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">i</span><span class="p">)</span>
<span class="nb">globals</span><span class="p">()[</span><span class="n">dag_id</span><span class="p">]</span> <span class="o">=</span> <span class="n">DAG</span><span class="p">(</span><span class="n">dag_id</span><span class="p">)</span>
<span class="c1"># or better, call a function that returns a DAG object!</span>
</pre>
</div>
</div>
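<p>As the comment above suggests, a cleaner pattern is to have a function build each DAG and register it in
<code class="docutils literal notranslate"><span class="pre">globals()</span></code>. A minimal sketch, with placeholder names and dates:</p>
<div class="code python highlight-default notranslate"><div class="highlight"><pre>
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator


def create_dag(dag_id):
    """Build and return a DAG so the loop body stays small."""
    dag = DAG(
        dag_id=dag_id,
        start_date=datetime(2018, 1, 1),  # placeholder date
        schedule_interval='@daily',
    )
    DummyOperator(task_id='noop', dag=dag)
    return dag


for i in range(10):
    dag_id = 'foo_{}'.format(i)
    globals()[dag_id] = create_dag(dag_id)
</pre>
</div>
</div>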
</div>
<div class="section" id="what-are-all-the-airflow-run-commands-in-my-process-list">
<h2 class="sigil_not_in_toc">What are all the <code class="docutils literal notranslate"><span class="pre">airflow</span> <span class="pre">run</span></code> commands in my process list?</h2>
<p>There are many layers of <code class="docutils literal notranslate"><span class="pre">airflow</span> <span class="pre">run</span></code> commands, meaning it can call itself.</p>
<ul class="simple">
<li>Basic <code class="docutils literal notranslate"><span class="pre">airflow</span> <span class="pre">run</span></code>: fires up an executor and tells it to run an
<code class="docutils literal notranslate"><span class="pre">airflow</span> <span class="pre">run</span> <span class="pre">--local</span></code> command. If using Celery, this means it puts a
command in the queue for it to run remotely on the worker. If using
LocalExecutor, that translates into running it in a subprocess pool.</li>
<li>Local <code class="docutils literal notranslate"><span class="pre">airflow</span> <span class="pre">run</span> <span class="pre">--local</span></code>: starts an <code class="docutils literal notranslate"><span class="pre">airflow</span> <span class="pre">run</span> <span class="pre">--raw</span></code>
command (described below) as a subprocess and is in charge of
emitting heartbeats, listening for external kill signals
and ensuring some cleanup takes place if the subprocess fails.</li>
<li>Raw <code class="docutils literal notranslate"><span class="pre">airflow</span> <span class="pre">run</span> <span class="pre">--raw</span></code>: runs the actual operator&#x2019;s execute method and
performs the actual work</li>
</ul>
</div>
<div class="section" id="how-can-my-airflow-dag-run-faster">
<h2 class="sigil_not_in_toc">How can my airflow dag run faster?</h2>
<p>There are three variables we can control to improve Airflow DAG performance:</p>
<ul class="simple">
<li><code class="docutils literal notranslate"><span class="pre">parallelism</span></code>: This variable controls the number of task instances that an Airflow worker can run simultaneously. Users can increase the <code class="docutils literal notranslate"><span class="pre">parallelism</span></code> value in <code class="docutils literal notranslate"><span class="pre">airflow.cfg</span></code>.</li>
<li><code class="docutils literal notranslate"><span class="pre">concurrency</span></code>: The Airflow scheduler will run no more than <code class="docutils literal notranslate"><span class="pre">$concurrency</span></code> task instances for your DAG at any given time. Concurrency is defined in your Airflow DAG. If you do not set the concurrency on your DAG, the scheduler will use the default value from the <code class="docutils literal notranslate"><span class="pre">dag_concurrency</span></code> entry in your <code class="docutils literal notranslate"><span class="pre">airflow.cfg</span></code>.</li>
<li><code class="docutils literal notranslate"><span class="pre">max_active_runs</span></code>: The Airflow scheduler will run no more than <code class="docutils literal notranslate"><span class="pre">max_active_runs</span></code> DagRuns of your DAG at a given time. If you do not set <code class="docutils literal notranslate"><span class="pre">max_active_runs</span></code> in your DAG, the scheduler will use the default value from the <code class="docutils literal notranslate"><span class="pre">max_active_runs_per_dag</span></code> entry in your <code class="docutils literal notranslate"><span class="pre">airflow.cfg</span></code>. A sketch of setting the two DAG-level values follows this list.</li>
</ul>
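<p>As referenced above, a minimal sketch of the two DAG-level settings; the DAG id, dates and values are placeholders, and <code class="docutils literal notranslate"><span class="pre">parallelism</span></code> is configured in <code class="docutils literal notranslate"><span class="pre">airflow.cfg</span></code> rather than on the DAG:</p>
<div class="code python highlight-default notranslate"><div class="highlight"><pre>
from datetime import datetime

from airflow import DAG

dag = DAG(
    dag_id='example_tuned_dag',           # placeholder name
    start_date=datetime(2018, 1, 1),
    schedule_interval='@daily',
    concurrency=16,       # at most 16 running task instances for this DAG
    max_active_runs=4,    # at most 4 concurrent DagRuns of this DAG
)
</pre>
</div>
</div>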
</div>
<div class="section" id="how-can-we-reduce-the-airflow-ui-page-load-time">
<h2 class="sigil_not_in_toc">How can we reduce the airflow UI page load time?</h2>
<p>If your DAG takes a long time to load, you can reduce the value of the <code class="docutils literal notranslate"><span class="pre">default_dag_run_display_number</span></code> configuration in <code class="docutils literal notranslate"><span class="pre">airflow.cfg</span></code>. This setting controls the number of DAG runs shown in the UI, and its default value is 25.</p>
</div>
<div class="section" id="how-to-fix-exception-global-variable-explicit-defaults-for-timestamp-needs-to-be-on-1">
<h2 class="sigil_not_in_toc">How to fix Exception: Global variable explicit_defaults_for_timestamp needs to be on (1)?</h2>
<p>This means <code class="docutils literal notranslate"><span class="pre">explicit_defaults_for_timestamp</span></code> is disabled in your MySQL server and you need to enable it by:</p>
<ol class="arabic simple">
<li>Set <code class="docutils literal notranslate"><span class="pre">explicit_defaults_for_timestamp</span> <span class="pre">=</span> <span class="pre">1</span></code> under the mysqld section in your my.cnf file.</li>
<li>Restart the MySQL server.</li>
</ol>
</div>
<div class="section" id="how-to-reduce-airflow-dag-scheduling-latency-in-production">
<h2 class="sigil_not_in_toc">How to reduce airflow dag scheduling latency in production?</h2>
<ul class="simple">
<li><code class="docutils literal notranslate"><span class="pre">max_threads</span></code>: The scheduler spawns multiple threads in parallel to schedule DAGs. This is controlled by <code class="docutils literal notranslate"><span class="pre">max_threads</span></code>, with a default value of 2. In production, users should increase this value (e.g. to the number of CPUs on the machine where the scheduler runs, minus 1).</li>
<li><code class="docutils literal notranslate"><span class="pre">scheduler_heartbeat_sec</span></code>: Users should consider increasing the <code class="docutils literal notranslate"><span class="pre">scheduler_heartbeat_sec</span></code> config to a higher value (e.g. 60 seconds), which controls how frequently the Airflow scheduler heartbeats and updates the job&#x2019;s entry in the database.</li>
</ul>
</div>
</body>
</html>
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head><title></title>
<link href="../style/ebook.css" type="text/css" rel="stylesheet">
</head>
<body>
<h1>Installation</h1>
<div class="section" id="getting-airflow">
<h2 class="sigil_not_in_toc">Getting Airflow</h2>
<p>The easiest way to install the latest stable version of Airflow is with <code class="docutils literal notranslate"><span class="pre">pip</span></code>:</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>pip install apache-airflow
</pre>
</div>
</div>
<p>You can also install Airflow with support for extra features like <code class="docutils literal notranslate"><span class="pre">s3</span></code> or <code class="docutils literal notranslate"><span class="pre">postgres</span></code>:</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>pip install apache-airflow<span class="o">[</span>postgres,s3<span class="o">]</span>
</pre>
</div>
</div>
<div class="admonition note">
<p class="first admonition-title">Note</p>
<p>GPL dependency</p>
<p class="last">One of the dependencies of Apache Airflow by default pulls in a GPL library (&#x2018;unidecode&#x2019;).
In case this is a concern you can force a non GPL library by issuing
<code class="docutils literal notranslate"><span class="pre">export</span> <span class="pre">SLUGIFY_USES_TEXT_UNIDECODE=yes</span></code> and then proceed with the normal installation.
Please note that this needs to be specified at every upgrade. Also note that if <cite>unidecode</cite>
is already present on the system the dependency will still be used.</p>
</div>
</div>
<div class="section" id="extra-packages">
<h2 class="sigil_not_in_toc">Extra Packages</h2>
<p>The <code class="docutils literal notranslate"><span class="pre">apache-airflow</span></code> PyPI basic package only installs what&#x2019;s needed to get started.
Subpackages can be installed depending on what will be useful in your
environment. For instance, if you don&#x2019;t need connectivity with Postgres,
you won&#x2019;t have to go through the trouble of installing the <code class="docutils literal notranslate"><span class="pre">postgres-devel</span></code>
yum package, or whatever equivalent applies on the distribution you are using.</p>
<p>Behind the scenes, Airflow does conditional imports of operators that require
these extra dependencies.</p>
<p>Here&#x2019;s the list of the subpackages and what they enable:</p>
<table border="1" class="docutils">
<colgroup>
<col width="14%">
<col width="42%">
<col width="45%">
</colgroup>
<thead valign="bottom">
<tr class="row-odd"><th class="head">subpackage</th>
<th class="head">install command</th>
<th class="head">enables</th>
</tr>
</thead>
<tbody valign="top">
<tr class="row-even"><td>all</td>
<td><code class="docutils literal notranslate"><span class="pre">pip</span> <span class="pre">install</span> <span class="pre">apache-airflow[all]</span></code></td>
<td>All Airflow features known to man</td>
</tr>
<tr class="row-odd"><td>all_dbs</td>
<td><code class="docutils literal notranslate"><span class="pre">pip</span> <span class="pre">install</span> <span class="pre">apache-airflow[all_dbs]</span></code></td>
<td>All databases integrations</td>
</tr>
<tr class="row-even"><td>async</td>
<td><code class="docutils literal notranslate"><span class="pre">pip</span> <span class="pre">install</span> <span class="pre">apache-airflow[async]</span></code></td>
<td>Async worker classes for Gunicorn</td>
</tr>
<tr class="row-odd"><td>celery</td>
<td><code class="docutils literal notranslate"><span class="pre">pip</span> <span class="pre">install</span> <span class="pre">apache-airflow[celery]</span></code></td>
<td>CeleryExecutor</td>
</tr>
<tr class="row-even"><td>cloudant</td>
<td><code class="docutils literal notranslate"><span class="pre">pip</span> <span class="pre">install</span> <span class="pre">apache-airflow[cloudant]</span></code></td>
<td>Cloudant hook</td>
</tr>
<tr class="row-odd"><td>crypto</td>
<td><code class="docutils literal notranslate"><span class="pre">pip</span> <span class="pre">install</span> <span class="pre">apache-airflow[crypto]</span></code></td>
<td>Encrypt connection passwords in metadata db</td>
</tr>
<tr class="row-even"><td>devel</td>
<td><code class="docutils literal notranslate"><span class="pre">pip</span> <span class="pre">install</span> <span class="pre">apache-airflow[devel]</span></code></td>
<td>Minimum dev tools requirements</td>
</tr>
<tr class="row-odd"><td>devel_hadoop</td>
<td><code class="docutils literal notranslate"><span class="pre">pip</span> <span class="pre">install</span> <span class="pre">apache-airflow[devel_hadoop]</span></code></td>
<td>Airflow + dependencies on the Hadoop stack</td>
</tr>
<tr class="row-even"><td>druid</td>
<td><code class="docutils literal notranslate"><span class="pre">pip</span> <span class="pre">install</span> <span class="pre">apache-airflow[druid]</span></code></td>
<td>Druid related operators &amp; hooks</td>
</tr>
<tr class="row-odd"><td>gcp_api</td>
<td><code class="docutils literal notranslate"><span class="pre">pip</span> <span class="pre">install</span> <span class="pre">apache-airflow[gcp_api]</span></code></td>
<td>Google Cloud Platform hooks and operators
(using <code class="docutils literal notranslate"><span class="pre">google-api-python-client</span></code>)</td>
</tr>
<tr class="row-even"><td>hdfs</td>
<td><code class="docutils literal notranslate"><span class="pre">pip</span> <span class="pre">install</span> <span class="pre">apache-airflow[hdfs]</span></code></td>
<td>HDFS hooks and operators</td>
</tr>
<tr class="row-odd"><td>hive</td>
<td><code class="docutils literal notranslate"><span class="pre">pip</span> <span class="pre">install</span> <span class="pre">apache-airflow[hive]</span></code></td>
<td>All Hive related operators</td>
</tr>
<tr class="row-even"><td>jdbc</td>
<td><code class="docutils literal notranslate"><span class="pre">pip</span> <span class="pre">install</span> <span class="pre">apache-airflow[jdbc]</span></code></td>
<td>JDBC hooks and operators</td>
</tr>
<tr class="row-odd"><td>kerbero s</td>
<td><code class="docutils literal notranslate"><span class="pre">pip</span> <span class="pre">install</span> <span class="pre">apache-airflow[kerberos]</span></code></td>
<td>Kerberos integration for Kerberized Hadoop</td>
</tr>
<tr class="row-even"><td>ldap</td>
<td><code class="docutils literal notranslate"><span class="pre">pip</span> <span class="pre">install</span> <span class="pre">apache-airflow[ldap]</span></code></td>
<td>LDAP authentication for users</td>
</tr>
<tr class="row-odd"><td>mssql</td>
<td><code class="docutils literal notranslate"><span class="pre">pip</span> <span class="pre">install</span> <span class="pre">apache-airflow[mssql]</span></code></td>
<td>Microsoft SQL Server operators and hook,
support as an Airflow backend</td>
</tr>
<tr class="row-even"><td>mysql</td>
<td><code class="docutils literal notranslate"><span class="pre">pip</span> <span class="pre">install</span> <span class="pre">apache-airflow[mysql]</span></code></td>
<td>MySQL operators and hook, support as an Airflow
backend. The version of MySQL server has to be
5.6.4+. The exact version upper bound depends
on version of <code class="docutils literal notranslate"><span class="pre">mysqlclient</span></code> package. For
example, <code class="docutils literal notranslate"><span class="pre">mysqlclient</span></code> 1.3.12 can only be
used with MySQL server 5.6.4 through 5.7.</td>
</tr>
<tr class="row-odd"><td>password</td>
<td><code class="docutils literal notranslate"><span class="pre">pip</span> <span class="pre">install</span> <span class="pre">apache-airflow[password]</span></code></td>
<td>Password authentication for users</td>
</tr>
<tr class="row-even"><td>postgres</td>
<td><code class="docutils literal notranslate"><span class="pre">pip</span> <span class="pre">install</span> <span class="pre">apache-airflow[postgres]</span></code></td>
<td>PostgreSQL operators and hook, support as an
Airflow backend</td>
</tr>
<tr class="row-odd"><td>qds</td>
<td><code class="docutils literal notranslate"><span class="pre">pip</span> <span class="pre">install</span> <span class="pre">apache-airflow[qds]</span></code></td>
<td>Enable QDS (Qubole Data Service) support</td>
</tr>
<tr class="row-even"><td>rabbitmq</td>
<td><code class="docutils literal notranslate"><span class="pre">pip</span> <span class="pre">install</span> <span class="pre">apache-airflow[rabbitmq]</span></code></td>
<td>RabbitMQ support as a Celery backend</td>
</tr>
<tr class="row-odd"><td>redis</td>
<td><code class="docutils literal notranslate"><span class="pre">pip</span> <span class="pre">install</span> <span class="pre">apache-airflow[redis]</span></code></td>
<td>Redis hooks and sensors</td>
</tr>
<tr class="row-even"><td>s3</td>
<td><code class="docutils literal notranslate"><span class="pre">pip</span> <span class="pre">install</span> <span class="pre">apache-airflow[s3]</span></code></td>
<td><code class="docutils literal notranslate"><span class="pre">S3KeySensor</span></code>, <code class="docutils literal notranslate"><span class="pre">S3PrefixSensor</span></code></td>
</tr>
<tr class="row-odd"><td>samba</td>
<td><code class="docutils literal notranslate"><span class="pre">pip</span> <span class="pre">install</span> <span class="pre">apache-airflow[samba]</span></code></td>
<td><code class="docutils literal notranslate"><span class="pre">Hive2SambaOperator</span></code></td>
</tr>
<tr class="row-even"><td>slack</td>
<td><code class="docutils literal notranslate"><span class="pre">pip</span> <span class="pre">install</span> <span class="pre">apache-airflow[slack]</span></code></td>
<td><code class="docutils literal notranslate"><span class="pre">SlackAPIPostOperator</span></code></td>
</tr>
<tr class="row-odd"><td>ssh</td>
<td><code class="docutils literal notranslate"><span class="pre">pip</span> <span class="pre">install</span> <span class="pre">apache-airflow[ssh]</span></code></td>
<td>SSH hooks and Operator</td>
</tr>
<tr class="row-even"><td>vertica</td>
<td><code class="docutils literal notranslate"><span class="pre">pip</span> <span class="pre">install</span> <span class="pre">apache-airflow[vertica]</span></code></td>
<td>Vertica hook support as an Airflow backend</td>
</tr>
</tbody>
</table>
</div>
<div class="section" id="initiating-airflow-database">
<h2 class="sigil_not_in_toc">Initiating Airflow Database</h2>
<p>Airflow requires a database to be initiated before you can run tasks. If
you&#x2019;re just experimenting and learning Airflow, you can stick with the
default SQLite option. If you don&#x2019;t want to use SQLite, then take a look at
<a class="reference internal" href="howto/initialize-database.html"><span class="doc">Initializing a Database Backend</span></a> to setup a different database.</p>
<p>After configuration, you&#x2019;ll need to initialize the database before you can
run tasks:</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>airflow initdb
</pre>
</div>
</div>
</div>
</body>
</html>
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head><title></title>
<link href="../style/ebook.css" type="text/css" rel="stylesheet">
</head>
<body>
<h1>How-to Guides</h1>
<p>Setting up the sandbox in the <a class="reference internal" href="../start.html"><span class="doc">Quick Start</span></a> section was easy;
building a production-grade environment requires a bit more work!</p>
<p>These how-to guides will step you through common tasks in using and
configuring an Airflow environment.</p>
<div class="toctree-wrapper compound">
<ul>
<li class="toctree-l1"><a class="reference internal" href="set-config.html">Setting Configuration Options</a></li>
<li class="toctree-l1"><a class="reference internal" href="initialize-database.html">Initializing a Database Backend</a></li>
<li class="toctree-l1"><a class="reference internal" href="operator.html">Using Operators</a><ul>
<li class="toctree-l2"><a class="reference internal" href="operator.html#bashoperator">BashOperator</a></li>
<li class="toctree-l2"><a class="reference internal" href="operator.html#pythonoperator">PythonOperator</a></li>
<li class="toctree-l2"><a class="reference internal" href="operator.html#google-cloud-platform-operators">Google Cloud Platform Operators</a></li>
</ul>
</li>
<li class="toctree-l1"><a class="reference internal" href="manage-connections.html">Managing Connections</a><ul>
<li class="toctree-l2"><a class="reference internal" href="manage-connections.html#creating-a-connection-with-the-ui">Creating a Connection with the UI</a></li>
<li class="toctree-l2"><a class="reference internal" href="manage-connections.html#editing-a-connection-with-the-ui">Editing a Connection with the UI</a></li>
<li class="toctree-l2"><a class="reference internal" href="manage-connections.html#creating-a-connection-with-environment-variables">Creating a Connection with Environment Variables</a></li>
<li class="toctree-l2"><a class="reference internal" href="manage-connections.html#connection-types">Connection Types</a></li>
</ul>
</li>
<li class="toctree-l1"><a class="reference internal" href="secure-connections.html">Securing Connections</a></li>
<li class="toctree-l1"><a class="reference internal" href="write-logs.html">Writing Logs</a><ul>
<li class="toctree-l2"><a class="reference internal" href="write-logs.html#writing-logs-locally">Writing Logs Locally</a></li>
<li class="toctree-l2"><a class="reference internal" href="write-logs.html#writing-logs-to-amazon-s3">Writing Logs to Amazon S3</a></li>
<li class="toctree-l2"><a class="reference internal" href="write-logs.html#writing-logs-to-azure-blob-storage">Writing Logs to Azure Blob Storage</a></li>
<li class="toctree-l2"><a class="reference internal" href="write-logs.html#writing-logs-to-google-cloud-storage">Writing Logs to Google Cloud Storage</a></li>
</ul>
</li>
<li class="toctree-l1"><a class="reference internal" href="executor/use-celery.html">Scaling Out with Celery</a></li>
<li class="toctree-l1"><a class="reference internal" href="executor/use-dask.html">Scaling Out with Dask</a></li>
<li class="toctree-l1"><a class="reference internal" href="executor/use-mesos.html">Scaling Out with Mesos (community contributed)</a><ul>
<li class="toctree-l2"><a class="reference internal" href="executor/use-mesos.html#tasks-executed-directly-on-mesos-slaves">Tasks executed directly on mesos slaves</a></li>
<li class="toctree-l2"><a class="reference internal" href="executor/use-mesos.html#tasks-executed-in-containers-on-mesos-slaves">Tasks executed in containers on mesos slaves</a></li>
</ul>
</li>
<li class="toctree-l1"><a class="reference internal" href="run-with-systemd.html">Running Airflow with systemd</a></li>
<li class="toctree-l1"><a class="reference internal" href="run-with-upstart.html">Running Airflow with upstart</a></li>
<li class="toctree-l1"><a class="reference internal" href="use-test-config.html">Using the Test Mode Configuration</a></li>
</ul>
</div>
</body>
</html>
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head><title></title>
<link href="../style/ebook.css" type="text/css" rel="stylesheet">
</head>
<body>
<h1>Setting Configuration Options</h1>
<p>The first time you run Airflow, it will create a file called <code class="docutils literal notranslate"><span class="pre">airflow.cfg</span></code> in
your <code class="docutils literal notranslate"><span class="pre">$AIRFLOW_HOME</span></code> directory (<code class="docutils literal notranslate"><span class="pre">~/airflow</span></code> by default). This file contains Airflow&#x2019;s configuration and you
can edit it to change any of the settings. You can also set options with environment variables by using this format:
<code class="docutils literal notranslate"><span class="pre">$AIRFLOW__{SECTION}__{KEY}</span></code> (note the double underscores).</p>
<p>For example, the
metadata database connection string can either be set in <code class="docutils literal notranslate"><span class="pre">airflow.cfg</span></code> like this:</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span><span class="o">[</span>core<span class="o">]</span>
<span class="nv">sql_alchemy_conn</span> <span class="o">=</span> my_conn_string
</pre>
</div>
</div>
<p>or by creating a corresponding environment variable:</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span><span class="nv">AIRFLOW__CORE__SQL_ALCHEMY_CONN</span><span class="o">=</span>my_conn_string
</pre>
</div>
</div>
<p>You can also derive the connection string at run time by appending <code class="docutils literal notranslate"><span class="pre">_cmd</span></code> to the key like this:</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span><span class="o">[</span>core<span class="o">]</span>
<span class="nv">sql_alchemy_conn_cmd</span> <span class="o">=</span> bash_command_to_run
</pre>
</div>
</div>
<p>But only three such configuration elements, namely <code class="docutils literal notranslate"><span class="pre">sql_alchemy_conn</span></code>, <code class="docutils literal notranslate"><span class="pre">broker_url</span></code> and <code class="docutils literal notranslate"><span class="pre">result_backend</span></code>, can be fetched as a command. The idea behind this is to not store passwords on boxes in plain text files. The order of precedence is as follows:</p>
<ol class="arabic simple">
<li>environment variable</li>
<li>configuration in airflow.cfg</li>
<li>command in airflow.cfg</li>
<li>default</li>
</ol>
</body>
</html>
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head><title></title>
<link href="../style/ebook.css" type="text/css" rel="stylesheet">
</head>
<body>
<h1>Initializing a Database Backend</h1>
<p>If you want to take a real test drive of Airflow, you should consider
setting up a real database backend and switching to the LocalExecutor.</p>
<p>As Airflow was built to interact with its metadata using the great SqlAlchemy
library, you should be able to use any database backend supported as a
SqlAlchemy backend. We recommend using <strong>MySQL</strong> or <strong>Postgres</strong>.</p>
<div class="admonition note">
<p class="first admonition-title">Note</p>
<p class="last">We rely on more strict ANSI SQL settings for MySQL in order to have
sane defaults. Make sure to have specified <cite>explicit_defaults_for_timestamp=1</cite>
in your my.cnf under <cite>[mysqld]</cite></p>
</div>
<div class="admonition note">
<p class="first admonition-title">Note</p>
<p class="last">If you decide to use <strong>Postgres</strong>, we recommend using the <code class="docutils literal notranslate"><span class="pre">psycopg2</span></code>
driver and specifying it in your SqlAlchemy connection string.
Also note that since SqlAlchemy does not expose a way to target a
specific schema in the Postgres connection URI, you may
want to set a default schema for your role with a
command similar to <code class="docutils literal notranslate"><span class="pre">ALTER</span> <span class="pre">ROLE</span> <span class="pre">username</span> <span class="pre">SET</span> <span class="pre">search_path</span> <span class="pre">=</span> <span class="pre">airflow,</span> <span class="pre">foobar;</span></code></p>
</div>
<p>Once you&#x2019;ve set up your database to host Airflow, you&#x2019;ll need to alter the
SqlAlchemy connection string located in your configuration file
<code class="docutils literal notranslate"><span class="pre">$AIRFLOW_HOME/airflow.cfg</span></code>. You should then also change the &#x201C;executor&#x201D;
setting to use &#x201C;LocalExecutor&#x201D;, an executor that can parallelize task
instances locally.</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span><span class="c1"># initialize the database</span>
airflow initdb
</pre>
</div>
</div>
</body>
</html>