Airflow can be set up behind a reverse proxy, with the ability to set its endpoint flexibly.
For example, you can configure your reverse proxy to serve Airflow at:
https://lab.mycompany.com/myorg/airflow/
To do so, you need to set the following setting in your airflow.cfg:
base_url = http://my_host/myorg/airflow
Additionally, if you use the Celery executor, you can serve Flower at /myorg/flower with:
flower_url_prefix = /myorg/flower
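For reference, a minimal airflow.cfg sketch combining both settings could look like the following (assuming the standard layout where base_url lives under [webserver] and flower_url_prefix under [celery]):
[webserver]
base_url = http://my_host/myorg/airflow

[celery]
flower_url_prefix = /myorg/flower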
Your reverse proxy (e.g. nginx) should be configured as follows:
pass the URL and HTTP headers as-is to the Airflow webserver, without any rewrite, for example:
server {
    listen 80;
    server_name lab.mycompany.com;

    location /myorg/airflow/ {
        proxy_pass http://localhost:8080;
        proxy_set_header Host $host;
        proxy_redirect off;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
    }
}
rewrite the URL for the Flower endpoint:
server {
    listen 80;
    server_name lab.mycompany.com;

    location /myorg/flower/ {
        rewrite ^/myorg/flower/(.*)$ /$1 break;  # remove prefix from http header
        proxy_pass http://localhost:5555;
        proxy_set_header Host $host;
        proxy_redirect off;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
    }
}
Airflow has limited support for Microsoft Azure: interfaces exist only for Azure Blob Storage and Azure Data Lake. The hook, sensor and operator for Blob Storage, as well as the Azure Data Lake hook, are in the contrib section.
All classes communicate via the Windows Azure Storage Blob (wasb) protocol. Make sure that an Airflow connection of type wasb exists. Authorization can be done by supplying a login (=Storage account name) and password (=KEY), or a login and SAS token in the extra field (see connection wasb_default for an example).
class airflow.contrib.sensors.wasb_sensor.WasbBlobSensor(container_name, blob_name, wasb_conn_id='wasb_default', check_options=None, *args, **kwargs)
Bases: airflow.sensors.base_sensor_operator.BaseSensorOperator
Waits for a blob to arrive on Azure Blob Storage.
poke(context)
Function that the sensors defined while deriving this class should override.
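A minimal usage sketch, assuming an existing dag object and hypothetical container and blob names:
from airflow.contrib.sensors.wasb_sensor import WasbBlobSensor

wait_for_blob = WasbBlobSensor(
    task_id='wait_for_blob',
    container_name='mycontainer',   # hypothetical container
    blob_name='myfile.csv',         # hypothetical blob
    wasb_conn_id='wasb_default',
    poke_interval=60,
    dag=dag)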
class airflow.contrib.sensors.wasb_sensor.WasbPrefixSensor(container_name, prefix, wasb_conn_id='wasb_default', check_options=None, *args, **kwargs)
Bases: airflow.sensors.base_sensor_operator.BaseSensorOperator
Waits for blobs matching a prefix to arrive on Azure Blob Storage.
poke(context)
Function that the sensors defined while deriving this class should override.
class airflow.contrib.operators.file_to_wasb.FileToWasbOperator(file_path, container_name, blob_name, wasb_conn_id='wasb_default', load_options=None, *args, **kwargs)
Bases: airflow.models.BaseOperator
Uploads a file to Azure Blob Storage.
execute(context)
Upload a file to Azure Blob Storage.
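For example, a sketch that uploads a local file, with hypothetical paths and names and an existing dag object:
from airflow.contrib.operators.file_to_wasb import FileToWasbOperator

upload_report = FileToWasbOperator(
    task_id='upload_report',
    file_path='/tmp/report.csv',        # hypothetical local file
    container_name='mycontainer',       # hypothetical container
    blob_name='reports/report.csv',     # hypothetical blob name
    wasb_conn_id='wasb_default',
    dag=dag)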
class airflow.contrib.hooks.wasb_hook.WasbHook(wasb_conn_id='wasb_default')
Bases: airflow.hooks.base_hook.BaseHook
Interacts with Azure Blob Storage through the wasb:// protocol.
Additional options passed in the 'extra' field of the connection will be passed to the BlockBlobService() constructor. For example, authenticate using a SAS token by adding {"sas_token": "YOUR_TOKEN"}.
Parameters: wasb_conn_id (str) – Reference to the wasb connection.
check_for_blob(container_name, blob_name, **kwargs)
Check if a blob exists on Azure Blob Storage.
Returns: True if the blob exists, False otherwise.
Return type: bool
check_for_prefix(container_name, prefix, **kwargs)
Check if a prefix exists on Azure Blob storage.
Returns: True if blobs matching the prefix exist, False otherwise.
Return type: bool
get_conn()
Return the BlockBlobService object.
get_file(file_path, container_name, blob_name, **kwargs)
Download a file from Azure Blob Storage.
load_file(file_path, container_name, blob_name, **kwargs)
Upload a file to Azure Blob Storage.
load_string(string_data, container_name, blob_name, **kwargs)
Upload a string to Azure Blob Storage.
read_file(container_name, blob_name, **kwargs)
Read a file from Azure Blob Storage and return as a string.
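The hook can also be used directly from a PythonOperator callable. The sketch below uses only the methods documented above; the container and blob names are hypothetical:
from airflow.contrib.hooks.wasb_hook import WasbHook

def copy_blob_contents(**context):
    hook = WasbHook(wasb_conn_id='wasb_default')
    # hypothetical container and blob names
    if hook.check_for_blob('mycontainer', 'input/data.csv'):
        data = hook.read_file('mycontainer', 'input/data.csv')
        hook.load_string(data, 'mycontainer', 'archive/data.csv')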
Airflow can be configured to read and write task logs in Azure Blob Storage. See Writing Logs to Azure Blob Storage.
AzureDataLakeHook communicates via a REST API compatible with WebHDFS. Make sure that an Airflow connection of type azure_data_lake exists. Authorization can be done by supplying a login (=Client ID), password (=Client Secret) and extra fields tenant (Tenant) and account_name (Account Name)
(see connection azure_data_lake_default for an example).
class airflow.contrib.hooks.azure_data_lake_hook.AzureDataLakeHook(azure_data_lake_conn_id='azure_data_lake_default')
Bases: airflow.hooks.base_hook.BaseHook
Interacts with Azure Data Lake.
Client ID and client secret should be in the user and password parameters. Tenant and account name should be in the extra field as {"tenant": "<TENANT>", "account_name": "ACCOUNT_NAME"}.
Parameters: azure_data_lake_conn_id (str) – Reference to the Azure Data Lake connection.
check_for_file(file_path)
Check if a file exists on Azure Data Lake.
Parameters: file_path (str) – Path and name of the file.
Returns: True if the file exists, False otherwise.
Return type: bool
download_file(local_path, remote_path, nthreads=64, overwrite=True, buffersize=4194304, blocksize=4194304)
Download a file from Azure Data Lake.
get_conn()
Return an AzureDLFileSystem object.
upload_file(local_path, remote_path, nthreads=64, overwrite=True, buffersize=4194304, blocksize=4194304)
Upload a file to Azure Data Lake.
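A hedged sketch of using the hook from a PythonOperator callable, with hypothetical local and remote paths:
from airflow.contrib.hooks.azure_data_lake_hook import AzureDataLakeHook

def fetch_from_adl(**context):
    hook = AzureDataLakeHook(azure_data_lake_conn_id='azure_data_lake_default')
    # hypothetical remote path on the Data Lake store
    if hook.check_for_file('folder/input.csv'):
        hook.download_file(local_path='/tmp/input.csv',
                           remote_path='folder/input.csv')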
Airflow has extensive support for Amazon Web Services. But note that the Hooks, Sensors and Operators are in the contrib section.
class airflow.contrib.operators.emr_add_steps_operator.EmrAddStepsOperator(job_flow_id, aws_conn_id='s3_default', steps=None, *args, **kwargs)
Bases: airflow.models.BaseOperator
An operator that adds steps to an existing EMR job_flow.
class airflow.contrib.operators.emr_create_job_flow_operator.EmrCreateJobFlowOperator(aws_conn_id='s3_default', emr_conn_id='emr_default', job_flow_overrides=None, *args, **kwargs)
Bases: airflow.models.BaseOperator
Creates an EMR JobFlow, reading the config from the EMR connection. A dictionary of JobFlow overrides can be passed that override the config from the connection.
class airflow.contrib.operators.emr_terminate_job_flow_operator.EmrTerminateJobFlowOperator(job_flow_id, aws_conn_id='s3_default', *args, **kwargs)
Bases: airflow.models.BaseOperator
Operator to terminate EMR JobFlows.
class airflow.contrib.hooks.emr_hook.EmrHook(emr_conn_id=None, *args, **kwargs)
Bases: airflow.contrib.hooks.aws_hook.AwsHook
Interact with AWS EMR. emr_conn_id is only necessary for using the create_job_flow method.
create_job_flow(job_flow_overrides)
Creates a job flow using the config from the EMR connection. Keys of the extra JSON hash may be arguments to the boto3 run_job_flow method. Overrides for this config may be passed as job_flow_overrides.
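A hedged sketch of a typical EMR workflow built from these classes, assuming an existing dag object, a SPARK_STEPS list of boto3-style step definitions, and that job_flow_id can be templated from the create task's XCom (a common pattern, but verify against your Airflow version):
from airflow.contrib.operators.emr_create_job_flow_operator import EmrCreateJobFlowOperator
from airflow.contrib.operators.emr_add_steps_operator import EmrAddStepsOperator
from airflow.contrib.operators.emr_terminate_job_flow_operator import EmrTerminateJobFlowOperator

create_job_flow = EmrCreateJobFlowOperator(
    task_id='create_job_flow',
    aws_conn_id='aws_default',
    emr_conn_id='emr_default',
    dag=dag)

add_steps = EmrAddStepsOperator(
    task_id='add_steps',
    job_flow_id="{{ task_instance.xcom_pull('create_job_flow', key='return_value') }}",
    aws_conn_id='aws_default',
    steps=SPARK_STEPS,   # assumed list of boto3 step definitions
    dag=dag)

terminate_job_flow = EmrTerminateJobFlowOperator(
    task_id='terminate_job_flow',
    job_flow_id="{{ task_instance.xcom_pull('create_job_flow', key='return_value') }}",
    aws_conn_id='aws_default',
    dag=dag)

create_job_flow >> add_steps >> terminate_job_flow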
class airflow.hooks.S3_hook.S3Hook(aws_conn_id='aws_default')
Bases: airflow.contrib.hooks.aws_hook.AwsHook
Interact with AWS S3, using the boto3 library.
check_for_bucket(bucket_name)
Check if bucket_name exists.
Parameters: bucket_name (str) – the name of the bucket
check_for_key(key, bucket_name=None)
Checks if a key exists in a bucket
check_for_prefix(bucket_name, prefix, delimiter)
Checks that a prefix exists in a bucket
check_for_wildcard_key(wildcard_key, bucket_name=None, delimiter='')
Checks that a key matching a wildcard expression exists in a bucket
get_bucket(bucket_name)
Returns a boto3.S3.Bucket object
Parameters: bucket_name (str) – the name of the bucket
get_key(key, bucket_name=None)
Returns a boto3.s3.Object
get_wildcard_key(wildcard_key, bucket_name=None, delimiter='')
Returns a boto3.s3.Object object matching the wildcard expression
list_keys(bucket_name, prefix='', delimiter='', page_size=None, max_items=None)
Lists keys in a bucket under prefix and not containing delimiter
list_prefixes(bucket_name, prefix='', delimiter='', page_size=None, max_items=None)
Lists prefixes in a bucket under prefix
load_bytes(bytes_data, key, bucket_name=None, replace=False, encrypt=False)
Loads bytes to S3
This is provided as a convenience to drop bytes into S3. It uses the boto infrastructure to ship a file to S3.
load_file(filename, key, bucket_name=None, replace=False, encrypt=False)
Loads a local file to S3
load_string(string_data, key, bucket_name=None, replace=False, encrypt=False, encoding='utf-8')
Loads a string to S3
This is provided as a convenience to drop a string in S3. It uses the boto infrastructure to ship a file to S3.
read_key(key, bucket_name=None)
Reads a key from S3
select_key(key, bucket_name=None, expression='SELECT * FROM S3Object', expression_type='SQL', input_serialization={'CSV': {}}, output_serialization={'CSV': {}})
Reads a key with S3 Select.
Returns: retrieved subset of original data by S3 Select
Return type: str
See also
For more details about S3 Select parameters: http://boto3.readthedocs.io/en/latest/reference/services/s3.html#S3.Client.select_object_content
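As a rough illustration of the hook methods above, the following callable (bucket and key names are hypothetical) checks for a key, filters it with S3 Select and writes the result back:
from airflow.hooks.S3_hook import S3Hook

def summarize_key(**context):
    hook = S3Hook(aws_conn_id='aws_default')
    # hypothetical bucket and keys
    if hook.check_for_key('data/input.csv', bucket_name='my-bucket'):
        subset = hook.select_key(
            key='data/input.csv',
            bucket_name='my-bucket',
            expression='SELECT s._1 FROM S3Object s')
        hook.load_string(subset, key='data/output.csv',
                         bucket_name='my-bucket', replace=True)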
class airflow.operators.s3_file_transform_operator.S3FileTransformOperator(source_s3_key, dest_s3_key, transform_script=None, select_expression=None, source_aws_conn_id='aws_default', dest_aws_conn_id='aws_default', replace=False, *args, **kwargs)
Bases: airflow.models.BaseOperator
Copies data from a source S3 location to a temporary location on the local filesystem. Runs a transformation on this file as specified by the transformation script and uploads the output to a destination S3 location.
The locations of the source and the destination files in the local filesystem are provided as the first and second arguments to the transformation script. The transformation script is expected to read the data from source, transform it and write the output to the local destination file. The operator then takes over control and uploads the local destination file to S3.
S3 Select is also available to filter the source contents. Users can omit the transformation script if an S3 Select expression is specified.
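A hedged usage sketch, assuming an existing dag object; the S3 keys and the transformation script path are hypothetical:
from airflow.operators.s3_file_transform_operator import S3FileTransformOperator

transform_file = S3FileTransformOperator(
    task_id='transform_file',
    source_s3_key='s3://my-bucket/incoming/data.csv',   # hypothetical source key
    dest_s3_key='s3://my-bucket/processed/data.csv',    # hypothetical destination key
    transform_script='/usr/local/bin/transform.py',     # hypothetical script path
    source_aws_conn_id='aws_default',
    dest_aws_conn_id='aws_default',
    replace=True,
    dag=dag)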
class airflow.contrib.operators.s3_list_operator.S3ListOperator(bucket, prefix='', delimiter='', aws_conn_id='aws_default', *args, **kwargs)
Bases: airflow.models.BaseOperator
List all objects from the bucket with the given string prefix in name.
This operator returns a python list with the name of objects which can be used by xcom in the downstream task.
Example:
The following operator would list all the files (excluding subfolders) from the S3 customers/2018/04/ key in the data bucket.
s3_file = S3ListOperator(
    task_id='list_s3_files',
    bucket='data',
    prefix='customers/2018/04/',
    delimiter='/',
    aws_conn_id='aws_customers_conn'
)
class airflow.contrib.operators.s3_to_gcs_operator.S3ToGoogleCloudStorageOperator(bucket, prefix='', delimiter='', aws_conn_id='aws_default', dest_gcs_conn_id=None, dest_gcs=None, delegate_to=None, replace=False, *args, **kwargs)
Bases: airflow.contrib.operators.s3_list_operator.S3ListOperator
Synchronizes an S3 key, possibly a prefix, with a Google Cloud Storage destination path.
Example:
s3_to_gcs_op = S3ToGoogleCloudStorageOperator(
    task_id='s3_to_gcs_example',
    bucket='my-s3-bucket',
    prefix='data/customers-201804',
    dest_gcs_conn_id='google_cloud_default',
    dest_gcs='gs://my.gcs.bucket/some/customers/',
    replace=False,
    dag=my_dag)
Note that bucket, prefix, delimiter and dest_gcs are templated, so you can use variables in them if you wish.
class airflow.operators.s3_to_hive_operator.S3ToHiveTransfer(s3_key, field_dict, hive_table, delimiter=', ', create=True, recreate=False, partition=None, headers=False, check_headers=False, wildcard_match=False, aws_conn_id='aws_default', hive_cli_conn_id='hive_cli_default', input_compressed=False, tblproperties=None, select_expression=None, *args, **kwargs)
Bases: airflow.models.BaseOperator
Moves data from S3 to Hive. The operator downloads a file from S3, stores the file locally before loading it into a Hive table. If the create or recreate arguments are set to True, CREATE TABLE and DROP TABLE statements are generated. Hive data types are inferred from the cursor's metadata.
Note that the table generated in Hive uses STORED AS textfile, which isn't the most efficient serialization format. If a large amount of data is loaded and/or if the table gets queried considerably, you may want to use this operator only to stage the data into a temporary table before loading it into its final destination using a HiveOperator.
class airflow.contrib.operators.ecs_operator.ECSOperator(task_definition, cluster, overrides, aws_conn_id=None, region_name=None, launch_type='EC2', **kwargs)
Bases: airflow.models.BaseOperator
Execute a task on AWS EC2 Container Service
Param: overrides: the same parameter that boto3 will receive (templated): http://boto3.readthedocs.org/en/latest/reference/services/ecs.html#ECS.Client.run_task
Type: overrides: dict
Type: launch_type: str
class airflow.contrib.operators.awsbatch_operator.AWSBatchOperator(job_name, job_definition, job_queue, overrides, max_retries=4200, aws_conn_id=None, region_name=None, **kwargs)
Bases: airflow.models.BaseOperator
Execute a job on AWS Batch Service
Param: overrides: the same parameter that boto3 will receive on containerOverrides (templated): http://boto3.readthedocs.io/en/latest/reference/services/batch.html#submit_job
Type: overrides: dict
class airflow.contrib.sensors.aws_redshift_cluster_sensor.AwsRedshiftClusterSensor(cluster_identifier, target_status='available', aws_conn_id='aws_default', *args, **kwargs)
Bases: airflow.sensors.base_sensor_operator.BaseSensorOperator
Waits for a Redshift cluster to reach a specific status.
poke(context)
Function that the sensors defined while deriving this class should override.
class airflow.contrib.hooks.redshift_hook.RedshiftHook(aws_conn_id='aws_default')
Bases: airflow.contrib.hooks.aws_hook.AwsHook
Interact with AWS Redshift, using the boto3 library
cluster_status(cluster_identifier)
Return status of a cluster
Parameters: cluster_identifier (str) – unique identifier of a cluster
create_cluster_snapshot(snapshot_identifier, cluster_identifier)
Creates a snapshot of a cluster
delete_cluster(cluster_identifier, skip_final_cluster_snapshot=True, final_cluster_snapshot_identifier='')
Delete a cluster and optionally create a snapshot
describe_cluster_snapshots(cluster_identifier)
Gets a list of snapshots for a cluster
Parameters: cluster_identifier (str) – unique identifier of a cluster
restore_from_cluster_snapshot(cluster_identifier, snapshot_identifier)
Restores a cluster from its snapshot
class airflow.operators.redshift_to_s3_operator.RedshiftToS3Transfer(schema, table, s3_bucket, s3_key, redshift_conn_id='redshift_default', aws_conn_id='aws_default', unload_options=(), autocommit=False, parameters=None, include_header=False, *args, **kwargs)
Bases: airflow.models.BaseOperator
Executes an UNLOAD command to S3 as a CSV with headers
class airflow.operators.s3_to_redshift_operator.S3ToRedshiftTransfer(schema, table, s3_bucket, s3_key, redshift_conn_id='redshift_default', aws_conn_id='aws_default', copy_options=(), autocommit=False, parameters=None, *args, **kwargs)
Bases: airflow.models.BaseOperator
Executes a COPY command to load files from S3 into Redshift
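A hedged sketch showing both transfer operators together, assuming an existing dag object; schema, table, bucket and key names are hypothetical:
from airflow.operators.redshift_to_s3_operator import RedshiftToS3Transfer
from airflow.operators.s3_to_redshift_operator import S3ToRedshiftTransfer

unload_to_s3 = RedshiftToS3Transfer(
    task_id='unload_to_s3',
    schema='public',
    table='orders',              # hypothetical table
    s3_bucket='my-bucket',       # hypothetical bucket
    s3_key='exports/orders',
    redshift_conn_id='redshift_default',
    aws_conn_id='aws_default',
    dag=dag)

load_from_s3 = S3ToRedshiftTransfer(
    task_id='load_from_s3',
    schema='public',
    table='orders_copy',         # hypothetical table
    s3_bucket='my-bucket',
    s3_key='exports/orders',
    redshift_conn_id='redshift_default',
    aws_conn_id='aws_default',
    dag=dag)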
Databricks has contributed an Airflow operator which enables submitting runs to the Databricks platform. Internally the operator talks to the api/2.0/jobs/runs/submit endpoint.
class airflow.contrib.operators.databricks_operator.DatabricksSubmitRunOperator(json=None, spark_jar_task=None, notebook_task=None, new_cluster=None, existing_cluster_id=None, libraries=None, run_name=None, timeout_seconds=None, databricks_conn_id='databricks_default', polling_period_seconds=30, databricks_retry_limit=3, do_xcom_push=False, **kwargs)
Bases: airflow.models.BaseOperator
Submits a Spark job run to Databricks using the api/2.0/jobs/runs/submit API endpoint.
There are two ways to instantiate this operator.
In the first way, you can take the JSON payload that you typically use to call the api/2.0/jobs/runs/submit endpoint and pass it directly to our DatabricksSubmitRunOperator through the json parameter. For example:
json = {
'new_cluster': {
'spark_version': '2.1.0-db3-scala2.11',
'num_workers': 2
},
'notebook_task': {
'notebook_path': '/Users/airflow@example.com/PrepareData',
},
}
notebook_run = DatabricksSubmitRunOperator(task_id='notebook_run', json=json)
Another way to accomplish the same thing is to use the named parameters of the DatabricksSubmitRunOperator directly. Note that there is exactly one named parameter for each top level parameter in the runs/submit endpoint. In this method, your code would look like this:
new_cluster = {
'spark_version': '2.1.0-db3-scala2.11',
'num_workers': 2
}
notebook_task = {
'notebook_path': '/Users/airflow@example.com/PrepareData',
}
notebook_run = DatabricksSubmitRunOperator(
task_id='notebook_run',
new_cluster=new_cluster,
notebook_task=notebook_task)
In the case where both the json parameter AND the named parameters are provided, they will be merged together. If there are conflicts during the merge, the named parameters will take precedence and override the top level json keys; a sketch of this merge appears after the list of supported parameters below.
Currently the named parameters that DatabricksSubmitRunOperator supports are
spark_jar_task
notebook_task
new_cluster
existing_cluster_id
libraries
run_name
timeout_seconds
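As a sketch of the merge behaviour described above, the cluster spec comes from json while the notebook task is supplied as a named parameter; both end up in the final runs/submit payload:
json = {
    'new_cluster': {
        'spark_version': '2.1.0-db3-scala2.11',
        'num_workers': 2
    }
}
notebook_task = {
    'notebook_path': '/Users/airflow@example.com/PrepareData',
}
notebook_run = DatabricksSubmitRunOperator(
    task_id='notebook_run',
    json=json,
    notebook_task=notebook_task)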
Airflow has extensive support for the Google Cloud Platform. But note that most Hooks and Operators are in the contrib section, which means they have a beta status and can have breaking changes between minor releases.
See the GCP connection type documentation to configure connections to GCP.
Airflow can be configured to read and write task logs in Google Cloud Storage. See Writing Logs to Google Cloud Storage.
class airflow.contrib.operators.bigquery_check_operator.BigQueryCheckOperator(sql, bigquery_conn_id='bigquery_default', *args, **kwargs)
Bases: airflow.operators.check_operator.CheckOperator
Performs checks against BigQuery. The BigQueryCheckOperator expects a sql query that will return a single row. Each value on that first row is evaluated using python bool casting. If any of the values return False the check is failed and errors out.
Note that Python bool casting evals the following as False:
False
0
Empty string ("")
Empty list ([])
Empty dictionary or set ({})
Given a query like SELECT COUNT(*) FROM foo, it will fail only if the count == 0. You can craft much more complex queries that could, for instance, check that the table has the same number of rows as the source table upstream, or that the count of today's partition is greater than yesterday's partition, or that a set of metrics are less than 3 standard deviations for the 7 day average.
This operator can be used as a data quality check in your pipeline, and depending on where you put it in your DAG, you have the choice to stop the critical path, preventing it from publishing dubious data, or on the side and receive email alerts without stopping the progress of the DAG.
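A hedged sketch of using the operator as a data quality gate, assuming an existing dag object; the table name is hypothetical and the SQL dialect depends on your connection settings:
from airflow.contrib.operators.bigquery_check_operator import BigQueryCheckOperator

check_todays_partition = BigQueryCheckOperator(
    task_id='check_todays_partition',
    sql="SELECT COUNT(*) FROM my_dataset.my_table WHERE ds = '{{ ds }}'",   # hypothetical table
    bigquery_conn_id='bigquery_default',
    dag=dag)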
class airflow.contrib.operators.bigquery_check_operator.BigQueryValueCheckOperator(sql, pass_value, tolerance=None, bigquery_conn_id='bigquery_default', *args, **kwargs)
Bases: airflow.operators.check_operator.ValueCheckOperator
Performs a simple value check using sql code.
Parameters: sql (string) – the sql to be executed
class airflow.contrib.operators.bigquery_check_operator.BigQueryIntervalCheckOperator(table, metrics_thresholds, date_filter_column='ds', days_back=-7, bigquery_conn_id='bigquery_default', *args, **kwargs)
Bases: airflow.operators.check_operator.IntervalCheckOperator
Checks that the values of metrics given as SQL expressions are within a certain tolerance of the ones from days_back before.
This method constructs a query like so
SELECT {metrics_threshold_dict_key} FROM {table}
WHERE {date_filter_column}=<date>
class airflow.contrib.operators.bigquery_get_data.BigQueryGetDataOperator(dataset_id, table_id, max_results='100', selected_fields=None, bigquery_conn_id='bigquery_default', delegate_to=None, *args, **kwargs)
Bases: airflow.models.BaseOperator
Fetches the data from a BigQuery table (alternatively fetch data for selected columns) and returns data in a python list. The number of elements in the returned list will be equal to the number of rows fetched. Each element in the list will again be a list, where each element represents the column values for that row.
Example Result: [['Tony', '10'], ['Mike', '20'], ['Steve', '15']]
Note
If you pass fields to selected_fields which are in a different order than the order of columns already in the BQ table, the data will still be in the order of the BQ table. For example, if the BQ table has 3 columns as [A,B,C] and you pass 'B,A' in the selected_fields, the data would still be of the form 'A,B'.
Example:
get_data = BigQueryGetDataOperator(
task_id='get_data_from_bq',
dataset_id='test_dataset',
table_id='Transaction_partitions',
max_results='100',
selected_fields='DATE',
bigquery_conn_id='airflow-service-account'
)
class airflow.contrib.operators.bigquery_operator.BigQueryCreateEmptyTableOperator(dataset_id, table_id, project_id=None, schema_fields=None, gcs_schema_object=None, time_partitioning={}, bigquery_conn_id='bigquery_default', google_cloud_storage_conn_id='google_cloud_default', delegate_to=None, *args, **kwargs)
Bases: airflow.models.BaseOperator
Creates a new, empty table in the specified BigQuery dataset, optionally with schema.
The schema to be used for the BigQuery table may be specified in one of two ways. You may either directly pass the schema fields in, or you may point the operator to a Google cloud storage object name. The object in Google cloud storage must be a JSON file with the schema fields in it. You can also create a table without schema.
Example (with schema JSON in GCS):
CreateTable = BigQueryCreateEmptyTableOperator(
task_id='BigQueryCreateEmptyTableOperator_task',
dataset_id='ODS',
table_id='Employees',
project_id='internal-gcp-project',
gcs_schema_object='gs://schema-bucket/employee_schema.json',
bigquery_conn_id='airflow-service-account',
google_cloud_storage_conn_id='airflow-service-account'
)
Corresponding Schema file (employee_schema.json
):
[
{
"mode": "NULLABLE",
"name": "emp_name",
"type": "STRING"
},
{
"mode": "REQUIRED",
"name": "salary",
"type": "INTEGER"
}
]
Example (with schema in the DAG):
CreateTable = BigQueryCreateEmptyTableOperator(
task_id='BigQueryCreateEmptyTableOperator_task',
dataset_id='ODS',
table_id='Employees',
project_id='internal-gcp-project',
schema_fields=[{"name": "emp_name", "type": "STRING", "mode": "REQUIRED"},
{"name": "salary", "type": "INTEGER", "mode": "NULLABLE"}],
bigquery_conn_id='airflow-service-account',
google_cloud_storage_conn_id='airflow-service-account'
)
class airflow.contrib.operators.bigquery_operator.BigQueryCreateExternalTableOperator(bucket, source_objects, destination_project_dataset_table, schema_fields=None, schema_object=None, source_format='CSV', compression='NONE', skip_leading_rows=0, field_delimiter=', ', max_bad_records=0, quote_character=None, allow_quoted_newlines=False, allow_jagged_rows=False, bigquery_conn_id='bigquery_default', google_cloud_storage_conn_id='google_cloud_default', delegate_to=None, src_fmt_configs={}, *args, **kwargs)
Bases: airflow.models.BaseOperator
Creates a new external table in the dataset with the data in Google Cloud Storage.
The schema to be used for the BigQuery table may be specified in one of two ways. You may either directly pass the schema fields in, or you may point the operator to a Google cloud storage object name. The object in Google cloud storage must be a JSON file with the schema fields in it.
class airflow.contrib.operators.bigquery_operator.BigQueryOperator(bql=None, sql=None, destination_dataset_table=False, write_disposition='WRITE_EMPTY', allow_large_results=False, flatten_results=False, bigquery_conn_id='bigquery_default', delegate_to=None, udf_config=False, use_legacy_sql=True, maximum_billing_tier=None, maximum_bytes_billed=None, create_disposition='CREATE_IF_NEEDED', schema_update_options=(), query_params=None, priority='INTERACTIVE', time_partitioning={}, *args, **kwargs)
Bases: airflow.models.BaseOperator
Executes BigQuery SQL queries in a specific BigQuery database
class airflow.contrib.operators.bigquery_table_delete_operator.BigQueryTableDeleteOperator(deletion_dataset_table, bigquery_conn_id='bigquery_default', delegate_to=None, ignore_if_missing=False, *args, **kwargs)
Bases: airflow.models.BaseOperator
Deletes BigQuery tables
class airflow.contrib.operators.bigquery_to_bigquery.BigQueryToBigQueryOperator(source_project_dataset_tables, destination_project_dataset_table, write_disposition='WRITE_EMPTY', create_disposition='CREATE_IF_NEEDED', bigquery_conn_id='bigquery_default', delegate_to=None, *args, **kwargs)
Bases: airflow.models.BaseOperator
Copies data from one BigQuery table to another.
See also
For more details about these parameters: https://cloud.google.com/bigquery/docs/reference/v2/jobs#configuration.copy
class airflow.contrib.operators.bigquery_to_gcs.BigQueryToCloudStorageOperator(source_project_dataset_table, destination_cloud_storage_uris, compression='NONE', export_format='CSV', field_delimiter=', ', print_header=True, bigquery_conn_id='bigquery_default', delegate_to=None, *args, **kwargs)
Bases: airflow.models.BaseOperator
Transfers a BigQuery table to a Google Cloud Storage bucket.
See also
For more details about these parameters: https://cloud.google.com/bigquery/docs/reference/v2/jobs
class airflow.contrib.hooks.bigquery_hook.BigQueryHook(bigquery_conn_id='bigquery_default', delegate_to=None, use_legacy_sql=True)
Bases: airflow.contrib.hooks.gcp_api_base_hook.GoogleCloudBaseHook, airflow.hooks.dbapi_hook.DbApiHook, airflow.utils.log.logging_mixin.LoggingMixin
Interact with BigQuery. This hook uses the Google Cloud Platform connection.
get_conn()
Returns a BigQuery PEP 249 connection object.
get_pandas_df(sql, parameters=None, dialect=None)
Returns a Pandas DataFrame for the results produced by a BigQuery query. The DbApiHook method must be overridden because Pandas doesn’t support PEP 249 connections, except for SQLite. See:
https://github.com/pydata/pandas/blob/master/pandas/io/sql.py#L447 https://github.com/pydata/pandas/issues/6900
get_service()
Returns a BigQuery service object.
insert_rows(table, rows, target_fields=None, commit_every=1000)
Insertion is currently unsupported. Theoretically, you could use BigQuery’s streaming API to insert rows into a table, but this hasn’t been implemented.
table_exists(project_id, dataset_id, table_id)
Checks for the existence of a table in Google BigQuery.
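A hedged sketch of pulling query results into pandas from a PythonOperator callable; the dataset and table are hypothetical:
from airflow.contrib.hooks.bigquery_hook import BigQueryHook

def count_rows(**context):
    hook = BigQueryHook(bigquery_conn_id='bigquery_default', use_legacy_sql=False)
    # hypothetical dataset and table
    df = hook.get_pandas_df('SELECT COUNT(*) AS cnt FROM `my_dataset.my_table`')
    return int(df['cnt'][0])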
class airflow.contrib.operators.dataflow_operator.DataFlowJavaOperator(jar, dataflow_default_options=None, options=None, gcp_conn_id='google_cloud_default', delegate_to=None, poll_sleep=10, job_class=None, *args, **kwargs)
Bases: airflow.models.BaseOperator
Start a Java Cloud DataFlow batch job. The parameters of the operation will be passed to the job.
It’s a good practice to define dataflow_* parameters in the default_args of the dag like the project, zone and staging location.
default_args = {
'dataflow_default_options': {
'project': 'my-gcp-project',
'zone': 'europe-west1-d',
'stagingLocation': 'gs://my-staging-bucket/staging/'
}
}
You need to pass the path to your dataflow as a file reference with the jar parameter; the jar needs to be a self-executing jar (see documentation here: https://beam.apache.org/documentation/runners/dataflow/#self-executing-jar). Use options to pass on options to your job.
t1 = DataFlowJavaOperator(
    task_id='dataflow_example',
    jar='{{var.value.gcp_dataflow_base}}pipeline/build/libs/pipeline-example-1.0.jar',
    options={
        'autoscalingAlgorithm': 'BASIC',
        'maxNumWorkers': '50',
        'start': '{{ds}}',
        'partitionType': 'DAY',
        'labels': {'foo': 'bar'}
    },
    gcp_conn_id='gcp-airflow-service-account',
    dag=my_dag)
Both jar and options are templated so you can use variables in them.
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2016, 8, 1),
    'email': ['alex@vanboxel.be'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=30),
    'dataflow_default_options': {
        'project': 'my-gcp-project',
        'zone': 'us-central1-f',
        'stagingLocation': 'gs://bucket/tmp/dataflow/staging/',
    }
}
dag = DAG('test-dag', default_args=default_args)

task = DataFlowJavaOperator(
    gcp_conn_id='gcp_default',
    task_id='normalize-cal',
    jar='{{var.value.gcp_dataflow_base}}pipeline-ingress-cal-normalize-1.0.jar',
    options={
        'autoscalingAlgorithm': 'BASIC',
        'maxNumWorkers': '50',
        'start': '{{ds}}',
        'partitionType': 'DAY'
    },
    dag=dag)
class airflow.contrib.operators.dataflow_operator.DataflowTemplateOperator(template, dataflow_default_options=None, parameters=None, gcp_conn_id='google_cloud_default', delegate_to=None, poll_sleep=10, *args, **kwargs)
Bases: airflow.models.BaseOperator
Start a Templated Cloud DataFlow batch job. The parameters of the operation will be passed to the job. It’s a good practice to define dataflow_* parameters in the default_args of the dag like the project, zone and staging location.
See also
https://cloud.google.com/dataflow/docs/reference/rest/v1b3/LaunchTemplateParameters https://cloud.google.com/dataflow/docs/reference/rest/v1b3/RuntimeEnvironment
default_args = {
    'dataflow_default_options': {
        'project': 'my-gcp-project',
        'zone': 'europe-west1-d',
        'tempLocation': 'gs://my-staging-bucket/staging/'
    }
}
You need to pass the path to your dataflow template as a file reference with the template parameter. Use parameters to pass on parameters to your job. Use environment to pass on runtime environment variables to your job.
t1 = DataflowTemplateOperator(
    task_id='dataflow_example',
    template='{{var.value.gcp_dataflow_base}}',
    parameters={
        'inputFile': "gs://bucket/input/my_input.txt",
        'outputFile': "gs://bucket/output/my_output.txt"
    },
    gcp_conn_id='gcp-airflow-service-account',
    dag=my_dag)
template, dataflow_default_options and parameters are templated so you can use variables in them.
class airflow.contrib.operators.dataflow_operator.DataFlowPythonOperator(py_file, py_options=None, dataflow_default_options=None, options=None, gcp_conn_id='google_cloud_default', delegate_to=None, poll_sleep=10, *args, **kwargs)
Bases: airflow.models.BaseOperator
execute(context)
Execute the python dataflow job.
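The class is only briefly documented here; a hedged usage sketch, assuming an existing dag object, a hypothetical pipeline file and pipeline-specific option names:
from airflow.contrib.operators.dataflow_operator import DataFlowPythonOperator

run_python_pipeline = DataFlowPythonOperator(
    task_id='run_python_pipeline',
    py_file='gs://my-bucket/pipelines/wordcount.py',   # hypothetical pipeline file
    options={'output': 'gs://my-bucket/output/'},      # option names depend on your pipeline
    dataflow_default_options={'project': 'my-gcp-project'},
    gcp_conn_id='google_cloud_default',
    dag=dag)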
class airflow.contrib.hooks.gcp_dataflow_hook.DataFlowHook(gcp_conn_id='google_cloud_default', delegate_to=None, poll_sleep=10)
Bases: airflow.contrib.hooks.gcp_api_base_hook.GoogleCloudBaseHook
get_conn()
Returns a Google Cloud Dataflow service object.
class airflow.contrib.operators.dataproc_operator.DataprocClusterCreateOperator(cluster_name, project_id, num_workers, zone, network_uri=None, subnetwork_uri=None, internal_ip_only=None, tags=None, storage_bucket=None, init_actions_uris=None, init_action_timeout='10m', metadata=None, image_version=None, properties=None, master_machine_type='n1-standard-4', master_disk_size=500, worker_machine_type='n1-standard-4', worker_disk_size=500, num_preemptible_workers=0, labels=None, region='global', gcp_conn_id='google_cloud_default', delegate_to=None, service_account=None, service_account_scopes=None, idle_delete_ttl=None, auto_delete_time=None, auto_delete_ttl=None, *args, **kwargs)
Bases: airflow.models.BaseOperator
Create a new cluster on Google Cloud Dataproc. The operator will wait until the creation is successful or an error occurs in the creation process.
The parameters allow you to configure the cluster. Please refer to
https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.clusters
for a detailed explanation on the different parameters. Most of the configuration parameters detailed in the link are available as a parameter to this operator.
class airflow.contrib.operators.dataproc_operator.DataprocClusterScaleOperator(cluster_name, project_id, region='global', gcp_conn_id='google_cloud_default', delegate_to=None, num_workers=2, num_preemptible_workers=0, graceful_decommission_timeout=None, *args, **kwargs)
Bases: airflow.models.BaseOperator
Scale, up or down, a cluster on Google Cloud Dataproc. The operator will wait until the cluster is re-scaled.
Example:
t1 = DataprocClusterScaleOperator(
    task_id='dataproc_scale',
    project_id='my-project',
    cluster_name='cluster-1',
    num_workers=10,
    num_preemptible_workers=10,
    graceful_decommission_timeout='1h',
    dag=dag)
See also
For more detail about scaling clusters have a look at the reference: https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/scaling-clusters
class airflow.contrib.operators.dataproc_operator.DataprocClusterDeleteOperator(cluster_name, project_id, region='global', gcp_conn_id='google_cloud_default', delegate_to=None, *args, **kwargs)
Bases: airflow.models.BaseOperator
Delete a cluster on Google Cloud Dataproc. The operator will wait until the cluster is destroyed.
class airflow.contrib.operators.dataproc_operator.DataProcPigOperator(query=None, query_uri=None, variables=None, job_name='{{task.task_id}}_{{ds_nodash}}', cluster_name='cluster-1', dataproc_pig_properties=None, dataproc_pig_jars=None, gcp_conn_id='google_cloud_default', delegate_to=None, region='global', *args, **kwargs)
Bases: airflow.models.BaseOperator
Start a Pig query Job on a Cloud DataProc cluster. The parameters of the operation will be passed to the cluster.
It’s a good practice to define dataproc_* parameters in the default_args of the dag like the cluster name and UDFs.
default_args = {
'cluster_name': 'cluster-1',
'dataproc_pig_jars': [
'gs://example/udf/jar/datafu/1.2.0/datafu.jar',
'gs://example/udf/jar/gpig/1.2/gpig.jar'
]
}
You can pass a pig script as a string or file reference. Use variables to pass on variables for the pig script to be resolved on the cluster or use the parameters to be resolved in the script as template parameters.
Example:
t1 = DataProcPigOperator(
task_id='dataproc_pig',
query='a_pig_script.pig',
variables={'out': 'gs://example/output/{{ds}}'},
dag=dag)
See also
For more detail about job submission have a look at the reference: https://cloud.google.com/dataproc/reference/rest/v1/projects.regions.jobs
class airflow.contrib.operators.dataproc_operator.DataProcHiveOperator(query=None, query_uri=None, variables=None, job_name='{{task.task_id}}_{{ds_nodash}}', cluster_name='cluster-1', dataproc_hive_properties=None, dataproc_hive_jars=None, gcp_conn_id='google_cloud_default', delegate_to=None, region='global', *args, **kwargs)
Bases: airflow.models.BaseOperator
Start a Hive query Job on a Cloud DataProc cluster.
class airflow.contrib.operators.dataproc_operator.DataProcSparkSqlOperator(query=None, query_uri=None, variables=None, job_name='{{task.task_id}}_{{ds_nodash}}', cluster_name='cluster-1', dataproc_spark_properties=None, dataproc_spark_jars=None, gcp_conn_id='google_cloud_default', delegate_to=None, region='global', *args, **kwargs)
Bases: airflow.models.BaseOperator
Start a Spark SQL query Job on a Cloud DataProc cluster.
class airflow.contrib.operators.dataproc_operator.DataProcSparkOperator(main_jar=None, main_class=None, arguments=None, archives=None, files=None, job_name='{{task.task_id}}_{{ds_nodash}}', cluster_name='cluster-1', dataproc_spark_properties=None, dataproc_spark_jars=None, gcp_conn_id='google_cloud_default', delegate_to=None, region='global', *args, **kwargs)
Bases: airflow.models.BaseOperator
Start a Spark Job on a Cloud DataProc cluster.
class airflow.contrib.operators.dataproc_operator.DataProcHadoopOperator(main_jar=None, main_class=None, arguments=None, archives=None, files=None, job_name='{{task.task_id}}_{{ds_nodash}}', cluster_name='cluster-1', dataproc_hadoop_properties=None, dataproc_hadoop_jars=None, gcp_conn_id='google_cloud_default', delegate_to=None, region='global', *args, **kwargs)
Bases: airflow.models.BaseOperator
Start a Hadoop Job on a Cloud DataProc cluster.
class airflow.contrib.operators.dataproc_operator.DataProcPySparkOperator(main, arguments=None, archives=None, pyfiles=None, files=None, job_name='{{task.task_id}}_{{ds_nodash}}', cluster_name='cluster-1', dataproc_pyspark_properties=None, dataproc_pyspark_jars=None, gcp_conn_id='google_cloud_default', delegate_to=None, region='global', *args, **kwargs)
Bases: airflow.models.BaseOperator
Start a PySpark Job on a Cloud DataProc cluster.
class airflow.contrib.operators.dataproc_operator.DataprocWorkflowTemplateInstantiateOperator(template_id, *args, **kwargs)
Bases: airflow.contrib.operators.dataproc_operator.DataprocWorkflowTemplateBaseOperator
Instantiate a WorkflowTemplate on Google Cloud Dataproc. The operator will wait until the WorkflowTemplate is finished executing.
See also
Please refer to: https://cloud.google.com/dataproc/docs/reference/rest/v1beta2/projects.regions.workflowTemplates/instantiate
class airflow.contrib.operators.dataproc_operator.DataprocWorkflowTemplateInstantiateInlineOperator(template, *args, **kwargs)
Bases: airflow.contrib.operators.dataproc_operator.DataprocWorkflowTemplateBaseOperator
Instantiate a WorkflowTemplate Inline on Google Cloud Dataproc. The operator will wait until the WorkflowTemplate is finished executing.
See also
Please refer to: https://cloud.google.com/dataproc/docs/reference/rest/v1beta2/projects.regions.workflowTemplates/instantiateInline
class airflow.contrib.operators.datastore_export_operator.DatastoreExportOperator(bucket, namespace=None, datastore_conn_id='google_cloud_default', cloud_storage_conn_id='google_cloud_default', delegate_to=None, entity_filter=None, labels=None, polling_interval_in_seconds=10, overwrite_existing=False, xcom_push=False, *args, **kwargs)
Bases: airflow.models.BaseOperator
Export entities from Google Cloud Datastore to Cloud Storage
class airflow.contrib.operators.datastore_import_operator.DatastoreImportOperator(bucket, file, namespace=None, entity_filter=None, labels=None, datastore_conn_id='google_cloud_default', delegate_to=None, polling_interval_in_seconds=10, xcom_push=False, *args, **kwargs)
Bases: airflow.models.BaseOperator
Import entities from Cloud Storage to Google Cloud Datastore
class airflow.contrib.hooks.datastore_hook.DatastoreHook(datastore_conn_id='google_cloud_datastore_default', delegate_to=None)
Bases: airflow.contrib.hooks.gcp_api_base_hook.GoogleCloudBaseHook
Interact with Google Cloud Datastore. This hook uses the Google Cloud Platform connection.
This object is not thread safe. If you want to make multiple requests simultaneously, you will need to create a hook per thread.
allocate_ids(partialKeys)
Allocate IDs for incomplete keys. See https://cloud.google.com/datastore/docs/reference/rest/v1/projects/allocateIds
Parameters: partialKeys – a list of partial keys
Returns: a list of full keys.
begin_transaction()
Get a new transaction handle
Returns: a transaction handle
commit(body)
Commit a transaction, optionally creating, deleting or modifying some entities.
Parameters: body – the body of the commit request
Returns: the response body of the commit request
delete_operation(name)
Deletes the long-running operation
Parameters: name – the name of the operation resource
export_to_storage_bucket(bucket, namespace=None, entity_filter=None, labels=None)
Export entities from Cloud Datastore to Cloud Storage for backup
get_conn(version='v1')
Returns a Google Cloud Datastore service object.
get_operation(name)
Gets the latest state of a long-running operation
Parameters: name – the name of the operation resource
import_from_storage_bucket(bucket, file, namespace=None, entity_filter=None, labels=None)
Import a backup from Cloud Storage to Cloud Datastore
lookup(keys, read_consistency=None, transaction=None)
Lookup some entities by key
Returns: the response body of the lookup request.
poll_operation_until_done(name, polling_interval_in_seconds)
Poll backup operation state until it’s completed
rollback(transaction)
Roll back a transaction
Parameters: transaction – the transaction to roll back
run_query(body)
Run a query for entities.
Parameters: body – the body of the query request
Returns: the batch of query results.
class airflow.contrib.operators.mlengine_operator.MLEngineBatchPredictionOperator(project_id, job_id, region, data_format, input_paths, output_path, model_name=None, version_name=None, uri=None, max_worker_count=None, runtime_version=None, gcp_conn_id='google_cloud_default', delegate_to=None, *args, **kwargs)
Bases: airflow.models.BaseOperator
Start a Google Cloud ML Engine prediction job.
NOTE: For model origin, users should consider exactly one of the three options below:
1. Populate 'uri' field only, which should be a GCS location that points to a tensorflow savedModel directory.
2. Populate 'model_name' field only, which refers to an existing model, and the default version of the model will be used.
3. Populate both 'model_name' and 'version_name' fields, which refers to a specific version of a specific model.
In options 2 and 3, both model and version name should contain the minimal identifier. For instance, call
MLEngineBatchPredictionOperator(
...,
model_name='my_model',
version_name='my_version',
...)
if the desired model version is "projects/my_project/models/my_model/versions/my_version".
See https://cloud.google.com/ml-engine/reference/rest/v1/projects.jobs for further documentation on the parameters.
Raises: ValueError: if a unique model/version origin cannot be determined.
class airflow.contrib.operators.mlengine_operator.MLEngineModelOperator(project_id, model, operation='create', gcp_conn_id='google_cloud_default', delegate_to=None, *args, **kwargs)
Bases: airflow.models.BaseOperator
Operator for managing a Google Cloud ML Engine model.
class airflow.contrib.operators.mlengine_operator.MLEngineTrainingOperator(project_id, job_id, package_uris, training_python_module, training_args, region, scale_tier=None, runtime_version=None, python_version=None, job_dir=None, gcp_conn_id='google_cloud_default', delegate_to=None, mode='PRODUCTION', *args, **kwargs)
Bases: airflow.models.BaseOperator
Operator for launching an MLEngine training job.
class airflow.contrib.operators.mlengine_operator.MLEngineVersionOperator(project_id, model_name, version_name=None, version=None, operation='create', gcp_conn_id='google_cloud_default', delegate_to=None, *args, **kwargs)
Bases: airflow.models.BaseOperator
Operator for managing a Google Cloud ML Engine version.
class airflow.contrib.hooks.gcp_mlengine_hook.MLEngineHook(gcp_conn_id='google_cloud_default', delegate_to=None)
Bases: airflow.contrib.hooks.gcp_api_base_hook.GoogleCloudBaseHook
create_job(project_id, job, use_existing_job_fn=None)
Launches an MLEngine job and waits for it to reach a terminal state.
Returns: The MLEngine job object if the job successfully reaches a terminal state (which might be the FAILED or CANCELLED state).
Return type: dict
create_model(project_id, model)
Create a Model. Blocks until finished.
create_version(project_id, model_name, version_spec)
Creates the Version on Google Cloud ML Engine.
Returns the operation if the version was created successfully and raises an error otherwise.
delete_version(project_id, model_name, version_name)
Deletes the given version of a model. Blocks until finished.
get_conn()
Returns a Google MLEngine service object.
get_model(project_id, model_name)
Gets a Model. Blocks until finished.
list_versions(project_id, model_name)
Lists all available versions of a model. Blocks until finished.
set_default_version(project_id, model_name, version_name)
Sets a version to be the default. Blocks until finished.
class airflow.contrib.operators.file_to_gcs.FileToGoogleCloudStorageOperator(src, dst, bucket, google_cloud_storage_conn_id='google_cloud_default', mime_type='application/octet-stream', delegate_to=None, *args, **kwargs)
Bases: airflow.models.BaseOperator
Uploads a file to Google Cloud Storage
execute(context)
Uploads the file to Google cloud storage
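A hedged usage sketch, assuming an existing dag object; the local file, object name and bucket are hypothetical:
from airflow.contrib.operators.file_to_gcs import FileToGoogleCloudStorageOperator

upload_file = FileToGoogleCloudStorageOperator(
    task_id='upload_file',
    src='/tmp/report.csv',            # hypothetical local file
    dst='reports/report.csv',         # hypothetical object name
    bucket='my-gcs-bucket',           # hypothetical bucket
    google_cloud_storage_conn_id='google_cloud_default',
    mime_type='text/csv',
    dag=dag)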
class airflow.contrib.operators.gcs_operator.GoogleCloudStorageCreateBucketOperator(bucket_name, storage_class='MULTI_REGIONAL', location='US', project_id=None, labels=None, google_cloud_storage_conn_id='google_cloud_default', delegate_to=None, *args, **kwargs)
Bases: airflow.models.BaseOperator
Creates a new bucket. Google Cloud Storage uses a flat namespace, so you can’t create a bucket with a name that is already in use.
See also
For more information, see Bucket Naming Guidelines: https://cloud.google.com/storage/docs/bucketnaming.html#requirements
Example:
The following operator would create a new bucket test-bucket with MULTI_REGIONAL storage class in the EU region:
CreateBucket = GoogleCloudStorageCreateBucketOperator(
task_id='CreateNewBucket',
bucket_name='test-bucket',
storage_class='MULTI_REGIONAL',
location='EU',
labels={'env': 'dev', 'team': 'airflow'},
google_cloud_storage_conn_id='airflow-service-account'
)
class airflow.contrib.operators.gcs_download_operator.GoogleCloudStorageDownloadOperator(bucket, object, filename=None, store_to_xcom_key=None, google_cloud_storage_conn_id='google_cloud_default', delegate_to=None, *args, **kwargs)
Bases: airflow.models.BaseOperator
Downloads a file from Google Cloud Storage.
class airflow.contrib.operators.gcs_list_operator.GoogleCloudStorageListOperator(bucket, prefix=None, delimiter=None, google_cloud_storage_conn_id='google_cloud_default', delegate_to=None, *args, **kwargs)
Bases: airflow.models.BaseOperator
List all objects from the bucket with the given string prefix and delimiter in name.
This operator returns a python list with the name of objects which can be used by XCom in the downstream task.
Example:
The following operator would list all the Avro files from the sales/sales-2017 folder in the data bucket:
GCS_Files = GoogleCloudStorageListOperator(
task_id='GCS_Files',
bucket='data',
prefix='sales/sales-2017/',
delimiter='.avro',
google_cloud_storage_conn_id=google_cloud_conn_id
)
class airflow.contrib.operators.gcs_to_bq.GoogleCloudStorageToBigQueryOperator(bucket, source_objects, destination_project_dataset_table, schema_fields=None, schema_object=None, source_format='CSV', compression='NONE', create_disposition='CREATE_IF_NEEDED', skip_leading_rows=0, write_disposition='WRITE_EMPTY', field_delimiter=', ', max_bad_records=0, quote_character=None, ignore_unknown_values=False, allow_quoted_newlines=False, allow_jagged_rows=False, max_id_key=None, bigquery_conn_id='bigquery_default', google_cloud_storage_conn_id='google_cloud_default', delegate_to=None, schema_update_options=(), src_fmt_configs={}, external_table=False, time_partitioning={}, *args, **kwargs)
Bases: airflow.models.BaseOperator
Loads files from Google cloud storage into BigQuery.
The schema to be used for the BigQuery table may be specified in one of two ways. You may either directly pass the schema fields in, or you may point the operator to a Google cloud storage object name. The object in Google cloud storage must be a JSON file with the schema fields in it.
class airflow.contrib.operators.gcs_to_gcs.GoogleCloudStorageToGoogleCloudStorageOperator(source_bucket, source_object, destination_bucket=None, destination_object=None, move_object=False, google_cloud_storage_conn_id='google_cloud_default', delegate_to=None, *args, **kwargs)
Bases: airflow.models.BaseOperator
Copies objects from a bucket to another, with renaming if requested.
Parameters:
destination_bucket (string) – The destination Google cloud storage bucket where the object should be. (templated)
destination_object (string) – The destination name of the object in the destination Google cloud storage bucket. (templated) If a wildcard is supplied in the source_object argument, this is the prefix that will be prepended to the final destination objects' paths. Note that the source path's part before the wildcard will be removed; if it needs to be retained it should be appended to destination_object. For example, with prefix foo/* and destination_object 'blah/', the file foo/baz will be copied to blah/baz; to retain the prefix write the destination_object as e.g. blah/foo, in which case the copied file will be named blah/foo/baz.
move_object – When move_object is True, the object is moved instead of copied to the new location. This is the equivalent of a mv command as opposed to a cp command.
Examples:
The following operator would copy a single file named sales/sales-2017/january.avro in the data bucket to the file named copied_sales/2017/january-backup.avro in the data_backup bucket:
copy_single_file = GoogleCloudStorageToGoogleCloudStorageOperator(
task_id='copy_single_file',
source_bucket='data',
source_object='sales/sales-2017/january.avro',
destination_bucket='data_backup',
destination_object='copied_sales/2017/january-backup.avro',
google_cloud_storage_conn_id=google_cloud_conn_id
)
The following operator would copy all the Avro files from the sales/sales-2017 folder (i.e. with names starting with that prefix) in the data bucket to the copied_sales/2017 folder in the data_backup bucket:
copy_files = GoogleCloudStorageToGoogleCloudStorageOperator(
task_id='copy_files',
source_bucket='data',
source_object='sales/sales-2017/*.avro',
destination_bucket='data_backup',
destination_object='copied_sales/2017/',
google_cloud_storage_conn_id=google_cloud_conn_id
)
The following operator would move all the Avro files from the sales/sales-2017 folder (i.e. with names starting with that prefix) in the data bucket to the same folder in the data_backup bucket, deleting the original files in the process:
move_files = GoogleCloudStorageToGoogleCloudStorageOperator(
task_id='move_files',
source_bucket='data',
source_object='sales/sales-2017/*.avro',
destination_bucket='data_backup',
move_object=True,
google_cloud_storage_conn_id=google_cloud_conn_id
)
class airflow.contrib.hooks.gcs_hook.GoogleCloudStorageHook(google_cloud_storage_conn_id='google_cloud_default', delegate_to=None)
Bases: airflow.contrib.hooks.gcp_api_base_hook.GoogleCloudBaseHook
Interact with Google Cloud Storage. This hook uses the Google Cloud Platform connection.
copy(source_bucket, source_object, destination_bucket=None, destination_object=None)
Copies an object from a bucket to another, with renaming if requested.
destination_bucket or destination_object can be omitted, in which case source bucket/object is used, but not both.
create_bucket(bucket_name, storage_class='MULTI_REGIONAL', location='US', project_id=None, labels=None)
Creates a new bucket. Google Cloud Storage uses a flat namespace, so you can’t create a bucket with a name that is already in use.
See also
For more information, see Bucket Naming Guidelines: https://cloud.google.com/storage/docs/bucketnaming.html#requirements
Returns: If successful, it returns the id of the bucket.
delete(bucket, object, generation=None)
Delete an object if versioning is not enabled for the bucket, or if generation parameter is used.
Returns: True if succeeded
download(bucket, object, filename=None)
Get a file from Google Cloud Storage.
exists(bucket, object)
Checks for the existence of a file in Google Cloud Storage.
get_conn()
Returns a Google Cloud Storage service object.
get_crc32c(bucket, object)
Gets the CRC32c checksum of an object in Google Cloud Storage.
get_md5hash(bucket, object)
Gets the MD5 hash of an object in Google Cloud Storage.
get_size(bucket, object)
Gets the size of a file in Google Cloud Storage.
is_updated_after(bucket, object, ts)
Checks if an object is updated in Google Cloud Storage.
list(bucket, versions=None, maxResults=None, prefix=None, delimiter=None)
List all objects from the bucket with the given string prefix in name
Returns: a stream of object names matching the filtering criteria
rewrite(source_bucket, source_object, destination_bucket, destination_object=None)
Has the same functionality as copy, except that it will work on files over 5 TB, as well as when copying between locations and/or storage classes.
destination_object can be omitted, in which case source_object is used.
upload(bucket, object, filename, mime_type='application/octet-stream')
Uploads a local file to Google Cloud Storage.
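A hedged sketch of using the hook from a PythonOperator callable, built only from the methods documented above; the bucket and object names are hypothetical:
from airflow.contrib.hooks.gcs_hook import GoogleCloudStorageHook

def archive_object(**context):
    hook = GoogleCloudStorageHook(google_cloud_storage_conn_id='google_cloud_default')
    # hypothetical bucket and object names
    if hook.exists('my-gcs-bucket', 'incoming/data.csv'):
        hook.copy('my-gcs-bucket', 'incoming/data.csv',
                  destination_object='archive/data.csv')
        hook.delete('my-gcs-bucket', 'incoming/data.csv')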
class airflow.contrib.operators.gcp_container_operator.GKEClusterCreateOperator(project_id, location, body={}, gcp_conn_id='google_cloud_default', api_version='v2', *args, **kwargs)
Bases: airflow.models.BaseOperator
class airflow.contrib.operators.gcp_container_operator.GKEClusterDeleteOperator(project_id, name, location, gcp_conn_id='google_cloud_default', api_version='v2', *args, **kwargs)
Bases: airflow.models.BaseOperator
class airflow.contrib.hooks.gcp_container_hook.GKEClusterHook(project_id, location)
Bases: airflow.hooks.base_hook.BaseHook
create_cluster(cluster, retry=<object object>, timeout=<object object>)
Creates a cluster, consisting of the specified number and type of Google Compute Engine instances.
Returns: The full url to the new, or existing, cluster
Raises:
ParseError: On JSON parsing problems when trying to convert dict
AirflowException: cluster is not dict type nor Cluster proto type
delete_cluster(name, retry=<object object>, timeout=<object object>)
Deletes the cluster, including the Kubernetes endpoint and all worker nodes. Firewalls and routes that were configured during cluster creation are also deleted. Other Google Compute Engine resources that might be in use by the cluster (e.g. load balancer resources) will not be deleted if they weren’t present at the initial create time.
Returns: The full url to the delete operation if successful, else None
get_cluster(name, retry=<object object>, timeout=<object object>)
Gets details of specified cluster.
Parameters:
name (str) – The name of the cluster to retrieve
retry – A retry object used to retry requests. If None is specified, requests will not be retried.
timeout (float) – The amount of time, in seconds, to wait for the request to complete. Note that if retry is specified, the timeout applies to each individual attempt.
Returns: A google.cloud.container_v1.types.Cluster instance
get_operation(operation_name)
Fetches the operation from Google Cloud.
Parameters: operation_name (str) – Name of operation to fetch
Returns: The new, updated operation from Google Cloud
wait_for_operation(operation)
Given an operation, continuously fetches the status from Google Cloud until either completion or an error occurs.
Parameters: operation (google.cloud.container_v1.gapic.enums.Operator) – The Operation to wait for
Returns: A new, updated operation fetched from Google Cloud