Building an Enterprise Big Data Platform: Fundamentals and Practical Development Code

Contributed by 章***明 on 2021-07-08


Contents

Understanding Hadoop HDFS
1 Introduction
2 HDFS Design Principles
2.1 Design Goals
2.2 System Architecture and Fault-Tolerance Design
2.3 Application Types Suited to HDFS
3 HDFS Core Concepts
3.1 Blocks
3.2 Namenode & Datanode
3.3 Block Caching
3.4 HDFS Federation
3.5 HDFS HA (High Availability)
4 The Command-Line Interface
5 Hadoop File Systems
6 The Java Interface
6.1 Reading Data
6.2 Writing Data
6.3 Directory Operations
6.4 Deleting Data
7 Data Flow (Read and Write Paths)
7.1 Reading a File
7.2 Writing a File
7.3 The Consistency Model
7.4 Node Distance in Hadoop
8 Related Operations Tools
8.1 Parallel Copying with distcp
8.2 Balancing an HDFS Cluster
9 HDFS Rack Awareness: Concept, Configuration, and Implementation
(1) What is rack awareness?
(2) Who tells Hadoop about it?
(3) When does rack awareness come into play?
(4) Trade-offs in rack awareness (reliability vs. bandwidth)
(5) How to tell the NameNode which rack each slave machine belongs to: configuration steps
(6) Network topology and distance between machines
10 The HDFS Web UI
YARN
Basic Architecture
How It Works
ResourceManager
Resource Management
Task Scheduling
Internal Structure
The Complete Job-Submission Flow
Resource Schedulers
Speculative Task Execution
The YARN Web UI Explained
Checking Cluster Status
Hadoop 3.0 New Features
Hadoop Common
Hadoop HDFS
Hadoop MapReduce
Hadoop YARN
Summary
Hive Managed Tables vs. External Tables
Understanding the Concepts
Creating Managed Table t1
Loading Data (t1)
Creating External Table t2
Loading Data (t2)
Checking File Locations
Observing the Files in HDFS
Recreating External Table t2
The Official Explanation
Hive Data Warehouse: Zipper, Transaction, Full, and Incremental Tables
Hadoop 3.2.0 Fully Distributed Cluster Setup
(1) Cluster Environment Setup
(2) Modifying the Hadoop Configuration
Editing hadoop-env.sh: JDK path and cluster users
Editing core-site.xml: core Hadoop configuration
Editing hdfs-site.xml: Hadoop node configuration
Editing workers: telling HDFS where the DataNodes are
Editing yarn-site.xml: configuring the YARN services
Editing mapred-site.xml
Distributing the modified configuration to the other three servers
(3) Starting the Hadoop Services
(4) Running WordCount
Installing MySQL on Linux
Installing MySQL with yum
Hive Environment Setup
(1) Installation
(2) Configuration
(3) Running
Apache Mahout Environment Setup
PySpark Environment Setup
Installing Python 3.8 on Linux
1. Installing dependencies
2. Downloading the package
3. Extracting
4. Installing
5. Adding a symlink
6. Testing
Upgrading to Python 3.8 on Linux and configuring pip/yum
Installing Python 3.8 on Linux and removing the old Python
Installing Python 2.7.13 and Python 3.6.2 side by side (default set to 3.6.2)
Installing Apache Spark 3.1.0 on Linux: Detailed Steps
Spark Installation and Configuration
Spark Cluster Installation and Setup
Hadoop 2.2.0 Cluster Setup on Ubuntu 12.04
Installing Hadoop 2.4.0 on Ubuntu 14.04 (Standalone Mode)
Starting a Spark Cluster
Spark Performance Optimization
1. How a Spark Job Runs
2. Tuning Resource Parameters
num-executors
executor-memory
executor-cores
driver-memory
spark.default.parallelism
spark.storage.memoryFraction
spark.shuffle.memoryFraction
3. Example Resource Parameters
4. The Three Join Strategies in Spark
Broadcast Hash Join
Shuffle Hash Join
Sort Merge Join
5. AQE: New in Spark 3.0
6. General Principles of Data Optimization in a Data Warehouse
7. Wide and Narrow Dependencies in Spark
Overview
Detailed Execution Model
8. Spark Operators
9. Spark RDDs
Spark RDD Characteristics
Spark RDD Core Characteristics
Relational Database Performance: Table Splitting (Current/History Tables), Partitioning, and Data Purging
Purpose
Deciding Whether Data Needs Purging: Thresholds
Full-Load Cycle Estimation
Migration Cycle Estimation
Partitioning Schemes by Data Type
History-Table Purging Schemes
Caveats
Implementing Slowly Changing Dimensions in a Data Warehouse
MySQL / Teradata / PySpark Code Conversion Tables
Basic Structure of a PySpark Program
PySpark: Exporting MySQL Data to Parquet Files
PySpark: Exporting Teradata Data to Parquet Files
PySpark: Writing Parquet Files into Hive Tables
PySpark: Reading Hive SQL Query Results into Parquet Files
PySpark: Sampling a DataFrame and Saving to CSV
PySpark: Inserting Data into MySQL
PySpark: Inserting Data into Teradata
PySpark: Iterating over DataFrame Rows
PySpark: Moving Parquet Files and Directories
PySpark: Copying Parquet Files and Directories
PySpark: Deleting Parquet Files and Directories
PySpark: Changing a Hive Table's Storage Path
PySpark: Listing Files under an HDFS Path
PySpark: Showing the Size of a Plain Hive Table (GB)
PySpark: Showing the Size of a Partitioned Hive Table (GB)
PySpark: Showing Sizes of HDFS Subdirectories
PySpark: Calling Sqoop to Import from HDFS into Hive
HiveQL: Creating a Hive Table from Parquet Files
HiveQL: Creating a Hive View from a Hive Table
HiveQL: Formatting Hive Query Output
Hive: Exporting Query Results to CSV
HiveQL: Listing Hive Tables
HiveQL: Listing Hive Databases
Shell: Running an HQL Script with a Date Parameter
HiveQL: Refreshing a View for a Given Day
HiveQL: Changing a Hive Table's Storage Files
Shell: Clearing Data in HDFS
Shell: Inspecting Data in HDFS
Sqoop: Listing MySQL Databases
Sqoop: Importing from MySQL into HDFS
Sqoop: Exporting from HDFS into MySQL
Teradata Supported Data Types
MySQL Supported Data Types
Hive Supported Data Types
The Parquet File Storage Format
Project Structure
Data Model
The Striping and Assembly Algorithm
Parquet File Format
Performance
Project Development
Summary
Apache Airflow Documentation
Principles
Guide
Airflow Installation and Configuration
Standard Installation
Airflow Installation (Docker)
Airflow Configuration (Docker)
Airflow Installation (Windows 10)
Contents
Project
License
Quick Start
Tutorial
Operator Tutorial
UI Screenshots
Concepts
Data Profiling
Command-Line Arguments
Scheduling and Triggers
Plugins
Security
Time Zones
Experimental REST API
Integration
Lineage
FAQ
API Reference
Apache Airflow 2.0 New Features
TaskFlow API (AIP-31): a new way to write DAGs
Full REST API (AIP-32)
Major Scheduler Performance Improvements
Scheduler High Availability (AIP-15)
Task Groups (AIP-34)
A Brand-New User Interface
Smart Sensors to Reduce Sensor Load (AIP-17)
Simplified KubernetesExecutor
Airflow Core and Providers: Airflow Split into 60+ Packages
Security
Configuration
Enterprise Data Framework Architecture Based on Apache Airflow


Understanding Hadoop HDFS

This article walks through the key concepts of HDFS in detail and should help you understand the Hadoop Distributed File System.

1 Introduction

In a modern enterprise environment, a single machine can no longer store the sheer volume of data involved; the data must be spread across many machines and managed as a whole. A file system that manages storage across a cluster of machines is called a distributed file system. As soon as a system involves the network, it inherits all the complications of network programming — for example, the challenge of guaranteeing that data is not lost when a node fails.

The traditional Network File System (NFS), although sometimes described as a distributed file system, has real limitations. In NFS a file lives on a single machine, so no availability guarantee is possible, and when many clients access the NFS server concurrently, the server quickly becomes a performance bottleneck. Moreover, if a client modifies a file in NFS, the change must first be synchronized back to the server before other clients can see it. In that sense NFS is not a typical distributed system, even though the files do sit on a remote (single) server.

(figure: the NFS protocol stack)

The NFS protocol stack is in fact one implementation of VFS, the operating system's abstraction over file systems.

HDFS (Hadoop Distributed File System) is one implementation of Hadoop's abstract file system. Thanks to that abstraction, Hadoop can also integrate with other storage systems such as Amazon S3, and HDFS files can even be manipulated over a web protocol (webhdfs). HDFS spreads data across the machines of a cluster and provides replicas for fault tolerance; clients read and write directly against the machines holding the data, so there is no single point of performance pressure.

If you want to build a complete cluster from scratch, see [Hadoop cluster setup, detailed steps (2.6.0)](http://blog.csdn.net/bingduanlbd/article/details/51892750).
2 HDFS Design Principles

HDFS was designed from the start with a very clear picture of its target scenarios: which types of application it suits and which it does not. That clarity gives fairly explicit design guidance.

2.1 Design Goals

- HDFS is a core Hadoop project and the de-facto storage standard in the big-data world. It is designed for high fault tolerance and for running on large numbers of cheap commodity hardware, which leads to several design assumptions.
- First, hardware failure is assumed to be the norm rather than the exception. The failure probability of a single disk may be very small, but in a server cluster of thousands or even tens of thousands of nodes, disk failures are everyday events. Quickly detecting and locating problems and recovering from failures fast are therefore primary design goals.
- Second, HDFS targets streaming access to large data sets. Big-data workloads routinely involve files of hundreds of gigabytes or even terabytes; compared with low latency, these workloads care far more about batch processing and high throughput.
- Third, HDFS embodies the principle that moving computation is cheaper than moving data, a widely shared belief in the big-data field. This is also why MapReduce, the compute framework, was released together with HDFS, the storage layer.
- Storing very large files: "very large" here means hundreds of MB, GB, or TB. In practice there are already HDFS clusters storing petabytes of data. According to the Hadoop website, Yahoo's Hadoop clusters run roughly 100,000 CPUs across more than 40,000 machine nodes; see the Hadoop website for more of the world's large Hadoop clusters.
- Streaming data access. HDFS is built around the assumption that the most efficient data-processing pattern is write-once, read-many-times: a data set is generated from a source, or copied once, and then analyzed repeatedly. Each analysis typically reads a large part of the data set, if not all of it, so the time to read the whole data set matters more than the latency of reading the first record.
- Running on commodity hardware. Hadoop does not require especially expensive, highly reliable machines; it runs on ordinary commodity machines available from many vendors. Commodity does not mean low-end, but in clusters — especially large ones — the node failure rate is relatively high, and HDFS's goal is to ensure that node failures cause no noticeable interruption for users.
2.2 System Architecture and Fault-Tolerance Design

HDFS is a classic master/slave architecture. The NameNode stores the file system's metadata — how files are split into blocks, file paths, and so on — while the data itself is stored as blocks on the DataNodes. As a whole, HDFS presents clients with the usual file-system namespace operations: opening, closing, renaming, and moving files, and so on.

When HDFS stores a file, it provides a replication mechanism. In the figure, the file part-0 consists of two blocks, block 1 and block 3, and the replication factor is set to 2, so blocks 1 and 3 are each copied to two different DataNode nodes; a larger file would be split into more blocks, each likewise replicated. The core questions are: onto which nodes (DataNodes) should the split-up data be stored — a scheduling-policy problem — and if a node holding a block replica dies, how is data loss avoided? HDFS solves this by introducing rack awareness (Rack Awareness): the figure shows two racks, Rack 1 and Rack 2, each holding several DataNode nodes.
2.3 Application Types Suited to HDFS

Some scenarios are not a good fit for storing data in HDFS, for example:

1) Low-latency data access
Applications that need millisecond-level latency are not suited to HDFS. HDFS is designed for high-throughput data transfer at the cost of latency; HBase is a better fit for low-latency access.

2) Large numbers of small files
File metadata (the directory structure, each file's block list, and the block-to-node mapping) is kept in the NameNode's memory, so the number of files in the whole file system is bounded by the NameNode's memory.
As a rule of thumb, one file, directory, or file block occupies about 150 bytes of metadata memory. With one million files, each occupying one file block, roughly 300 MB of memory is needed. Billion-scale file counts are hard to support on today's commodity machines.
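The rule-of-thumb arithmetic above can be checked in a couple of lines (the 150-bytes-per-object figure is the heuristic quoted above, not a measured constant):

```python
# Rule of thumb: each file, directory, or block costs ~150 bytes of NameNode heap.
BYTES_PER_OBJECT = 150

def namenode_heap_bytes(num_files, blocks_per_file=1):
    # One metadata object per file plus one per block of that file.
    return num_files * (1 + blocks_per_file) * BYTES_PER_OBJECT

# 1,000,000 single-block files -> 300,000,000 bytes, i.e. roughly 300 MB.
print(namenode_heap_bytes(1_000_000))
```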
3) Multiple writers, or arbitrary file modification
HDFS writes data in an append-only fashion: it does not support modifying a file at an arbitrary offset, nor does it support multiple concurrent writers.
3 HDFS Core Concepts

3.1 Blocks

A physical disk has the concept of a block, the disk's unit of operation: reads and writes happen in whole blocks, typically 512 bytes. A file system adds a layer of abstraction over physical blocks; a file-system block is an integer multiple of the physical disk block, usually a few KB. Hadoop ships with tools such as df and fsck that operate at the file-system block level.

An HDFS block is far larger than a single-machine file-system block — 128 MB by default. HDFS files are split into block-sized chunks, and each chunk is stored as an independent unit. Unlike a file-system block, a file smaller than one block does not occupy the whole block, only its actual size: a 1 MB file occupies 1 MB of space in HDFS, not 128 MB.

Why are HDFS blocks so large?
To minimize seek time relative to transfer time — to keep the time spent locating a file small compared with the time spent transferring it. Suppose locating a block takes 10 ms and the disk transfers at 100 MB/s; to keep the seek time at about 1% of the transfer time, a block needs to be roughly 100 MB.
But blocks should not be too large either: in a MapReduce job the number of map (or reduce) tasks is limited by the number of blocks, so if there are fewer blocks than machines in the cluster, the job runs inefficiently.
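The 1%-seek-time argument above can be turned into a tiny formula (the numbers are the ones quoted in the text):

```python
# Block size needed so that seek time is `seek_fraction` of transfer time:
# seek_time / (block_size / rate) == seek_fraction  =>  solve for block_size.
def block_size_bytes(seek_time_s, transfer_rate_bps, seek_fraction=0.01):
    return seek_time_s * transfer_rate_bps / seek_fraction

MB = 1024 * 1024
size = block_size_bytes(0.010, 100 * MB)  # 10 ms seek, 100 MB/s disk
print(size / MB)  # -> 100.0, the ~100 MB figure quoted above
```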
Benefits of the block abstraction:

- Splitting files into blocks means a single file can exceed the capacity of any one disk: a file's blocks are spread across the whole cluster, so in principle a single file can occupy the disks of every machine in the cluster.
- The block abstraction simplifies the storage subsystem: blocks need not concern themselves with permissions, owners, and so on (those are controlled at the file level).
- Blocks serve as the unit of replication in the fault-tolerance and availability mechanism: each block is replicated independently.
3.2 Namenode & Datanode

An HDFS cluster consists of a Namenode and Datanodes in a master/worker pattern: the Namenode builds the namespace and manages file metadata, while the Datanodes store the actual data and serve read and write requests.

Namenode

The Namenode holds the file-system tree and the metadata of all files and directories, persisted in two forms:

- the namespace image
- the edit log

The persisted metadata does not include the block-to-node list; where each file's blocks live in the cluster is rebuilt when the system restarts (from the block information the Datanodes report in).

In HDFS the Namenode is the cluster's single point of failure: when the Namenode is down, the whole file system is unusable. HDFS provides two mechanisms against this:

1) Backing up the persistent metadata
Metadata is written to multiple file systems at once — for example, simultaneously to the local file system and to NFS. The backup writes are synchronous and atomic.

2) The Secondary Namenode
The Secondary node periodically merges the Namenode's namespace image with the edit log, preventing the edit log from growing without bound. Each merge creates a checkpoint, and the Secondary keeps a copy of the merged namespace image, which can be used to recover data if the Namenode crashes completely. The figure shows the Secondary Namenode's web UI.

(figure)

The Secondary Namenode usually runs on a separate machine, because the merge consumes a lot of CPU and memory. Because its data lags behind the Namenode's, some data loss occurs if the Namenode crashes outright; the usual procedure is then to copy the metadata backed up on NFS to the Secondary and run it as the new Namenode.

In an HA (High Availability) setup, a hot standby runs instead: when the active Namenode fails, the standby takes over as the new active Namenode.

Datanode

Datanodes store and retrieve blocks, serve read and write requests from clients as directed by the namenode, and periodically report to the Namenode the blocks they store.
3.3 Block Caching

DataNodes normally read blocks straight from disk, but frequently accessed blocks can be cached in memory. By default a block is cached on only one DataNode, though this can be configured per file.
Job schedulers can exploit the cache to improve performance — for example, by running a MapReduce task on the node where its block is cached.
Users and applications send caching directives to the NameNode (which files to cache, and for how long). Cache pools are the management concept: groups of caches governed by permissions and resources.

3.4 HDFS Federation

As we have seen, NameNode memory constrains the number of files. HDFS Federation provides a way to scale the NameNode horizontally: in federation mode, each NameNode manages part of the namespace — for example, one NameNode manages the files under /user, another the files under /share.

Each NameNode manages a namespace volume, the metadata making up that part of the file system, and also maintains a Block Pool holding the block-to-node mappings and related information. The NameNodes are independent of one another; the failure of one does not affect the files managed by the others.

Clients use mount tables to map file paths to NameNodes. A mount table wraps a group of Namenodes behind an extra layer; it is implemented as a Hadoop file system and accessed via the viewfs protocol.
3.5 HDFS HA (High Availability)

In an HDFS cluster the NameNode remains a single point of failure (SPOF). Writing the metadata to multiple file systems and having the Secondary NameNode take periodic checkpoints helps protect against data loss, but does not improve availability:
the NameNode is the only place holding the file metadata and the file-to-block mapping, and when it is down the whole cluster — including MapReduce jobs — can neither read nor write.

When the NameNode fails, the conventional procedure is to restart a NameNode from a metadata backup. The backup comes from:

- the copy written synchronously to multiple file systems
- the Secondary NameNode's checkpoint files

Starting a new Namenode requires reconfiguring clients and DataNodes with the new NameNode's information, and the restart itself is generally slow: for a cluster of any size, restarting commonly takes tens of minutes or even hours. The restart is slow mainly because:

1) loading the metadata image into memory takes a long time;
2) the edit log must be replayed;
3) the NameNode must collect enough DataNode status reports before it can leave safe mode and serve writes.

Hadoop's HA scheme

An HA HDFS cluster configures two NameNodes, one in the Active state and one in Standby. When the Active NameNode fails, the Standby takes over and continues serving; users perceive no obvious interruption, and the switch typically takes from tens of seconds to a few minutes.

HA involves the following pieces of implementation logic:

1) The active and standby must share the edit-log storage.
The two NameNodes share a single copy of the edit log; on failover, the Standby catches up by replaying it. There are two common choices of shared storage:

- NFS: the traditional network file system
- QJM: the quorum journal manager

QJM was designed specifically for HDFS HA, to provide a highly available edit log. It runs as a group of journal nodes; each edit-log entry must be written to a majority of the journal nodes — typically 3 nodes, tolerating the failure of one — much like ZooKeeper, although QJM is not built on ZK. (HDFS HA does use ZK to elect the active Namenode.) QJM is generally the recommended option.

2) DataNodes must send block reports to both NameNodes at the same time.
The block-to-node mapping is stored in memory, not on disk. For a new NameNode to start quickly after the Active dies, it must not have to wait for fresh Block Reports from the Datanodes, so DataNodes send Block Reports to both NameNodes all along.

3) Clients must be configured for failover (transparently to the user).
The Namenode switchover is invisible to clients; it is handled by the client library. In the client configuration, an HDFS URI is mapped to a logical path backed by both Namenode addresses, and the client keeps trying the Namenode addresses until one succeeds.

4) The Standby replaces the Secondary NameNode.
Without HA, HDFS runs an independent daemon as the Secondary Namenode, taking periodic checkpoints by merging the image file with the edit log; with HA, the Standby takes over this role.

If the primary Namenode fails while the backup Namenode happens to be down (say, stopped for maintenance), the administrator can still bring up the backup Namenode from scratch. This is no worse than the non-HA case, and it is an operational improvement: the whole restart procedure is standardized inside Hadoop, with no complicated manual switchover required.

The NameNode switchover is carried out by a failover controller. There are multiple implementations; the default uses ZooKeeper to guarantee that exactly one Namenode is active. Each Namenode runs a lightweight failover-controller process that monitors the Namenode's liveness with a simple heartbeat mechanism and triggers failover when the Namenode fails. Failover can also be triggered manually by an operator, for example when switching Namenodes during routine maintenance — this is called a graceful failover. A failover not manually triggered is called an ungraceful failover.

In an ungraceful failover there is no way to be sure that the failed (or presumed-failed) node has really stopped running; for instance, the previous Namenode may still be alive after the failover triggers. QJM allows only one Namenode at a time to write the edit log, but the previous Namenode may still be serving stale read requests, so Hadoop "fences" the previous Namenode: revoking its access to the shared edit log, shutting down its network ports, and — with the classic STONITH technique — even powering the previous Namenode's machine off.

As mentioned, in the HA scheme the Namenode switchover is invisible to clients and handled entirely by the client library.
4 The Command-Line Interface

HDFS offers several ways to interact — a Java API, HTTP, and the shell command line among others. Command-line interaction goes through hadoop fs, for example:

```
hadoop fs -copyFromLocal   # copy a local file to HDFS
hadoop fs -mkdir           # create a directory
hadoop fs -ls              # list files
```

File and directory permissions in Hadoop follow a POSIX-like model with three permission types:

- read (r): read a file, or list a directory's contents
- write (w): for a file, write access; for a directory, the right to create or delete files (or directories) in it
- execute (x): files have no meaningful execute permission and it is ignored; for a directory, execute grants access to its contents

Files and directories have three attributes: owner, group, and mode. Owner is the file's owner, group its permission group, and mode consists of the owner's permissions, the permissions of members of the owning group, and the permissions of everyone else. The figure shows a file whose owner root has read and write permission, whose group supergroup has read permission, and which everyone else can read.

(figure)

Whether file permissions are enforced is controlled by the dfs.permissions.enabled property. With the property off there is no security restriction and no authorization check on clients; when it is on, permission checks are performed on file operations — except for the special superuser, the identity of the Namenode process, for which no permission checks are done.

Output of the ls command:

(figure: hadoop fs -ls output)

The result resembles the Unix ls command. The first column is the file mode, with d marking a directory, followed by the 9 bits of the three permission classes. The second column is the file's replication factor, configured through dfs.replication; directories show "-" because they have no replicas. The remaining columns — owner, group, size, modification time, and name — match the Unix ls command.

To inspect the cluster state or browse files and directories, you can visit the HTTP server the Namenode exposes, normally on port 50070 of the namenode machine.
5 Hadoop File Systems

As noted earlier, "file system" in Hadoop is an abstract concept, and HDFS is just one implementation. Hadoop provides the implementations shown in the figure.

(figure: Hadoop file system implementations)

Briefly: Local is the abstraction over the local file system; hdfs is the most common; the two web forms (webhdfs and swebhdfs) expose file-operation interfaces over HTTP; har is Hadoop's archive format — when there are many small files, they can be packed into one archive, effectively reducing the amount of metadata; viewfs is the one we met with HDFS Federation, a layer that hides the Namenodes and other low-level details from clients; ftp, as the name implies, translates file operations into the FTP protocol; s3a is the implementation over Amazon's S3 storage service; azure is the implementation over Microsoft's cloud platform.

Besides the command line introduced earlier, there are in fact many other ways to operate on the file system. Java applications, for example, use org.apache.hadoop.fs.FileSystem; the other forms are all wrappers around FileSystem. Here we introduce the HTTP interaction.

The WebHDFS and SWebHDFS protocols expose the file system over HTTP. Note that this interaction is slower than the native Java client, so it is best avoided for very large files. Over HTTP there are two access modes: direct access, and access through a proxy.

Direct access

The direct-access setup is shown in the diagram:

(figure)

The embedded web servers in the Namenode and Datanodes act as WebHDFS endpoints (enabled by default via dfs.webhdfs.enabled = true). Metadata operations go through the namenode; for file reads and writes, the request goes first to the namenode, which redirects the client to a datanode for the actual data stream.

Through an HDFS proxy

(figure)

The proxy setup is shown in the diagram. The point of a proxy is to enable load balancing, bandwidth limiting, firewall rules, and the like. The proxy exposes WebHDFS over HTTP or HTTPS, corresponding to the webhdfs and swebhdfs URL schemes. The proxy runs as an independent daemon, separate from the namenode and datanodes, started with the httpfs.sh script and listening on port 14000 by default.

Besides direct FileSystem operations, the command line, and HTTP, there are also a C API, NFS, FUSE, and other access methods, not covered here.
6 The Java Interface

In real applications, HDFS data access goes through FileSystem. This section introduces the main HDFS-related interfaces, focusing on the HDFS implementation DistributedFileSystem and related classes.

6.1 Reading Data

You can read data from a URL, or operate on FileSystem directly.

Reading data from a Hadoop URL

java.net.URL provides a uniform abstraction for resource location: you can define your own URL scheme along with a handler class that performs the actual operations, and the hdfs scheme is exactly such an implementation:

```java
InputStream in = null;
try {
    in = new URL("hdfs://master/user/hadoop").openStream();
    // process the stream
} finally {
    IOUtils.closeStream(in);
}
```

Using a custom scheme requires registering a URLStreamHandlerFactory. This can only be done once per JVM — a second call throws an exception — so it is usually done in a static block, as in the screenshot's example.
Reading data with the FileSystem API

1) First obtain a FileSystem instance, usually via the static get factory methods:

```java
public static FileSystem get(Configuration conf) throws IOException
public static FileSystem get(URI uri, Configuration conf) throws IOException
public static FileSystem get(URI uri, Configuration conf, String user) throws IOException
```

For the local file system, use getLocal to obtain the file-system object:

```java
public static LocalFileSystem getLocal(Configuration conf) throws IOException
```

2) Call FileSystem's open method to get an input stream:

```java
public FSDataInputStream open(Path f) throws IOException
public abstract FSDataInputStream open(Path f, int bufferSize) throws IOException
```

By default, open uses a 4 KB buffer; set it yourself if needed.

3) Operate on the FSDataInputStream.
FSDataInputStream is a specialization of java.io.DataInputStream that adds random access and positioned reads:

```java
public class FSDataInputStream extends DataInputStream
    implements Seekable, PositionedReadable,
        ByteBufferReadable, HasFileDescriptor, CanSetDropBehind, CanSetReadahead,
        HasEnhancedByteBufferAccess
```

Random access is defined by the Seekable interface:

```java
public interface Seekable {
    void seek(long pos) throws IOException;
    long getPos() throws IOException;
}
```

seek is an expensive operation; use it sparingly.

Positioned reads are defined by the PositionedReadable interface:

```java
public interface PositionedReadable {
    int read(long position, byte[] buffer, int offset, int length) throws IOException;
    int readFully(long position, byte[] buffer, int offset, int length) throws IOException;
    int readFully(long position, byte[] buffer) throws IOException;
}
```
6.2 Writing Data

Files in HDFS are created with the overloaded forms of FileSystem's create method. create returns an output stream, FSDataOutputStream; you can call its getPos method to query the current file offset, but you cannot seek — HDFS supports appending only.

When creating a file, you can pass a Progressable callback to receive progress information.

The append(Path f) method appends content to an existing file, but not every implementation provides it; the Amazon S3 implementation, for example, has no append.

An example:

```java
String localSrc = args[0];
String dst = args[1];

InputStream in = new BufferedInputStream(new FileInputStream(localSrc));

Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(URI.create(dst), conf);

OutputStream out = fs.create(new Path(dst), new Progressable() {
    public void progress() {
        System.out.print(".");
    }
});

IOUtils.copyBytes(in, out, 4096, true);
```
6.3 Directory Operations

The mkdirs() method automatically creates any missing parent directories.

HDFS metadata is wrapped in the FileStatus class, which includes the length, block size, replication, modification time, owner, permissions, and so on. FileSystem provides getFileStatus to obtain a FileStatus, and exists() to test whether a file or directory exists.

To list files, listStatus returns information about the files in a directory (or about a single file):

```java
public abstract FileStatus[] listStatus(Path f) throws FileNotFoundException, IOException
```

When the Path is a file, it returns an array of length 1. FileUtil provides the stat2Paths method to convert FileStatus objects to Path objects.

globStatus matches file paths against a wildcard pattern:

```java
public FileStatus[] globStatus(Path pathPattern) throws IOException
```

PathFilter filters by file name — or, since you implement the test, by any file attribute — similar to java.io.FileFilter. It can, for example, exclude files matching a given regular expression:

```java
public interface PathFilter {
    boolean accept(Path path);
}
```
6.4 Deleting Data

Use FileSystem's delete() method:

```java
public boolean delete(Path f, boolean recursive) throws IOException
```

The recursive parameter is ignored when f is a file. When f is a directory, recursive = true deletes the whole directory; otherwise an exception is thrown.
7 Data Flow (Read and Write Paths)

Next we look in detail at HDFS's data flows for reading and writing, and at the related consistency model.

7.1 Reading a File

The rough read flow:

(figure: HDFS read path)

1) The client passes the file Path to FileSystem's open method.
2) The DistributedFileSystem uses RPC to ask the Namenode for the datanode addresses of the file's first blocks. The Namenode decides which nodes to return (those holding a replica of the block) based on the network topology, nearest first; if the client is itself a Datanode node that happens to hold a replica of the block, it reads directly from itself.
3) The client reads data through the FSDataInputStream object returned by open (by calling its read method).
4) The DFSInputStream (the implementation class inside FSDataInputStream) connects to the nearest node holding the first block and reads the data by repeatedly calling read.
5) When the first block is finished, it looks up the best datanode for the next block and keeps reading; when necessary, the DFSInputStream contacts the Namenode for the next batch of block-location information (kept in memory, not persisted). The addressing process is invisible to the client.
6) When all the data has been read, the client calls close to release the stream.

If communication with a Datanode fails during a read, the DFSInputStream tries reading from the next-best node — and remembers the failed node, so subsequent block reads skip it.
Each block read is verified with a checksum; if a block is corrupt, the stream tries reading a replica from another node and reports the corrupt block to the Namenode.

A key design point is that clients fetch data directly from the datanodes, under the namenode's guidance. This supports large numbers of concurrent client requests, with data traffic spread evenly across the whole cluster, while the namenode only serves block-location requests — answered from memory and therefore very efficient — so it does not become a bottleneck.
7.2 Writing a File

(figure: HDFS write path)

Step by step:

1) The client calls DistributedFileSystem's create method.
2) DistributedFileSystem makes a remote RPC call to the Namenode to create the new file in the namespace; at this point the file has no blocks associated with it. During this, the Namenode performs checks — for example, whether a file of the same name already exists and whether the client has permission. If validation passes, an FSDataOutputStream object is returned to the client; otherwise an exception is thrown back.
3) As the client writes data, DFSOutputStream breaks it into packets and appends them to a data queue, which is consumed by the DataStreamer.
4) The DataStreamer asks the Namenode to allocate a new block and the nodes that will store it; those nodes form a pipeline. The DataStreamer writes each packet to the first node of the pipeline, which stores the packet and forwards it to the second node, which stores and forwards it to the third, and so on.
5) DFSOutputStream also maintains an ack queue of packets awaiting datanode acknowledgment; a packet is removed from the ack queue only when every datanode in the pipeline has acknowledged it.
6) When writing finishes, the client closes the output stream, flushing the remaining packets into the pipeline, then waits for the datanode acknowledgments; once all are confirmed, it tells the Namenode the file is complete. The Namenode already knows which blocks the file consists of (the DataStreamer asked it to allocate them), so it only needs to wait until the minimum replication requirement is met before returning success to the client.
How does the Namenode decide which Datanodes hold the replicas?

HDFS's replica placement is a trade-off between reliability, write bandwidth, and read bandwidth. The default policy:

- The first replica goes on the same machine as the client; if the client machine is outside the cluster, a machine is chosen at random (avoiding nodes that are too full or currently too busy).
- The second replica goes on a different rack from the first.
- The third replica goes on the same rack as the second, chosen at random among the nodes meeting the conditions.
- Any further replicas are chosen at random across the whole cluster, while avoiding putting too many replicas on one rack.

Once the replica locations are determined, the write pipeline is built with the network topology in mind. Under the policy above, it looks like the figure:

(figure)

This placement nicely balances reliability with read and write performance:

- reliability: the block is spread across two racks;
- write bandwidth: the write pipeline crosses a switch only once;
- read bandwidth: reads can choose the closer of the two racks.
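The placement policy above can be sketched in a few lines of Python. This is a toy illustration, not Hadoop's actual implementation: the rack map and node names are invented for the example.

```python
import random

# Toy sketch of the default replica-placement policy described above.
def place_replicas(racks, client_node=None):
    """racks: {rack: [node, ...]} -> [replica1, replica2, replica3]"""
    node_rack = {n: r for r, nodes in racks.items() for n in nodes}
    # Replica 1: the client's own node if it is in the cluster, else a random node.
    first = client_node if client_node in node_rack else random.choice(list(node_rack))
    # Replica 2: a random node on a different ("off") rack.
    other_racks = [r for r in racks if r != node_rack[first]]
    second = random.choice(racks[random.choice(other_racks)])
    # Replica 3: another node on the same rack as replica 2.
    third = random.choice([n for n in racks[node_rack[second]] if n != second])
    return [first, second, third]

racks = {"rack1": ["h1", "h2", "h3"], "rack2": ["h4", "h5", "h6"]}
print(place_replicas(racks, client_node="h1"))  # e.g. ['h1', 'h5', 'h4']
```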
7.3 The Consistency Model

The consistency model describes the visibility of read and write operations in the file system. In HDFS, a file is visible in the namespace as soon as it is created:

```java
Path p = new Path("p");
fs.create(p);
assertThat(fs.exists(p), is(true));
```

But written content is not guaranteed to be visible, even after the stream has been flushed:

```java
Path p = new Path("p");
OutputStream out = fs.create(p);
out.write("content".getBytes("UTF-8"));
out.flush();
assertThat(fs.getFileStatus(p).getLen(), is(0L));  // still 0 even though flush() was called
```

To force the data out to the Datanodes, use FSDataOutputStream's hflush method. After hflush, HDFS guarantees that all data written up to that point has reached every datanode in the pipeline:

```java
Path p = new Path("p");
FSDataOutputStream out = fs.create(p);
out.write("content".getBytes("UTF-8"));
out.hflush();
assertThat(fs.getFileStatus(p).getLen(), is((long) "content".length()));
```

Closing the stream internally calls hflush. Note that hflush only guarantees the data has reached datanode memory, not that it has been written to disk — a machine losing power at the wrong moment can still lose data. For a disk guarantee, use the hsync method; hsync is analogous to the fsync() system call, which commits a file descriptor's buffered data:

```java
FileOutputStream out = new FileOutputStream(localFile);
out.write("content".getBytes("UTF-8"));
out.flush();
out.getFD().sync();
assertThat(localFile.length(), is((long) "content".length()));
```

Both hflush and hsync reduce throughput, so application design must weigh throughput against data robustness.

Additionally, while a file is being written, the block currently being written is not visible to readers.
7.4 Node Distance in Hadoop

During reads and writes, the namenode considers the distance between nodes when allocating Datanodes. HDFS does not use latency as the distance;
bandwidth would be the ideal measure, but in practice it is hard to measure the bandwidth between two machines accurately.
Instead, Hadoop organizes the topology of the cluster's machines as a tree and uses the number of hops to the closest common ancestor as the distance — effectively a distance matrix. The following examples illustrate the computation:

- distance 0: the same node (same data center, same rack, same node)
- distance 2: different nodes on the same rack in the same data center
- distance 4: nodes on different racks in the same data center
- distance 6: nodes in different data centers
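The hop-counting rule above can be implemented directly on path-style locations ("/data-center/rack/node"), mirroring the four cases listed:

```python
# Distance = total hops from each node up to their closest common ancestor.
def distance(a, b):
    pa, pb = a.strip("/").split("/"), b.strip("/").split("/")
    common = 0
    for x, y in zip(pa, pb):
        if x != y:
            break
        common += 1
    return (len(pa) - common) + (len(pb) - common)

print(distance("/d1/r1/h1", "/d1/r1/h1"))  # 0  same node
print(distance("/d1/r1/h1", "/d1/r1/h2"))  # 2  same rack
print(distance("/d1/r1/h1", "/d1/r2/h4"))  # 4  same data center, different rack
print(distance("/d1/r1/h1", "/d2/r3/h7"))  # 6  different data centers
```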

Hadoop's cluster topology must be configured manually; without configuration, Hadoop defaults to a flat topology in which all nodes sit on the same rack in the same data center.
8 Related Operations Tools

8.1 Parallel Copying with distcp

So far we have focused on single-threaded access. To process files in parallel you would otherwise have to write your own application; instead, Hadoop provides the distcp tool for importing data into and exporting data out of Hadoop in parallel. Examples:

```
hadoop distcp file1 file2          # an efficient replacement for hadoop fs -cp
hadoop distcp dir1 dir2
hadoop distcp -update dir1 dir2    # -update: only sync changed files, leave the rest unchanged
```

Under the hood, distcp is a MapReduce job implemented with map tasks only, no reducers; the maps copy files in parallel. distcp distributes the files evenly across the maps, and the number of maps is set with the -m parameter.

```
hadoop distcp -update -delete -p hdfs://master1:9000/foo hdfs://master2/foo
```

This form is commonly used to copy data between two clusters: -update synchronizes only changed data, -delete removes files in the target directory that no longer exist in the source, and -p preserves attributes such as permissions and block replication counts.

If the two clusters run incompatible Hadoop versions, copy over the webhdfs protocol:

```
hadoop distcp webhdfs://namenode1:50070/foo webhdfs://namenode2:50070/foo
```

8.2 Balancing an HDFS Cluster

With distcp, if you set the map count to 1, not only is the copy slow, but the first replica of every block lands on the single node running that one map — until its disk overflows, while the rest stay underused. (distcp's default map count is 20.)

HDFS works best when blocks are spread evenly across nodes. When you cannot keep the cluster balanced from within a job — for example, you limited the map count so that some nodes sit idle — use the balancer tool to redistribute the cluster's blocks.
9 HDFS Rack Awareness: Concept, Configuration, and Implementation

(1) What is rack awareness?

Telling Hadoop which rack each machine in the cluster belongs to.

(2) Who tells Hadoop?

Hadoop's rack awareness is not self-adapting: the cluster cannot sense by itself which rack a slave machine sits in. The Hadoop administrator must tell Hadoop which rack each machine belongs to. When the NameNode starts and initializes, the machine-to-rack mapping is saved in memory, and it is then used whenever HDFS write operations allocate a datanode list (for example, three replicas of a block going to three datanodes): the datanode-selection policy tries to spread the three replicas across different racks.

(3) When does rack awareness come into play?

When the Hadoop cluster has grown to a certain scale.

(4) Trade-offs rack awareness must weigh (reliability vs. bandwidth)

1. Communication between nodes should happen within a rack as much as possible, rather than across racks.
2. To improve fault tolerance, the NameNode places block replicas on more than one rack.

(5) How do you tell the Hadoop NameNode which rack each slave belongs to? Configuration steps

1. By default, rack awareness is not enabled, and the Hadoop cluster picks machines at random when HDFS selects datanodes. That is, when writing data, Hadoop writes the first block, block1, to rack1, then picks at random and writes block2 to rack2, generating cross-rack traffic between the two racks; next, still picking at random, it may write block3 back onto rack1, generating another round of cross-rack traffic. When a job processes a large volume of data, or a large volume of data is being pushed into Hadoop, this situation multiplies the network traffic between racks, becoming a performance bottleneck and in turn affecting job performance and even the service of the whole cluster.

Enabling rack awareness is very simple: in the hadoop-site.xml configuration file on the NameNode machine, set the option:

topology.script.file.name = /path/to/RackAware.py

The value of this option points to an executable program, usually a script, that accepts arguments and produces outputs. The arguments are typically DataNode machines' IP addresses, and the output is typically the rack corresponding to each IP address, e.g. /rack1. When the NameNode starts, it checks whether the option is non-empty; if so, rack awareness is considered configured, and the NameNode looks up the script as configured. Whenever the NameNode receives a DataNode's heartbeat, it passes that DataNode's IP address to the script as an argument, and stores the output as the DataNode's rack in an in-memory map.

To write the script you need a clear picture of the real network topology and rack layout, so that the script can correctly map each machine's IP address to its rack.
A simple implementation:

```python
#!/usr/bin/python
# -*- coding: UTF-8 -*-
import sys

rack = {
    "hadoop-node-176.tj": "rack1",
    "hadoop-node-178.tj": "rack1",
    "hadoop-node-179.tj": "rack1",
    "hadoop-node-180.tj": "rack1",
    "hadoop-node-186.tj": "rack2",
    "hadoop-node-187.tj": "rack2",
    "hadoop-node-188.tj": "rack2",
    "hadoop-node-190.tj": "rack2",
    "192.168.1.15": "rack1",
    "192.168.1.17": "rack1",
    "192.168.1.18": "rack1",
    "192.168.1.19": "rack1",
    "192.168.1.25": "rack2",
    "192.168.1.26": "rack2",
    "192.168.1.27": "rack2",
    "192.168.1.29": "rack2",
}

if __name__ == "__main__":
    print("/" + rack.get(sys.argv[1], "rack0"))
```

There is no definitive documentation on whether the hostname or the IP address is passed to the script, so the script should handle both. If your machine-room architecture is more complex, the script can return strings like /dc1/rack1.

Make it executable: chmod +x RackAware.py

Restart the NameNode. If the configuration works, the NameNode startup log will contain lines like:

INFO org.apache.hadoop.net.NetworkTopology: Adding a new node: /rack1/192.168.1.15:50010
(6) Network topology and distance between machines

Building on the topology case above, here is how the distance between any two machines of a Hadoop cluster is computed in a more complex network topology.

(figure: datanode network topology)

With rack awareness, the NameNode can draw the datanode network topology shown in the figure. D1 and R1 are switches; at the bottom level are datanodes such as H1.
H1's rackid is /D1/R1/H1: H1's parent is R1, whose parent is D1. The rackids come from topology.script.file.name. With the rackids of any two DataNodes, the distance between them works out as:

1. distance(/D1/R1/H1, /D1/R1/H1) = 0  (the same DataNode)
2. distance(/D1/R1/H1, /D1/R1/H2) = 2  (different DataNodes on the same rack)
3. distance(/D1/R1/H1, /D1/R2/H4) = 4  (different racks in the same IDC — internet data center, i.e. machine room)
4. distance(/D1/R1/H1, /D2/R3/H7) = 6  (DataNodes in different IDCs)
10 The HDFS Web UI

To inspect HDFS here on hadoop001, open a browser and go to the HDFS web UI: http://192.168.216.128:50070

(figure)

Click DataNodes.

(figure)
YARN

Yarn is a resource-scheduling platform: it supplies server compute resources to programs, acting like a distributed operating system, while programs such as MapReduce are the applications running on top of that operating system.

Basic Architecture

YARN consists of the ResourceManager, NodeManager, ApplicationMaster, and Container components.

(figure: YARN architecture)
How It Works

1) Runtime mechanism

(figure)

2) The mechanism in detail

(0) A MapReduce program is submitted on a client node.
(1) The YarnRunner asks the ResourceManager for an Application.
(2) The ResourceManager returns the application's resource path to the YarnRunner.
(3) The program submits the resources it needs to run to HDFS.
(4) Once the program's resources are submitted, it requests to run a MapReduce ApplicationMaster.
(5) The RM turns the user request into a Task.
(6) One of the NodeManagers picks up the Task.
(7) That NodeManager creates a Container and spawns the MRAppMaster.
(8) The Container copies the resources down from HDFS.
(9) The MRAppMaster asks the ResourceManager for resources to run the Map Tasks.
(10) The ResourceManager assigns the Map Tasks to two other NodeManagers, which each pick up a task and create containers.
(11) MR sends the program startup scripts to the two NodeManagers that received tasks; the two NodeManagers launch the Map Tasks, which partition and sort the data.
(12) The MRAppMaster waits for all Map Tasks to finish, then asks the RM for containers to run the Reduce Tasks.
(13) The Reduce Tasks fetch their corresponding partitions from the Map Tasks.
(14) When the program finishes, MR asks the ResourceManager to unregister it.
ResourceManager

Responsible for global resource management and task scheduling, treating the whole cluster as one pool of compute resources. It cares only about allocation and is not responsible for application fault tolerance.

Resource Management

1. Previously, a node's resources were split into Map slots and Reduce slots; now there are Containers, and a Container runs an ApplicationMaster, a MapReduce task, or any other program as needed.
2. Previously resource allocation was static; now it is dynamic, and utilization is higher.
3. The Container is the unit of resource allocation. A request takes the form resource-name (a hostname, a rack name, or * for any machine) plus resource-requirement (currently CPU and memory are supported).
4. The user submits a job to the ResourceManager, which then has some NodeManager allocate a Container to run the ApplicationMaster; the ApplicationMaster in turn requests resources from the ResourceManager according to its own program's needs.
5. YARN has a Container life-cycle management mechanism; the protocol between the ApplicationMaster and its Containers is defined by the application.

Task Scheduling

1. Cares only about resource usage, allocating resources reasonably against demand.
2. The Scheduler can allocate specific resources on specific machines per request. (When requesting resources, the ApplicationMaster can take data locality into account; the ResourceManager tries to satisfy the request by allocating Containers on the specified machines, reducing data movement.)

Internal Structure

(figure)

- Client Service: application submission and termination, and output of information (application, queue, and cluster state, etc.)
- Administration Service: management of queues, nodes, and client permissions
- ApplicationMasterService: registers and terminates ApplicationMasters; receives ApplicationMasters' resource requests and cancellations and passes them asynchronously to the Scheduler; single-threaded
- ApplicationMaster Liveliness Monitor: receives ApplicationMaster heartbeat messages; if an ApplicationMaster sends no heartbeat within the configured interval, its task is considered failed and its resources are reclaimed, after which the ResourceManager reassigns an ApplicationMaster to run the application (2 attempts by default)
- Resource Tracker Service: registers nodes and receives heartbeat messages from registered nodes
- NodeManagers Liveliness Monitor: monitors node heartbeats; a node from which no heartbeat arrives for a long time is considered dead, all Containers on it are marked invalid, and no further tasks are scheduled to that node
- ApplicationManager: manages applications, keeping records of completed ones
- ApplicationMaster Launcher: after an application is submitted, negotiates with a NodeManager to allocate a Container and launch the ApplicationMaster; also responsible for terminating and destroying it
- YarnScheduler: resource scheduling and allocation, offering FIFO (with priority), Fair, and Capacity policies
- ContainerAllocationExpirer: manages allocated Containers that were never launched, reclaiming them after a configured time
The Complete Job-Submission Flow

1) Job submission as seen by YARN

(figure)

The full flow in detail:

(1) Job submission
Step 0: the client calls job.waitForCompletion to submit the MapReduce job to the cluster.
Step 1: the client requests a job id from the RM.
Step 2: the RM returns the job's resource-submission path and job id to the client.
Step 3: the client uploads the jar, the split information, and the configuration files to the given resource-submission path.
Step 4: after uploading the resources, the client asks the RM to run an MrAppMaster.
(2) Job initialization
Step 5: on receiving the client's request, the RM adds the job to the capacity scheduler.
Step 6: an idle NM picks up the job.
Step 7: the NM creates a Container and spawns the MRAppmaster.
Step 8: the resources the client uploaded are downloaded.
(3) Task assignment
Step 9: the MrAppMaster asks the RM for resources to run the maptasks.
Step 10: the RM assigns the maptasks to two other NodeManagers, which each pick up a task and create containers.
(4) Task execution
Step 11: MR sends the program startup scripts to the two NodeManagers that received tasks; the two NodeManagers launch the maptasks, which partition and sort the data.
Step 12: the MrAppMaster waits for all maptasks to finish, then asks the RM for containers to run the reduce tasks.
Step 13: the reduce tasks fetch their corresponding partitions from the maptasks.
Step 14: when the program finishes, MR asks the RM to unregister it.
(5) Progress and status updates
In YARN, task progress and status (including counters) flow back to the application manager; the client polls the application manager for progress updates every second (set via mapreduce.client.progressmonitor.pollinterval) and shows them to the user.
(6) Job completion
Besides polling the application manager for progress, the client checks every 5 minutes whether the job has completed by calling waitForCompletion() (the interval is set via mapreduce.client.completion.pollinterval). When the job completes, the application manager and the containers clean up their working state, and the job information is stored by the job history server for later user inspection.

2) Job submission as seen by MapReduce

(figure)
2)作业提交程MapReduce

资源调度器
目前Hadoop作业调度器三种:FIFOCapacity SchedulerFair SchedulerHadoop272默认资源调度器Capacity Scheduler
具体设置详见:yarndefaultxml文件

    The class to use as the resource scheduler
    yarnresourcemanagerschedulerclass
orgapachehadoopyarnserverresourcemanagerschedulercapacityCapacityScheduler

1)先进先出调度器(FIFO)

2)容量调度器(Capacity Scheduler)

3)公调度器(Fair Scheduler)

Speculative Task Execution

1) A job's completion time is determined by its slowest task.
A job consists of a number of Map tasks and Reduce tasks, and hardware aging, software bugs, and so on can make an individual task run very slowly.
Typical case: 99% of the system's Map tasks are done, but the last few Maps keep crawling along. What then?

2) The speculative-execution mechanism:
Detect a straggler — a task running far slower than the average task speed — start a backup task for it, run both at once, and take the result of whichever finishes first.

3) Preconditions for launching a speculative task:
(1) a task can have only one backup task;
(2) at least 5% (0.05) of the current job's tasks must already be complete;
(3) the feature is controlled by parameters in mapred-site.xml (enabled by default since Hadoop 2.7.2):

```xml
<property>
  <name>mapreduce.map.speculative</name>
  <value>true</value>
  <description>If true, then multiple instances of some map tasks
               may be executed in parallel.</description>
</property>

<property>
  <name>mapreduce.reduce.speculative</name>
  <value>true</value>
  <description>If true, then multiple instances of some reduce tasks
               may be executed in parallel.</description>
</property>
```

4) Cases where speculative execution cannot be enabled:
(1) tasks with severe load skew between them;
(2) special tasks, such as tasks that write to a database.

5) Algorithm principle

(figure)
The YARN Web UI Explained

Once Yarn is installed, its web UI is reachable in a browser at http://master:8088, as shown:

(figure)

Below is a detailed explanation of the resource-related information on the two screens marked 1 (Cluster) and 2 (Nodes) in the figure.

(figure)

The seven fields are:

1. Active Nodes: the number of nodes Yarn manages — in effect the number of NodeManagers; this cluster has 2 NodeManagers.
2. Memory Total: the total memory Yarn manages, equal to the sum of the memory each NodeManager manages, which is configured in yarn-site.xml:

```xml
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>1630</value>
  <!-- memory managed by each NodeManager -->
</property>
```

With each NodeManager managing 1630 MB, the total memory the Yarn cluster manages is 1630 MB * 2 = 3260 MB, roughly the 3.18 GB shown as Memory Total.

3. VCores Total: the total number of virtual CPU cores Yarn manages, equal to the sum across NodeManagers, configured in yarn-site.xml:

```xml
<property>
  <name>yarn.nodemanager.resource.cpu-vcores</name>
  <value>2</value>
  <!-- virtual cores managed by each NodeManager -->
</property>
```

With 2 virtual cores per NodeManager, the total for the Yarn cluster is 2 * 2 = 4, the VCores Total shown.
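The two cluster totals shown on the web UI follow directly from the per-NodeManager settings above (this example cluster has 2 NodeManagers):

```python
# Cluster totals = per-NodeManager setting * number of NodeManagers.
nodemanagers = 2
memory_mb = 1630   # yarn.nodemanager.resource.memory-mb
vcores = 2         # yarn.nodemanager.resource.cpu-vcores

memory_total_mb = nodemanagers * memory_mb
print(memory_total_mb, round(memory_total_mb / 1024, 2))  # 3260 3.18 (the ~3.18 GB shown)
print(nodemanagers * vcores)                              # 4 (VCores Total)
```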
4. Scheduler Type: the resource-allocation type — one of the three schedulers described in the Yarn installation article.
5. Minimum Allocation: the smallest resource grant — the least Yarn will allocate when a task asks for resources, for both memory and cores, controlled by yarn.scheduler.minimum-allocation-mb (default 1024 MB) and yarn.scheduler.minimum-allocation-vcores (default 1).
6. Maximum Allocation: the largest resource grant — the most Yarn will allocate to a single task, for both memory and cores, controlled by yarn.scheduler.maximum-allocation-mb (default 8192 MB) and yarn.scheduler.maximum-allocation-vcores (default 32); both are, of course, capped by the resources the cluster manages.

(figure)

Below that are the states of the two NodeManagers Yarn manages:

1. Rack: the rack of the NodeManager's machine.
2. Node State: the NodeManager's state.
3. Mem Used: memory already used on the NodeManager; Mem Avail: memory remaining; VCores Used: VCores already used; VCores Avail: VCores remaining.

Click a Node Address:

(figure)

to enter this screen:

(figure)

It shows the details of the slave2 NodeManager. "Total Vmem allocated for Containers" is the virtual memory this NodeManager manages, set via the ratio configured in yarn-site.xml:

```xml
<property>
  <name>yarn.nodemanager.vmem-pmem-ratio</name>
  <value>4.1</value>
  <!-- ratio of virtual memory to physical memory per NodeManager -->
</property>
```

With yarn.nodemanager.vmem-pmem-ratio = 4.1, virtual memory is 4.1 times physical memory: 1630 MB * 4.1 = 6683 MB, roughly the 6.53 GB shown.
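The virtual-memory figure is just the ratio applied to the physical memory configured earlier:

```python
# Virtual memory per NodeManager = physical memory * vmem-pmem ratio.
memory_mb = 1630        # yarn.nodemanager.resource.memory-mb
vmem_pmem_ratio = 4.1   # yarn.nodemanager.vmem-pmem-ratio

vmem_mb = memory_mb * vmem_pmem_ratio
print(round(vmem_mb), round(vmem_mb / 1024, 2))  # 6683 6.53 (the ~6.53 GB shown)
```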
Checking Cluster Status

(figure)

Note: in general, when queue resources are not over-committed, Steady Fair Share mem = Min Resources mem.

This does not always hold; in those cases Steady Fair Share mem ≠ Min Resources mem.

We have not yet encountered the case of a single queue exceeding its max resource configuration, i.e. Steady Fair Share mem > Min Resources mem.

Hadoop 3.0 New Features

Hadoop 3.0 makes several major improvements to the Hadoop core in both functionality and performance, including:

Hadoop Common

(1) A slimmed-down Hadoop core: deprecated APIs and implementations are removed, and some default component implementations are replaced with more efficient ones (for example, the default FileOutputCommitter is replaced by the v2 version, the deprecated hftp is replaced by webhdfs, and the Hadoop serialization library org.apache.hadoop.Records is removed from the subproject).
(2) Classpath isolation: prevents conflicts between different versions of jar packages — Google Guava, for instance, easily conflicts when Hadoop, HBase, and Spark are mixed (https://issues.apache.org/jira/browse/HADOOP-11656).
(3) Shell script rewrite: Hadoop 3.0 reworks the Hadoop management scripts, fixing a large number of bugs and adding new features such as dynamic-command support (https://issues.apache.org/jira/browse/HADOOP-9902).

Hadoop HDFS

(1) HDFS supports erasure coding, saving about half the storage space without sacrificing reliability (https://issues.apache.org/jira/browse/HDFS-7285).
(2) Multiple standby NameNodes: the cluster can be deployed with one active and multiple standby namenodes. (Note: the equivalent ResourceManager feature was already supported in Hadoop 2.0.) (https://issues.apache.org/jira/browse/HDFS-6440)

Hadoop MapReduce

(1) Task-level native optimization: MapReduce gains a C/C++ implementation of the map output collector (covering Spill, Sort, IFile, etc.); a job-level parameter switches to it, improving shuffle-intensive applications by roughly 30% (https://issues.apache.org/jira/browse/MAPREDUCE-2841).
(2) Automatic inference of MapReduce memory parameters. In Hadoop 2.0, setting a job's memory parameters was tedious, involving two settings — mapreduce.{map,reduce}.memory.mb and mapreduce.{map,reduce}.java.opts — and an unreasonable combination wasted memory badly: for example, with the former at 4096 MB and the latter at -Xmx2g, the remaining 2 GB effectively could not be used by the Java heap (https://issues.apache.org/jira/browse/MAPREDUCE-5785).

Hadoop YARN

(1) cgroup-based memory and disk I/O isolation (https://issues.apache.org/jira/browse/YARN-2619)
(2) Curator-based RM leader election (https://issues.apache.org/jira/browse/YARN-4438)
(3) Container resizing (https://issues.apache.org/jira/browse/YARN-1197)
(4) Timeline Server next generation (https://issues.apache.org/jira/browse/YARN-2928)
New in Hadoop 3.0, summarized:

HADOOP
- Move to JDK8+
- Classpath isolation on by default (HADOOP-11656)
- Shell script rewrite (HADOOP-9902)
- Move default ports out of ephemeral range (HDFS-9427)

HDFS
- Removal of hftp in favor of webhdfs (HDFS-5570)
- Support for more than two standby NameNodes (HDFS-6440)
- Support for Erasure Codes in HDFS (HDFS-7285)

YARN

MAPREDUCE
- Derive heap size or mapreduce.*.memory.mb automatically (MAPREDUCE-5785)
Erasure Coding, the new feature implemented in HDFS-7285, was far from release at the time this analysis was written; the discussion of the relevant code below is therefore a so-called pre-analysis, intended to give an early look at what the Hadoop community has currently implemented. I had not touched Erasure Coding before, and the process held some pleasant surprises along the way; I hope this article is similarly rewarding to read.

Erasure coding (EC for short) is a data-protection technique. It originated in the communications industry as a coding-based fault-tolerance technique for recovering data during transmission: new parity data is added to the original data, tying the pieces together so that data errors within a certain range can be recovered through the erasure code. A simple illustration with the figures: start from n blocks of original data, then add m parity blocks, as shown.

The parity blocks and the row of data blocks they protect form a stripe; each stripe consists of n data blocks and m parity blocks. Both the original data blocks and the parity blocks can be recovered from the surviving blocks:

- if parity blocks are lost, they are regenerated by re-encoding the original data blocks;
- if original data blocks are lost, they are regenerated by decoding the parity blocks.

The values of m and n are not fixed and can be adjusted as appropriate. You may wonder what the underlying principle is. It is actually simple: the layout in the figure forms a matrix, and matrix operations are invertible, which is what makes data recovery possible. The figure shows a standard matrix picture; compare the two figures to see how they relate.
Summary

The Hadoop 3.0 alpha was expected in the summer of 2016, with the GA release in November or December 2016.

Hadoop 3.0 introduces major features and optimizations including HDFS erasure coding, multiple standby Namenodes, MapReduce native task optimization, YARN's cgroup-based memory and disk I/O isolation, and YARN container resizing.
Compared with the previous production release line, Hadoop 2, Apache Hadoop 3 integrates many important enhancements. Hadoop 3 is a usable release offering stability and high-quality APIs for real product development. Below is a brief overview of what changes in Hadoop 3.

- Minimum Java version raised from Java 7 to Java 8
All Hadoop jars are compiled against a Java 8 runtime. Users still on Java 7 or lower must upgrade to Java 8.

- HDFS supports erasure coding
Erasure coding is a method of durably storing data that is more space-efficient than replication. Standard encodings such as Reed-Solomon(10,4) incur only a 1.4x space overhead, versus the 3x overhead of standard HDFS replication. Since erasure coding imposes extra overhead during reconstruction and mostly performs remote reads, it has traditionally been used for storing colder, less frequently accessed data. Users should also consider the network and CPU overhead when deploying this feature.
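The overhead comparison quoted above is simple to verify:

```python
# Space overhead of 3x replication vs Reed-Solomon(10,4) erasure coding.
def replication_overhead(replicas=3):
    return float(replicas)  # every byte is stored `replicas` times

def erasure_overhead(data_blocks=10, parity_blocks=4):
    # n data blocks + m parity blocks stored per n blocks of user data.
    return (data_blocks + parity_blocks) / data_blocks

print(replication_overhead())  # 3.0 -> 3x raw space
print(erasure_overhead())      # 1.4 -> the 1.4x overhead quoted above
```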
- YARN Timeline Service v2
YARN Timeline Service v2 addresses two challenges: improving the scalability and reliability of the timeline service, and enhancing usability by introducing flows and aggregation. It is offered as alpha 2 so that users and developers can test it and provide feedback and suggestions; it should be used only in test containers, not as a replacement for Timeline Service v1.x.

- Shell script rewrite
The Hadoop shell scripts have been rewritten, fixing many long-standing bugs and adding new features.

- Shaded client jars
In the 2.x line, the hadoop-client Maven artifact pulls Hadoop's transitive dependencies into the Hadoop application's environment, which causes problems when those dependency versions conflict with the versions the application itself uses.
HADOOP-11804 adds the new hadoop-client-api and hadoop-client-runtime artifacts, which shade Hadoop's dependencies into a single jar, preventing them from leaking onto the application's classpath.

- Support for Opportunistic Containers and Distributed Scheduling
The concept of an ExecutionType is introduced, so that an application can request containers with the Opportunistic execution type: such containers can be dispatched to a NodeManager to run even when no resources are available at scheduling time; the container waits in the NM's queue until resources free up. Opportunistic containers have lower priority than the default Guaranteed containers, and their resources are preempted when needed for Guaranteed containers, thereby improving cluster utilization.
Opportunistic containers are allocated by the central RM by default, but support has been added for a distributed scheduler, implemented as an interceptor of the AMRMProtocol.

- MapReduce task-level optimization
MapReduce adds a native implementation of the map output collector; for shuffle-intensive jobs this brings a performance improvement of 30% or more.

- Support for more than 2 NameNodes
The initial implementation of HDFS NameNode high availability provided one active NameNode and one standby NameNode, replicating edits to a quorum of 3 JournalNodes; that architecture tolerates the failure of one node in the system.
This feature raises the fault-tolerance ceiling by allowing more standby NameNodes to run, meeting deployments with higher requirements: configuring 3 NameNodes and 5 JournalNodes lets the cluster tolerate the failure of two nodes.

- Changed default ports for important services
In previous Hadoop releases, the default ports of several important Hadoop services fell within the Linux ephemeral port range (32768-61000), meaning that at startup a service could conflict with a port already in use and fail to start. The conflicting ports have been moved out of the ephemeral range, affecting the NameNode, Secondary NameNode, DataNode, and KMS. The documentation has been updated accordingly; read the release notes for HDFS-9427 and HADOOP-12811 to learn the changed ports in detail.

- Filesystem connectors for Microsoft Azure Data Lake and Aliyun Object Storage
Hadoop supports integration with Microsoft Azure Data Lake and the Aliyun Object Storage System as Hadoop-compatible file systems.

- Intra-datanode balancer
A single DataNode manages multiple disks. During normal writes, the disks fill fairly evenly, but adding or replacing disks leaves a DataNode's disks severely imbalanced. The existing HDFS balancer only addresses inter-DataNode (inter) imbalance, not this intra-node case.
Hadoop 3's intra-DataNode balancing feature handles it, invoked through the hdfs diskbalancer CLI.

- Reworked daemon and task heap management
Hadoop 3 makes a series of changes to the heap management of Hadoop daemons and MapReduce tasks.
HADOOP-10950 introduces new ways to configure daemon heaps; notably, automatic tuning based on the host's memory is possible, and the HADOOP_HEAPSIZE configuration style is deprecated.
MAPREDUCE-5785 simplifies map and reduce task configuration: the desired heap no longer needs to be stated both in the task configuration and in the Java options. Existing configurations that state both explicitly are not affected by this change.

- S3Guard: consistent metadata caching for the S3A file-system client
HADOOP-13345 adds an optional feature to the S3A client for Amazon S3 storage: the ability to use a DynamoDB table as a fast, consistent store of file and directory metadata.

- HDFS Router-Based Federation
HDFS Router-Based Federation adds an RPC routing layer that provides a federated view of the HDFS namespace. It is similar in function to the existing ViewFs and HDFS Federation, the difference being that the mount table is managed server-side rather than in each client, which simplifies access to a federated cluster for existing HDFS clients.

- API-based Capacity Scheduler queue configuration
OrgQueue extends the Capacity Scheduler with a programmatic configuration method exposed through a REST API, so users can modify queue configuration via remote calls. Queue administrators (per the queue's administer_queue ACL) can thereby automate queue configuration management.

- YARN resource types
The YARN resource model has been generalized to support user-defined countable resource types beyond CPU and memory. Cluster administrators can define resources such as GPUs, software licenses, or locally attached storage, and YARN tasks can then be scheduled based on these resources.

Hive Managed Tables vs. External Tables

A table created without the external modifier is a managed (internal) table; one created with external is an external table.

Differences:

- A managed table's data is managed by Hive itself; an external table's data is managed by HDFS.
- A managed table's data is stored under hive.metastore.warehouse.dir (default: /user/hive/warehouse); an external table's data location is whatever you specify (if no LOCATION is given, Hive creates a folder named after the external table under /user/hive/warehouse on HDFS, and the table's data is stored there).
- Dropping a managed table deletes the metadata and the stored data directly; dropping an external table deletes only the metadata — the files on HDFS are not deleted.
- Changes to a managed table are synchronized to the metadata directly, whereas changes to an external table's structure or partitions require a repair (MSCK REPAIR TABLE table_name).

Let us run an experiment to build understanding.

Understanding the Concepts
Creating Managed Table t1

```sql
create table t1(
    id    int,
    name  string,
    hobby array<string>,
    add   map<string,string>
)
row format delimited
fields terminated by ','
collection items terminated by '-'
map keys terminated by ':';
```

2. Describe the table: desc t1

(figure)

Loading Data (t1)

Note: insert statements (insert / insert overwrite) are rarely used here — even counting the inserted rows would spawn a MapReduce job — so we choose the Load Data approach instead:

LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2, ...)]

Create a file and paste in the records shown:

(figure)

File contents:

```
1,xiaoming,book-TV-code,beijing:chaoyang-shagnhai:pudong
2,lilei,book-code,nanjing:jiangning-taiwan:taibei
3,lihua,music-book,heilongjiang:haerbin
```

Then load it:

load data local inpath '/home/hadoop/Desktop/data' overwrite into table t1;

Do not forget the file name "data" at the end of the path — the first time, I forgot it and uploaded the whole Desktop; the query then returned all NULLs and garbage.

Check the table contents:

select * from t1;

Creating External Table t2

```sql
create external table t2(
    id    int,
    name  string,
    hobby array<string>,
    add   map<string,string>
)
row format delimited
fields terminated by ','
collection items terminated by '-'
map keys terminated by ':'
location '/user/t2';
```

Loading Data (t2)

load data local inpath '/home/hadoop/Desktop/data' overwrite into table t2;
Checking File Locations

As the figure shows, browsing the NameNode at :50070/explorer.html#/user/ reveals the t2 folder.

(figure)

And t1? It is under the default path configured earlier.

(figure)

You can also get both locations from the command line:

desc formatted table_name;

(figures)

Note: in the output, MANAGED_TABLE marks the managed table and EXTERNAL_TABLE the external one.

Dropping the managed and external tables

Now drop each table and compare what happens.

(figure)

Observing the Files in HDFS

t1 is gone.

(figure)

t2 is still there:

(figure)

dropping the external table removed only its metadata.

Recreating External Table t2

```sql
create external table t2(
    id    int,
    name  string,
    hobby array<string>,
    add   map<string,string>
)
row format delimited
fields terminated by ','
collection items terminated by '-'
map keys terminated by ':'
location '/user/t2';
```

Without inserting any data, select * already returns rows:

(figure)

the data is evidently still there.
The Official Explanation

From the official documentation on external tables:

"A table created without the EXTERNAL clause is called a managed table because Hive manages its data."

Managed and External Tables

"By default Hive creates managed tables, where files, metadata and statistics are managed by internal Hive processes. A managed table is stored under the hive.metastore.warehouse.dir path property, by default in a folder path similar to /apps/hive/warehouse/databasename.db/tablename. The default location can be overridden by the location property during table creation. If a managed table or partition is dropped, the data and metadata associated with that table or partition are deleted. If the PURGE option is not specified, the data is moved to a trash folder for a defined duration.

Use managed tables when Hive should manage the lifecycle of the table, or when generating temporary tables.

An external table describes the metadata / schema on external files. External table files can be accessed and managed by processes outside of Hive. External tables can access data stored in sources such as Azure Storage Volumes (ASV) or remote HDFS locations. If the structure or partitioning of an external table is changed, an MSCK REPAIR TABLE table_name statement can be used to refresh metadata information.

Use external tables when files are already present or in remote locations, and the files should remain even if the table is dropped.

Managed or external tables can be identified using the DESCRIBE FORMATTED table_name command, which will display either MANAGED_TABLE or EXTERNAL_TABLE depending on table type.

Statistics can be managed on internal and external tables and partitions for query optimization."
Hive documentation:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-DescribeTableViewColumn
Hive Data Warehouse: Zipper, Transaction, Full, and Incremental Tables

1. Full table: a fresh snapshot of the latest state of the data every day.
2. Incremental table: each day's new data; incremental data is the new data since the last export.
3. Zipper table: maintains historical states while recording the latest state. Depending on the zipper granularity, it is effectively a snapshot series optimized to keep only the records that changed; through the zipper table you can conveniently reconstruct a customer record as it stood at any zipper point in time.
4. Transaction (flow) table: records every modification of a table, reflecting the actual history of record changes.

Zipper tables are commonly used to process an account's historical changes and keep the results; transaction tables are the history formed by each day's transactions.
Transaction tables serve statistics about business activity; zipper tables serve statistics about account and customer state.
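The core zipper-table mechanics — close the open record when a new state arrives, open a new record valid until '99991231' — can be sketched in plain Python (the field names here are illustrative):

```python
from datetime import datetime, timedelta

OPEN_END = "99991231"  # sentinel meaning "currently valid"

def day_before(yyyymmdd):
    return (datetime.strptime(yyyymmdd, "%Y%m%d") - timedelta(days=1)).strftime("%Y%m%d")

def apply_change(history, orderid, status, date):
    """Close this order's currently-open record (if any), then append the new state."""
    for rec in history:
        if rec["orderid"] == orderid and rec["end"] == OPEN_END:
            rec["end"] = day_before(date)
    history.append({"orderid": orderid, "status": status, "start": date, "end": OPEN_END})

history = []
apply_change(history, 1, "created", "20160820")
apply_change(history, 1, "paid", "20160821")
apply_change(history, 1, "completed", "20160822")

# Current state: the single open record.
print([r for r in history if r["end"] == OPEN_END])
# Snapshot as of 2016-08-21: the "paid" record (YYYYMMDD strings compare correctly).
print([r for r in history if r["start"] <= "20160821" <= r["end"]])
```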
数仓库拉链表(原理设计Hive中实现)

情况保持历史状态需拉链表做样做目保留状态情况节省空间

拉链表适种情况吧

数量点表中某字段变化呢变化频率高业务需求呢需统计种变化状态天全量份呢点太现实

仅浪费存储空间时业务统计点麻烦时拉链表作提现出节省空间满足需求

般数仓中通增加begin_dateen_date表示例两列start_dateend_date

orderid createtime modifiedtime status dw_start_date dw_end_date
1 20160820 20160820 created 20160820 20160820
1 20160820 20160821 paid 20160821 20160821
1 20160820 20160822 finished 20160822 99991231
2 20160820 20160820 created 20160820 20160820
2 20160820 20160821 finished 20160821 99991231
3 20160820 20160820 created 20160820 20160821
3 20160820 20160822 paid 20160822 99991231
4 20160821 20160821 created 20160821 20160821
4 20160821 20160822 paid 20160822 99991231
5 20160822 20160822 created 20160822 99991231
dw_start_date is the date a record's life cycle begins, and dw_end_date the date it ends.

dw_end_date = '99991231' means the record is currently in effect.

To query the currently effective records: select * from order_his where dw_end_date = '99991231';

To query the historical snapshot as of 20160821: select * from order_his where dw_start_date <= '20160821' and dw_end_date >= '20160821';
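The two queries above can be sketched in plain Python. This is a minimal illustration, not part of the article's Hive code; the function names current_rows and snapshot, the English status names, and the reduced column set are my own:

```python
# Each record carries dw_start_date / dw_end_date, mirroring the
# dw_orders_his example above (columns trimmed for brevity).
HIS = [
    # (orderid, status, dw_start_date, dw_end_date)
    (1, "created",  "20160820", "20160820"),
    (1, "paid",     "20160821", "20160821"),
    (1, "finished", "20160822", "99991231"),
    (2, "created",  "20160820", "20160820"),
    (2, "finished", "20160821", "99991231"),
]

def current_rows(his):
    """WHERE dw_end_date = '99991231' -- currently effective records."""
    return [r for r in his if r[3] == "99991231"]

def snapshot(his, day):
    """WHERE dw_start_date <= day AND dw_end_date >= day -- point-in-time view."""
    return [r for r in his if r[2] <= day <= r[3]]

print(current_rows(HIS))          # each order in its latest state
print(snapshot(HIS, "20160821"))  # order 1 'paid', order 2 'finished'
```

Because the dates are zero-padded yyyymmdd strings, plain string comparison orders them correctly, which is also why the Hive queries above can compare them with < and >.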

A brief walkthrough of updating a zipper table:

Assume the zipper granularity is one day, i.e. one state per day, the day's final state.

Take an order table as an example. The raw data is the daily order-status detail:

orderid createtime modifiedtime status
1 20160820 20160820 created
2 20160820 20160820 created
3 20160820 20160820 created
1 20160820 20160821 paid
2 20160820 20160821 finished
4 20160821 20160821 created
1 20160820 20160822 finished
3 20160820 20160822 paid
4 20160821 20160822 paid
5 20160822 20160822 created
According to the zipper-table design, the result we want is:

1 20160820 20160820 created 20160820 20160820
1 20160820 20160821 paid 20160821 20160821
1 20160820 20160822 finished 20160822 99991231
2 20160820 20160820 created 20160820 20160820
2 20160820 20160821 finished 20160821 99991231
3 20160820 20160820 created 20160820 20160821
3 20160820 20160822 paid 20160822 99991231
4 20160821 20160821 created 20160821 20160821
4 20160821 20160822 paid 20160822 99991231
5 20160822 20160822 created 20160822 99991231
As you can see, every state of orders 1-4 is available, and the currently effective state can also be read off directly.

The example below uses Hive and does not concern itself with performance.

First create the tables:

CREATE TABLE orders (
  orderid INT,
  createtime STRING,
  modifiedtime STRING,
  status STRING
) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';


CREATE TABLE ods_orders_inc (
  orderid INT,
  createtime STRING,
  modifiedtime STRING,
  status STRING
) PARTITIONED BY (day STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';


CREATE TABLE dw_orders_his (
  orderid INT,
  createtime STRING,
  modifiedtime STRING,
  status STRING,
  dw_start_date STRING,
  dw_end_date STRING
) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
First do a full load of the data up to 20160820.

Initialization: load the data as of 20160820 into the incremental table:

INSERT overwrite TABLE ods_orders_inc PARTITION (day = '20160820')
SELECT orderid, createtime, modifiedtime, status
FROM orders
WHERE createtime < '20160821' and modifiedtime < '20160821';
Then refresh it into the DW layer:

INSERT overwrite TABLE dw_orders_his
SELECT orderid, createtime, modifiedtime, status,
createtime AS dw_start_date,
'99991231' AS dw_end_date
FROM ods_orders_inc
WHERE day = '20160820';

The result:

select * from dw_orders_his;
OK
1 20160820 20160820 created 20160820 99991231
2 20160820 20160820 created 20160820 99991231
3 20160820 20160820 created 20160820 99991231
The remaining data is applied through incremental updates.


INSERT overwrite TABLE ods_orders_inc PARTITION (day = '20160821')
SELECT orderid, createtime, modifiedtime, status
FROM orders
WHERE (createtime = '20160821' and modifiedtime = '20160821') OR modifiedtime = '20160821';

select * from ods_orders_inc where day = '20160821';
OK
1 20160820 20160821 paid 20160821
2 20160820 20160821 finished 20160821
4 20160821 20160821 created 20160821
First load the increments into the incremental table, then join against the history, write the result into a temporary table, and finally insert the temporary table back into the zipper table:


DROP TABLE IF EXISTS dw_orders_his_tmp;
CREATE TABLE dw_orders_his_tmp AS
SELECT orderid,
createtime,
modifiedtime,
status,
dw_start_date,
dw_end_date
FROM (
SELECT a.orderid,
a.createtime,
a.modifiedtime,
a.status,
a.dw_start_date,
CASE WHEN b.orderid IS NOT NULL AND a.dw_end_date > '20160820' THEN '20160820' ELSE a.dw_end_date END AS dw_end_date
FROM dw_orders_his a
left outer join (SELECT * FROM ods_orders_inc WHERE day = '20160821') b
ON (a.orderid = b.orderid)
UNION ALL
SELECT orderid,
createtime,
modifiedtime,
status,
modifiedtime AS dw_start_date,
'99991231' AS dw_end_date
FROM ods_orders_inc
WHERE day = '20160821'
) x
ORDER BY orderid, dw_start_date;

INSERT overwrite TABLE dw_orders_his
SELECT * FROM dw_orders_his_tmp;
After applying the same steps to 20160822's data, the result is:


select * from dw_orders_his;
OK
1 20160820 20160820 created 20160820 20160820
1 20160820 20160821 paid 20160821 20160821
1 20160820 20160822 finished 20160822 99991231
2 20160820 20160820 created 20160820 20160820
2 20160820 20160821 finished 20160821 99991231
3 20160820 20160820 created 20160820 20160821
3 20160820 20160822 paid 20160822 99991231
4 20160821 20160821 created 20160821 20160821
4 20160821 20160822 paid 20160822 99991231
5 20160822 20160822 created 20160822 99991231
This is exactly the data we wanted.

Note that an order may be updated several times within one day. Since the granularity here is one state per day (the day's final state), if an order goes created -> paid -> finished within the same day, only the final state should be pulled into the zipper-table update; otherwise anomalies like the following appear, with three simultaneously open rows for order 6:
6 20160822 20160822 created 20160822 99991231
6 20160822 20160822 paid 20160822 99991231
6 20160822 20160822 finished 20160822 99991231
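The core of the dw_orders_his_tmp SQL above (close open rows that changed, then append the day's increments as new open rows) can be sketched in plain Python. This is an illustration only; the function name zipper_update, the tuple layout, and the English status names are my own, not from the article:

```python
OPEN_END = "99991231"  # sentinel end date for the currently effective row

def zipper_update(history, increments, close_date):
    """Mirror the dw_orders_his_tmp SQL: close still-open history rows for
    changed orders at close_date (the day before the increment partition),
    then UNION ALL the day's increments as new open rows."""
    changed = {r[0] for r in increments}  # orderids touched today
    out = []
    for orderid, status, start, end in history:
        # CASE WHEN b.orderid IS NOT NULL AND dw_end_date > close_date ...
        if orderid in changed and end > close_date:
            end = close_date
        out.append((orderid, status, start, end))
    for orderid, status, modifiedtime in increments:
        # modifiedtime AS dw_start_date, '99991231' AS dw_end_date
        out.append((orderid, status, modifiedtime, OPEN_END))
    return sorted(out)

# State after the 20160820 full load, plus the 20160821 increments.
history = [(1, "created", "20160820", OPEN_END),
           (2, "created", "20160820", OPEN_END),
           (3, "created", "20160820", OPEN_END)]
inc_0821 = [(1, "paid", "20160821"),
            (2, "finished", "20160821"),
            (4, "created", "20160821")]
print(zipper_update(history, inc_0821, "20160820"))
```

Order 3 had no increment, so its row stays open, while orders 1 and 2 get their "created" rows closed and a new open row each, matching the expected table above.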


Hadoop 3.2.0 fully distributed cluster setup
I. Cluster environment setup
First prepare 4 servers (virtual machines).
Set static IP addresses and hostname mappings (CentOS 7: configure a static IP and hostname mapping).

Then set up passwordless SSH login across the cluster, plus a file-distribution script (CentOS 7: configure cluster-wide passwordless SSH, including an xsync file-distribution script).
If the firewall is running, remember to turn it off:
systemctl stop firewalld.service
systemctl disable firewalld.service
Download hadoop-3.2.0 and upload it to the /root/hadoop directory on each of the 4 servers.
Download address: http://archive.apache.org/dist/hadoop/common/hadoop-3.2.0/hadoop-3.2.0.tar.gz
After uploading to the 4 machines, extract it on each:
cd /root/hadoop
tar -zxvf /root/hadoop/hadoop-3.2.0.tar.gz
 
Install JDK 1.8: download the JDK, extract it to /root/jdk, and rename the directory to jdk8.
Download address: 
https://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html

Edit /etc/profile to set the environment variables:
vim /etc/profile
After the line "export PATH USER LOGNAME MAIL HOSTNAME HISTSIZE HISTCONTROL", add:
export JAVA_HOME=/root/jdk/jdk8
export JRE_HOME=/root/jdk/jdk8/jre
export HADOOP_HOME=/root/hadoop/hadoop-3.2.0
PATH=$PATH:$HOME/bin:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:/root/bin
export LD_LIBRARY_PATH=$HADOOP_HOME/lib/native
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
 
Reload the profile file:
source /etc/profile
Test that the environment variables took effect:
java -version
hadoop version

II. Modifying the Hadoop configuration
Enter the hadoop configuration directory:
cd /root/hadoop/hadoop-3.2.0/etc/hadoop
Modify hadoop-env.sh to set the JDK path and define the users that operate the cluster.
Add at the top:
export JAVA_HOME=/root/jdk/jdk8

export HDFS_NAMENODE_USER=root
export HDFS_DATANODE_USER=root
export HDFS_SECONDARYNAMENODE_USER=root
export YARN_RESOURCEMANAGER_USER=root
export YARN_NODEMANAGER_USER=root

export HADOOP_PID_DIR=/root/hadoop/data/pids
export HADOOP_LOG_DIR=/root/hadoop/data/logs
Modify core-site.xml, the hadoop core configuration:
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://hadoop1:8020</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/root/hadoop/data/tmp</value>
  </property>
</configuration>
· fs.defaultFS is the NameNode address; hadoop.tmp.dir is hadoop's temporary directory.
 
Modify hdfs-site.xml, the hadoop node configuration:
<configuration>
  <property>
    <name>dfs.namenode.http-address</name>
    <value>hadoop1:9870</value>
  </property>
  <property>
    <name>dfs.namenode.secondary.http-address</name>
    <value>hadoop2:50090</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///root/hadoop/data/hdfs/name</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///root/hadoop/data/hdfs/data</value>
  </property>
</configuration>
· dfs.replication is the number of replicas.
· dfs.namenode.secondary.http-address specifies the SecondaryNameNode HTTP access address and port.
· Here hadoop2 is used as the SecondaryNameNode server.
 
Modify workers to tell hadoop which nodes are the HDFS DataNode nodes:
hadoop2
hadoop3
hadoop4
 
Modify yarn-site.xml to configure the YARN services:
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.localizer.address</name>
    <value>0.0.0.0:8140</value>
  </property>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>hadoop1</value>
  </property>
  <property>
    <name>yarn.resourcemanager.webapp.address</name>
    <value>hadoop1:8088</value>
  </property>
  <property>
    <name>yarn.log-aggregation-enable</name>
    <value>true</value>
  </property>
  <property>
    <name>yarn.log-aggregation.retain-seconds</name>
    <value>604800</value>
  </property>
  <property>
    <name>yarn.log.server.url</name>
    <value>http://hadoop4:19888/jobhistory/logs</value>
  </property>
</configuration>
· yarn.resourcemanager.webapp.address configures the ResourceManager web address and port.
· yarn.resourcemanager.hostname specifies the ResourceManager server.
· yarn.log-aggregation-enable configures whether log aggregation is enabled.
· yarn.log-aggregation.retain-seconds configures how long aggregated logs are kept on HDFS.
· yarn.log.server.url configures the YARN log server address.
Modify the mapred-site.xml file:
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <property>
    <name>yarn.app.mapreduce.am.env</name>
    <value>HADOOP_MAPRED_HOME=/root/hadoop/hadoop-3.2.0</value>
  </property>
  <property>
    <name>mapreduce.map.env</name>
    <value>HADOOP_MAPRED_HOME=/root/hadoop/hadoop-3.2.0</value>
  </property>
  <property>
    <name>mapreduce.reduce.env</name>
    <value>HADOOP_MAPRED_HOME=/root/hadoop/hadoop-3.2.0</value>
  </property>
  <property>
    <name>mapreduce.jobhistory.address</name>
    <value>hadoop4:10020</value>
  </property>
  <property>
    <name>mapreduce.jobhistory.webapp.address</name>
    <value>hadoop4:19888</value>
  </property>
</configuration>
· yarn.app.mapreduce.am.env, mapreduce.map.env, and mapreduce.reduce.env
· all three point MapReduce at the hadoop directory; without them, MapReduce jobs fail with errors such as "main class not found".
· mapreduce.jobhistory.address configures the job history server address.
· mapreduce.jobhistory.webapp.address configures the history server web access address.
 
After modifying the configuration files, distribute them to the other three servers.
From the current directory, run the xsync distribution script:
xsync hadoop-env.sh
xsync core-site.xml
xsync hdfs-site.xml
xsync workers
xsync yarn-site.xml
xsync mapred-site.xml
 
Configuration is complete.
III. Starting the Hadoop services
On the hadoop1 node, run the NameNode format command (only needed before the first start):
hdfs namenode -format

If it succeeds, the following directory is generated:
cd /root/hadoop/data/dfs/name

A cluster-unique ID being generated means the command succeeded.
On hadoop1, run:
start-dfs.sh
start-yarn.sh
Or simply run:
start-all.sh
On hadoop4, start the job history service:
mapred --daemon start historyserver
When done, run jps on each of the 4 machines to check the processes.

If startup succeeded:
· hadoop1: NameNode, ResourceManager
· hadoop2: SecondaryNameNode, DataNode, NodeManager
· hadoop3: DataNode, NodeManager
· hadoop4: DataNode, NodeManager, JobHistoryServer
View the HDFS web UI: http://hadoop1:9870

View the YARN web UI: http://hadoop1:8088

The hadoop setup is complete.
IV. Running WordCount
First create a txt file in the /root directory:
vim /root/test.txt

hadoop 1
hadoop 2
hadoop 3
hadoop 4
hadoop 5
hadoop 6
hadoop 7
hadoop 8
hadoop 9
hadoop 10
Upload the test.txt file to HDFS:
hdfs dfs -mkdir -p /user/root
hdfs dfs -put /root/test.txt /user/root
Locate the official example jar:
cd /root/hadoop/hadoop-3.2.0/share/hadoop/mapreduce

Run the jar to execute the MapReduce WordCount example:
hadoop jar hadoop-mapreduce-examples-3.2.0.jar wordcount /user/root/test.txt /root/output
· The first path after wordcount is the input file path.
· The second path is the result output path (it must not already exist).

After it finishes successfully, check the result:
hdfs dfs -ls -R /root/output

_SUCCESS means the job succeeded.
part-r-00000: m would mean mapper output, r means reduce output; 00000 is the task number within the job. The whole file is the job's result output file.
hdfs dfs -cat /root/output/part-r-00000

The word "hadoop" appears 10 times in the file, so the result is correct.
Downloading and installing MySQL on Linux
1. Download the install package from the official site
Download link:
https://dev.mysql.com/downloads/mysql/

If your system is 32-bit choose the first package; for 64-bit choose the second.
Or download with wget:
wget https://dev.mysql.com/get/Downloads/MySQL-8.0/mysql-8.0.11-linux-glibc2.12-i686.tar.gz
Extract the file:
tar -zxvf mysql-8.0.11-linux-glibc2.12-i686.tar.gz
2. Move the extracted package to the /usr/local directory and rename it:
mv /root/mysql-8.0.11-linux-glibc2.12-i686 /usr/local/mysql
3. Create a data folder under the MySQL root directory to store data:
mkdir data
4. Create the mysql user group and mysql user:
groupadd mysql
useradd -g mysql mysql
 
5. Change ownership of the mysql directory:
chown -R mysql:mysql /usr/local/mysql

or:
chown -R mysql .
chgrp -R mysql .
Note: do this before initializing the database.
 
6. Initialize the database
Create the mysql_install_db installation directory:
mkdir mysql_install_db
chmod 777 mysql_install_db
Initialize:
bin/mysqld --initialize --user=mysql --basedir=/usr/local/mysql --datadir=/usr/local/mysql/data

/usr/local/mysql/bin/mysqld --initialize --user=mysql
Example output:
/usr/local/mysql/bin/mysqld (mysqld 8.0.11) initializing of server in progress as process 5826
[Server] A temporary password is generated for root@localhost: twiTlsi<0O
/usr/local/mysql/bin/mysqld (mysqld 8.0.11) initializing of server has completed
Record the temporary password:
twiTlsi<0O
 
A problem encountered here: libnuma.so.1 was missing:
 
zsh: command not found: mysqld
bin/mysqld --initialize
bin/mysqld: error while loading shared libraries: libnuma.so.1: cannot open shared object file: No such file or directory
2018-04-29 17:06:30 [WARNING] mysql_install_db is deprecated. Please consider switching to mysqld --initialize
2018-04-29 17:06:30 [ERROR] Can't locate the language directory
Install libnuma:
yum install libnuma
yum -y install numactl
yum install libaio1 libaio-dev
Then rerun the installation.
 
7. Configure mysql
cp /usr/local/mysql/support-files/mysql.server /etc/init.d/mysqld
Edit the my.cnf file:
vim /etc/my.cnf
[mysqld]
basedir = /usr/local/mysql
datadir = /usr/local/mysql/data
socket = /usr/local/mysql/mysql.sock
character-set-server=utf8
port = 3306
sql_mode=NO_ENGINE_SUBSTITUTION,STRICT_TRANS_TABLES
[client]
socket = /usr/local/mysql/mysql.sock
default-character-set=utf8
Press Esc, then :wq to save and quit.
 
 
8. Set up the MySQL service
cp -a support-files/mysql.server /etc/init.d/mysqld
cp mysql.server /etc/init.d/mysql
chmod +x /etc/init.d/mysql
Add it as a system service:
chkconfig --add mysql
Or:
cp -a support-files/mysql.server /etc/init.d/mysqld
chmod +x /etc/rc.d/init.d/mysqld
chkconfig --add mysqld
Check that the service is registered:
chkconfig --list mysqld
9. Configure global environment variables
Edit the /etc/profile file:
# vi /etc/profile
Add these two lines at the bottom of profile, then save and quit:
export PATH=$PATH:/usr/local/mysql/bin:/usr/local/mysql/lib
export PATH
Make the environment variables take effect immediately:
source /etc/profile
10. Start the MySQL service:
service mysql start
Check the initial password:
cat /root/.mysql_secret
11. Log in to MySQL:
mysql -uroot -p   # then enter the password
Change the password:
SET PASSWORD FOR 'root'@localhost = PASSWORD('123456');   # replace with your own password

12. Enable remote login
mysql> use mysql;
mysql> update user set host='%' where user='root' limit 1;
Refresh privileges:
mysql> flush privileges;
Then check whether port 3306 is listening:
netstat -nupl | grep 3306
Open port 3306:
firewall-cmd --permanent --add-port=3306/tcp
Reload the firewall:
firewall-cmd --reload

Installing MySQL with yum
Installation environment: Aliyun Linux (Alibaba's 64-bit Linux system).
cat /etc/os-release

getconf LONG_BIT


First remove anything MySQL-related that shipped with the system.
Find it:
find / -name mysql
Delete it:
rm -rf <each path found above, separated by spaces>
# Or in one command:
find / -name mysql | xargs rm -rf

Start the installation:
rpm -Uvh https://repo.mysql.com/mysql57-community-release-el7-11.noarch.rpm

yum --enablerepo=mysql80-community install mysql-community-server

The next step asks a couple of questions, roughly:

Total download size is 371M; continue?
Type y, then press Enter.

Retrieving the key from the file and importing MySQL's GPG key; is that OK?
Type y, then press Enter.

Complete! Done.
Start mysql:
service mysqld start

Next, look up the default password mysql generated; it is needed for the first login and configuration:
grep 'A temporary password' /var/log/mysqld.log

With the default password in hand, configure mysql:
mysql_secure_installation




Log in to the database: mysql -u root -p

Done.
One reminder: this Alibaba Cloud image ships with the firewall disabled, so no firewall setup was needed here; but if you want external connections to the database, remember to check whether the server's security group opens the default database port 3306.
Then, inside the mysql database, run: update user set host='%' where user='root';
Now tools such as SQLyog can connect to the database.
sqlyog等工具连接数库
A pitfall:
When connecting with SQLyog you may see the error: Authentication plugin 'caching_sha2_password' cannot be loaded.
The client does not support the caching_sha2_password password-encryption scheme.
Switch the password to the legacy authentication method:
Log in to the database and enter the mysql schema:
update user set host='%' where user='root';
Restart: service mysqld restart
ALTER USER 'root'@'%' IDENTIFIED WITH mysql_native_password BY '<new password>';
Restart: service mysqld restart

Here Abc123456a is the new password.

After the change, quit and try connecting with SQLyog again.

Connection succeeds.

Hive environment setup
Prerequisites:
Before installing Hive, install:
1. JDK 8
2. Hadoop 2.7.7
3. MySQL
I. Installation
1. Download Hive and extract it into the user directory:
tar -xzvf apache-hive-2.3.6-bin.tar.gz
Rename it:
mv apache-hive-2.3.6-bin hive
2. Set the environment variables.
II. Configuration management
First enter the conf directory and remove the .template suffix from the template files;
hive-default.xml, besides losing the suffix, must also be renamed to hive-site.xml.
1. Configure Hive as follows:
1.1 Modify hive-env.sh
cp hive-env.sh.template hive-env.sh
Hive uses Hadoop, so the Hadoop installation path must be specified in the hive-env.sh file:

vim hive-env.sh

Add these lines to the configuration file:

export JAVA_HOME=/usr/local/hadoop/jdk1.8.0_221
export HADOOP_HOME=/usr/local/hadoop/hadoop-2.7.7
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export HIVE_HOME=/usr/local/hive
export HIVE_CONF_DIR=$HIVE_HOME/conf
export HIVE_AUX_JARS_PATH=$HIVE_HOME/lib
1.2 Modify hive-log4j2.properties to configure the hive log:
cp hive-log4j2.properties.template hive-log4j2.properties

vim conf/hive-log4j2.properties

Set the following property (if there is no logs directory, create one under the hive root):

property.hive.log.dir = /usr/local/hive/logs
1.3 Under /usr/local/hive, create a tmp directory, and under tmp a hive directory:
cd /usr/local/hive
mkdir tmp
mkdir tmp/hive
1.4 Modify hive-site.xml
cp hive-default.xml.template hive-site.xml

In the hive-site.xml file:

replace ${system:java.io.tmpdir} with /home/hduser/hive/tmp,

and replace ${system:user.name} with the node name (here 192.168.8.101).
2) Configure the MySQL database connection information in hive-site.xml.
The settings below modify existing entries rather than being appended; if needed, copy the file out, search (Ctrl+F) for each property, change the value, and copy the file back.

<!-- set the following properties -->
<property>
  <name>hive.exec.scratchdir</name>
  <value>/tmp/hive</value>
</property>
<property>
  <name>hive.exec.local.scratchdir</name>
  <value>/usr/local/hive/tmp/hive</value>
  <description>Local scratch space for Hive jobs</description>
</property>
<property>
  <name>hive.downloaded.resources.dir</name>
  <value>/usr/local/hive/tmp/${hive.session.id}_resources</value>
  <description>Temporary local directory for added resources in the remote file system</description>
</property>
<property>
  <name>hive.querylog.location</name>
  <value>/usr/local/hive/tmp/hive</value>
  <description>Location of Hive run time structured log file</description>
</property>
<property>
  <name>hive.aux.jars.path</name>
  <value>/usr/local/hive/lib,/usr/local/hive/jdbc</value>
  <description>These JAR file are available to all users for all jobs</description>
</property>
<property>
  <name>hive.metastore.warehouse.dir</name>
  <value>hdfs://192.168.8.101:9000/user/hive/warehouse</value>
  <!-- directory relative to fs.default.name where managed tables are stored -->
</property>
<!-- configure the Hive Metastore -->
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://192.168.8.101:3306/hive?createDatabaseIfNotExist=true&amp;characterEncoding=UTF8</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
  <!-- newer driver versions need com.mysql.cj.jdbc.Driver -->
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>root</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>123</value>
  <!-- your mysql password -->
</property>
<!-- configure the hiveserver2 host (an IP address here makes it easy to connect from Windows) -->
<property>
  <name>hive.server2.thrift.bind.host</name>
  <value>192.168.8.101</value>
  <description>Bind host on which to run the HiveServer2 Thrift service</description>
</property>
<!-- username/password used when beeline clients connect remotely; the username should match what is configured in hadoop's core-site.xml -->
<property>
  <name>hive.server2.thrift.client.user</name>
  <value>192.168.8.101</value>
  <description>Username to use against thrift client; default is 'anonymous'</description>
</property>
<property>
  <name>hive.server2.thrift.client.password</name>
  <value>123</value>
  <!-- the host user's password -->
  <description>Password to use against thrift client; default is 'anonymous'</description>
</property>
<!-- the property below configures the hive 2.x web ui -->
<property>
  <name>hive.server2.webui.host</name>
  <value>192.168.8.101</value>
</property>
<!-- after restarting HiveServer2, visit http://172.16.212.17:10002 -->

3) Configure the Hive Metastore
By default Hive stores its metadata in an embedded Derby database; in production, MySQL is normally used to hold the Hive metadata instead.
Put mysql-connector-java-x.x.x.jar (the MySQL JDBC driver) into $HIVE_HOME/lib.
Note: the mysql-connector-java jar version must match your MySQL version; a version that is too low causes compatibility errors (this one caused a long debugging session: sometimes it is user error, sometimes a version problem).
III. Running
1. Run the Hive CLI
Before running the hive command, HDFS must be up; start it with start-dfs.sh.
(Note: since Hive 2.1, the schematool command must be run once for initialization before running hive for the first time.)
1. If using a MySQL database:
First start the mysql server:
systemctl enable mysqld.service
· Run the initialization:
schematool -initSchema -dbType mysql
After it succeeds, check in MySQL that the hive metadata database was created.
2. If using the Derby database:
schematool -initSchema -dbType derby
Enter the hive command line:
hive
(or: hive --service metastore &)
show tables lists the tables:
hive> show tables;

Quit hive:
hive> quit;
Create a database, then create tables:
hive> drop table chun;
OK
Time taken: 1.125 seconds
hive> create database chun;
OK
Time taken: 0.099 seconds
hive> use chun;
OK
Time taken: 0.024 seconds
·
Then open the HDFS web UI to view the Hive warehouse:
enter 192.168.228.138:50070 in a browser to see the tables just created.

Apache Mahout environment setup
1. Download and extract Mahout
http://archive.apache.org/dist/mahout/
tar -zxvf mahout-distribution-0.9.tar.gz

2. Configure environment variables
# set mahout environment
export MAHOUT_HOME=/mnt/jediael/mahout/mahout-distribution-0.9
export MAHOUT_CONF_DIR=$MAHOUT_HOME/conf
export PATH=$MAHOUT_HOME/conf:$MAHOUT_HOME/bin:$PATH

3. Install mahout
[jediael@master mahout-distribution-0.9]$ pwd
/mnt/jediael/mahout/mahout-distribution-0.9
[jediael@master mahout-distribution-0.9]$ mvn install

4. Verify that Mahout installed successfully
Run the mahout command; if it lists the algorithms, the install succeeded:
[jediael@master mahout-distribution-0.9]$ mahout
Running on hadoop, using /mnt/jediael/hadoop-1.2.1/bin/hadoop and HADOOP_CONF_DIR=
MAHOUT-JOB: /mnt/jediael/mahout/mahout-distribution-0.9/examples/target/mahout-examples-0.9-job.jar
An example program must be given as the first argument
Valid program names are
arffvector Generate Vectors from an ARFF file or directory
baumwelch BaumWelch algorithm for unsupervised HMM training
canopy Canopy clustering
cat Print a file or resource as the logistic regression models would see it
cleansvd Cleanup and verification of SVD output
clusterdump Dump cluster output to text
clusterpp Groups Clustering Output In Clusters
cmdump Dump confusion matrix in HTML or text formats
concatmatrices Concatenates 2 matrices of same cardinality into a single matrix
cvb LDA via Collapsed Variation Bayes (0th deriv approx)
cvb0_local LDA via Collapsed Variation Bayes in memory locally
evaluateFactorization compute RMSE and MAE of a rating matrix factorization against probes
fkmeans Fuzzy Kmeans clustering
hmmpredict Generate random sequence of observations by given HMM
itemsimilarity Compute the itemitemsimilarities for itembased collaborative filtering
kmeans Kmeans clustering
lucenevector Generate Vectors from a Lucene index
lucene2seq Generate Text SequenceFiles from a Lucene index
matrixdump Dump matrix in CSV format
matrixmult Take the product of two matrices
parallelALS ALSWR factorization of a rating matrix
qualcluster Runs clustering experiments and summarizes results in a CSV
recommendfactorized Compute recommendations using the factorization of a rating matrix
recommenditembased Compute recommendations using itembased collaborative filtering
regexconverter Convert text files on a per line basis based on regular expressions
resplit Splits a set of SequenceFiles into a number of equal splits
rowid Map SequenceFile to {SequenceFile SequenceFile}
rowsimilarity Compute the pairwise similarities of the rows of a matrix
runAdaptiveLogistic Score new production data using a probably trained and validated AdaptivelogisticRegression model
runlogistic Run a logistic regression model against CSV data
seq2encoded Encoded Sparse Vector generation from Text sequence files
seq2sparse Sparse Vector generation from Text sequence files
seqdirectory Generate sequence files (of Text) from a directory
seqdumper Generic Sequence File dumper
seqmailarchives Creates SequenceFile from a directory containing gzipped mail archives
seqwiki Wikipedia xml dump to sequence file
spectralkmeans Spectral kmeans clustering
split Split Input data into test and train sets
splitDataset split a rating dataset into training and probe parts
ssvd Stochastic SVD
streamingkmeans Streaming kmeans clustering
svd Lanczos Singular Value Decomposition
testnb Test the Vectorbased Bayes classifier
trainAdaptiveLogistic Train an AdaptivelogisticRegression model
trainlogistic Train a logistic regression using stochastic gradient descent
trainnb Train the Vectorbased Bayes classifier
transpose Take the transpose of a matrix
validateAdaptiveLogistic Validate an AdaptivelogisticRegression model against holdout data set
vecdist Compute the distances between a set of Vectors (or Cluster or Canopy they must fit in memory) and a list of Vectors
vectordump Dump vectors from a sequence file to text
viterbi Viterbi decoding of hidden states from given output states sequence



II. Verify mahout with a simple example
1. Start Hadoop.
2. Download the test data:
from http://archive.ics.uci.edu/ml/databases/synthetic_control/, the file synthetic_control.data;
this sample data is also easy to find via a search engine.
3. Upload the test data:
hadoop fs -put synthetic_control.data testdata
4. Run Mahout's k-means clustering algorithm:
mahout org.apache.mahout.clustering.syntheticcontrol.kmeans.Job
Clustering takes about 9 minutes to complete.
5. View the clustering results
Run hadoop fs -ls /user/root/output to see the clustering results:

[jediael@master mahout-distribution-0.9]$ hadoop fs -ls output
Found 15 items
-rw-r--r--   2 jediael supergroup  194 2015-03-07 15:07 /user/jediael/output/_policy
drwxr-xr-x   - jediael supergroup    0 2015-03-07 15:07 /user/jediael/output/clusteredPoints
drwxr-xr-x   - jediael supergroup    0 2015-03-07 15:02 /user/jediael/output/clusters-0
drwxr-xr-x   - jediael supergroup    0 2015-03-07 15:02 /user/jediael/output/clusters-1
drwxr-xr-x   - jediael supergroup    0 2015-03-07 15:07 /user/jediael/output/clusters-10-final
drwxr-xr-x   - jediael supergroup    0 2015-03-07 15:03 /user/jediael/output/clusters-2
drwxr-xr-x   - jediael supergroup    0 2015-03-07 15:03 /user/jediael/output/clusters-3
drwxr-xr-x   - jediael supergroup    0 2015-03-07 15:04 /user/jediael/output/clusters-4
drwxr-xr-x   - jediael supergroup    0 2015-03-07 15:04 /user/jediael/output/clusters-5
drwxr-xr-x   - jediael supergroup    0 2015-03-07 15:05 /user/jediael/output/clusters-6
drwxr-xr-x   - jediael supergroup    0 2015-03-07 15:05 /user/jediael/output/clusters-7
drwxr-xr-x   - jediael supergroup    0 2015-03-07 15:06 /user/jediael/output/clusters-8
drwxr-xr-x   - jediael supergroup    0 2015-03-07 15:07 /user/jediael/output/clusters-9
drwxr-xr-x   - jediael supergroup    0 2015-03-07 15:02 /user/jediael/output/data
drwxr-xr-x   - jediael supergroup    0 2015-03-07 15:02 /user/jediael/output/randomSeeds
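For intuition about what the job above computes, here is a minimal 1-D k-means sketch in plain Python. It is purely illustrative: Mahout runs a distributed, multi-dimensional version over MapReduce, and the function name kmeans_1d and the sample data are my own:

```python
def kmeans_1d(points, centers, iters=10):
    """Tiny 1-D k-means: alternate assignment and centroid update."""
    for _ in range(iters):
        clusters = {c: [] for c in centers}
        for p in points:                         # assignment step
            nearest = min(centers, key=lambda c: abs(c - p))
            clusters[nearest].append(p)
        centers = [sum(ps) / len(ps) if ps else c  # update step
                   for c, ps in clusters.items()]
    return sorted(centers)

# Two obvious groups around 1.0 and 9.5.
points = [1.0, 1.2, 0.8, 9.0, 9.5, 10.0]
print(kmeans_1d(points, [0.0, 5.0]))
```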

PySpark environment setup
Downloading Python 3.8 on Linux
1. Install dependencies
yum -y install zlib-devel bzip2-devel openssl-devel ncurses-devel sqlite-devel readline-devel tk-devel gdbm-devel db4-devel libpcap-devel xz-devel libffi-devel

yum -y install gcc
2. Download the package
Download address: https://www.python.org/ftp/python/3.8.0/Python-3.8.0.tgz
Copy it onto the Linux system.
3. Extract
# extract
tar -zxvf Python-3.8.0.tgz -C <target folder, optional>
Enter the extracted directory.

4. Install
./configure --prefix=/usr/local/python3
make && make install
5. Add soft links
ln -s /usr/local/python3/bin/python3.8 /usr/bin/python3
ln -s /usr/local/python3/bin/pip3.8 /usr/bin/pip3
6. Test
python3 -V
pip3 -V
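As a complement to the command-line checks above, the interpreter version can also be verified programmatically; this small snippet is my own illustration (the function name meets_minimum is not from the article):

```python
import sys

def meets_minimum(required=(3, 8)):
    """True if the running interpreter is at least the required version."""
    return sys.version_info[:2] >= required

print(meets_minimum((3, 0)))
```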



Upgrading to Python 3.8 on Linux and fixing pip and yum
I. Check the version
Before installing, check whether Python is already installed. The system ships with Python 2.7.5, which does not need to be removed; we install Python 3.8.1 alongside it:
python -V
II. Install Python 3.8.1
Official download page: https://www.python.org/downloads/source/

# extract
tar -zxf Python-3.8.1.tgz
# install dependency packages
yum install zlib-devel bzip2-devel openssl-devel ncurses-devel sqlite-devel readline-devel tk-devel gcc libffi-devel
# enter the python directory
cd Python-3.8.1
# configure the build
./configure --prefix=/usr/local/python3
# build and install
make && make install
Back up the system's default python.
The system ships with Python 2.7.5; to avoid a file-name clash, rename it to python2.7.5,
either directly in Xftp or with the command:
mv /usr/bin/python /usr/bin/python2.7.5
Create new soft links.
A soft link is like a Windows shortcut: a quick way to launch a program or open a file without navigating to it first.
ln -s /usr/local/python3/bin/python3.8 /usr/bin/python
ln -s /usr/local/python3/bin/python3.8 /usr/bin/python3
After these two commands, both the python and python3 commands point to Python 3.8.
Copy-pasting these commands can occasionally mangle the dashes:
ln: invalid option -- ' '
Try 'ln --help' for more information.
If that happens, retype the two soft-link commands by hand.
Check the python version; if installation succeeded it shows Python 3.8.1:
python -V

python3 -V
III. Fix the yum configuration
After upgrading to Python 3.8, the yum command no longer runs, because it depends on Python 2; point its scripts back at it.
In the two files /usr/bin/yum and /usr/libexec/urlgrabber-ext-down, change #!/usr/bin/python to #!/usr/bin/python2.7:
vi /usr/bin/yum
vi /usr/libexec/urlgrabber-ext-down


IV. Configure pip3
After installing Python 3.8.1, pip install would still download packages into the bundled Python 2.7's pip.
The pip soft link still points at Python 2.7; replace the old Python 2.7 pip with the Python 3.8 one.
Back up the 2.7 soft link:
mv /usr/bin/pip /usr/bin/pip2.7.5
Create the pip3 soft links; pip3 lives in the bin directory of the Python install path:
ln -s /usr/local/python3/bin/pip3 /usr/bin/pip

ln -s /usr/local/python3/bin/pip3 /usr/bin/pip3
Check the pip version:
pip -V

pip3 -V
V. Removing and reinstalling yum
1. Remove yum:
rpm -qa | grep yum | xargs rpm -e --nodeps
2. Check the Linux distribution version:
cat /etc/redhat-release
3. Check the kernel/architecture:
file /bin/ls
4. Reinstall yum
Download and install the packages from: http://mirrors.163.com
Run the following 3 commands in order; if the URLs have moved, find the packages on that site and adjust the paths:
rpm -ivh --nodeps http://mirrors.163.com/centos/7.8.2003/os/x86_64/Packages/yum-metadata-parser-1.1.4-10.el7.x86_64.rpm

rpm -ivh --nodeps http://mirrors.163.com/centos/7.8.2003/os/x86_64/Packages/yum-plugin-fastestmirror-1.1.31-53.el7.noarch.rpm

rpm -ivh --nodeps http://mirrors.163.com/centos/7.8.2003/os/x86_64/Packages/yum-3.4.3-167.el7.centos.noarch.rpm
If a Python 3.x version is installed, remember to fix the interpreter line in the yum scripts, as described in part III above.
Installing a Python 3.8 environment on Linux and uninstalling the old Python
Prerequisites
First make sure the machine is connected to the network (see elsewhere for connecting a Linux VM to the network), then set up a network yum source:
cd /etc/yum.repos.d
rm -rf *
wget http://mirrors.163.com/.help/CentOS7-Base-163.repo
yum clean all
yum makecache
Install the build environment:
yum install gcc patch libffi-devel python-devel zlib-devel bzip2-devel openssl-devel ncurses-devel sqlite-devel readline-devel tk-devel gdbm-devel db4-devel libpcap-devel xz-devel -y
Download the Python 3.8 source code
Windows download address: python 3.8
Or download inside Linux with the wget command:
wget https://www.python.org/ftp/python/3.8.0/Python-3.8.0a2.tgz
If the download speed is disheartening, it is also shared on Baidu Netdisk for direct download:
link: https://pan.baidu.com/s/1O5W8G66nKoFVphheedNAfQ
code: ysem
Install Python 3.8
Put the tarball onto the Linux machine, then run:
tar -zxf Python-3.8.0a2.tgz
cd Python-3.8.0a2
./configure --prefix=/usr/local/python_38
make -j 4 && make install
Configuration
Configure the PATH variable:
ln -s /usr/local/python_38/bin/* /usr/bin/

[root@test2 ~]# python3.8
Python 3.8.0a2 (default, Mar 29 2020, 14:58:52)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-39)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> print('hello')
hello
>>> # Ctrl+D to quit
If you would rather not do this by hand, a one-click shell script is available (see the CentOS 7 shell-script Python 3.8 installer download).
Uninstalling the original Python environment
CentOS 7 ships with Python 2.7 by default. If you need to uninstall it, note that removal can break the system, so proceed carefully:
rpm -qa | grep python | xargs rpm -ev --allmatches --nodeps
whereis python | xargs rm -frv
whereis python
Installing new Python 2.7.13 and Python 3.6.2 (keeping Python 2 and Python 3 side by side, with the default changed to Python 3.6.2)
Preparation:
1. Install the wget command (for downloading packages online):
yum -y install wget
2. Prepare the build environment:
yum groupinstall 'Development Tools'
yum install zlib-devel bzip2-devel openssl-devel ncurses-devel
 
Installation:
1. Enter the download directory:
cd /usr/local/src
 
2. Download and install the new Python 2:
wget https://www.python.org/ftp/python/2.7.13/Python-2.7.13.tgz
tar -zxvf Python-2.7.13.tgz
cd Python-2.7.13
./configure
make all
make install
make clean
make distclean
rm -rf /usr/bin/python
rm -rf /usr/bin/python2
rm -rf /usr/bin/python2.7
ln -s /usr/local/bin/python2.7 /usr/bin/python
ln -s /usr/local/bin/python2.7 /usr/bin/python2
ln -s /usr/local/bin/python2.7 /usr/bin/python2.7
/usr/bin/python -V
/usr/bin/python2 -V
/usr/bin/python2.7 -V
rm -rf /usr/local/bin/python
rm -rf /usr/local/bin/python2
ln -s /usr/local/bin/python2.7 /usr/local/bin/python
ln -s /usr/local/bin/python2.7 /usr/local/bin/python2
python -V
python2 -V
python2.7 -V

3. Download and install the new Python 3:
wget https://www.python.org/ftp/python/3.6.2/Python-3.6.2.tgz
tar -zxvf Python-3.6.2.tgz
cd Python-3.6.2
./configure
make all
make install
make clean
make distclean
rm -rf /usr/bin/python
rm -rf /usr/bin/python3
rm -rf /usr/bin/python3.6
ln -s /usr/local/bin/python3.6 /usr/bin/python
ln -s /usr/local/bin/python3.6 /usr/bin/python3
ln -s /usr/local/bin/python3.6 /usr/bin/python3.6
/usr/bin/python -V
/usr/bin/python3 -V
/usr/bin/python3.6 -V
rm -rf /usr/local/bin/python
rm -rf /usr/local/bin/python3
ln -s /usr/local/bin/python3.6 /usr/local/bin/python
ln -s /usr/local/bin/python3.6 /usr/local/bin/python3
python -V
python3 -V
python3.6 -V
Install pip:
curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py   # download the install script
sudo python get-pip.py    # run the install script
Install django:
su root
pip install django==1.10.6
 
Appendix: installing sqlite 3.8:
1. wget http://www.sqlite.org/2015/sqlite-autoconf-3081101.tar.gz
2. tar -xvzf sqlite-autoconf-3081101.tar.gz


Detailed steps for installing Apache Spark 3.1.0 on Linux
Prerequisites for installing Spark on Linux: deploy Hadoop and install Scala.
Matching versions:
JDK 1.8.271
Hadoop 2.6.0
Scala 2.11.0
Apache Spark 3.1.0
Step 1: download JDK 8u271 for Linux x64
Download page: https://www.oracle.com/java/technologies/javase/javase-jdk8-downloads.html
wget --no-cookies --no-check-certificate --header "Cookie: gpw_e24=http%3A%2F%2Fwww.oracle.com%2F; oraclelicense=accept-securebackup-cookie" "https://download.oracle.com/otn/java/jdk/8u271-b09/61ae65e088624f5aaa0b1d2d801acb16/jdk-8u271-linux-x64.tar.gz?AuthParam=1610434774_54f5ca4ffe47aeb4b53c758f1306d437"


Download Spark from https://spark.apache.org/downloads.html, or on the command line:
wget https://mirrors.ocf.berkeley.edu/apache/spark/spark-3.0.1/spark-3.0.1-bin-hadoop2.7.tgz

 
 
Step 2: extract
tar -zxvf spark-2.2.0-bin-hadoop2.6.tgz
Step 3: configure the environment variables
vi /etc/profile
#SPARK_HOME
export SPARK_HOME=/home/hadoop/spark-2.2.0-bin-hadoop2.6
export PATH=$SPARK_HOME/bin:$PATH
Step 4: Spark configuration
spark-env.sh:
JAVA_HOME=/home/hadoop/jdk1.8.0_144
SCALA_HOME=/home/hadoop/scala-2.11.0
HADOOP_HOME=/home/hadoop/hadoop-2.6.0
HADOOP_CONF_DIR=/home/hadoop/hadoop-2.6.0/etc/hadoop
SPARK_MASTER_IP=ltt1.bg.cn
SPARK_MASTER_PORT=7077
SPARK_MASTER_WEBUI_PORT=8080
SPARK_WORKER_CORES=1
SPARK_WORKER_MEMORY=2g  # memory Spark may use; the default is 1g, set to 2g here
SPARK_WORKER_PORT=7078
SPARK_WORKER_WEBUI_PORT=8081
SPARK_WORKER_INSTANCES=1
spark-defaults.conf:
spark.master spark://ltt1.bg.cn:7077
slaves:
ltt3.bg.cn
ltt4.bg.cn
ltt5.bg.cn

To integrate with Hive and use a MySQL database as the metastore, put the MySQL JDBC connector jar (mysql-connector-java-5.1.7-bin.jar) into the $SPARK_HOME/jars directory.
Step 5: distribute spark-2.2.0-bin-hadoop2.6 to the other nodes and start it:
[hadoop@ltt1 sbin]$ ./start-all.sh
starting org.apache.spark.deploy.master.Master, logging to /home/hadoop/spark-2.2.0-bin-hadoop2.6/logs/spark-hadoop-org.apache.spark.deploy.master.Master-1-ltt1.bg.cn.out
ltt5.bg.cn: starting org.apache.spark.deploy.worker.Worker, logging to /home/hadoop/spark-2.2.0-bin-hadoop2.6/logs/spark-hadoop-org.apache.spark.deploy.worker.Worker-1-ltt5.bg.cn.out
ltt4.bg.cn: starting org.apache.spark.deploy.worker.Worker, logging to /home/hadoop/spark-2.2.0-bin-hadoop2.6/logs/spark-hadoop-org.apache.spark.deploy.worker.Worker-1-ltt4.bg.cn.out
ltt3.bg.cn: starting org.apache.spark.deploy.worker.Worker, logging to /home/hadoop/spark-2.2.0-bin-hadoop2.6/logs/spark-hadoop-org.apache.spark.deploy.worker.Worker-1-ltt3.bg.cn.out
Check the processes.
Master node:
[hadoop@ltt1 sbin]$ jps
1346 NameNode
1539 JournalNode
1812 ResourceManager
1222 QuorumPeerMain
1706 DFSZKFailoverController
2588 Master
2655 Jps
Worker node:
[hadoop@ltt5 ~]$ jps
1299 NodeManager
1655 Worker
1720 Jps
1192 DataNode
Open the Spark web management page: http://ltt1.bg.cn:8080

 
Spark installation is complete.
1. Extract the package: tar -xvf jdk-8u271-linux-x64.tar.gz



2. Enter the extracted jdk directory, run pwd to get the current working path, then edit the file: vi ~/.bash_profile

3. Append the jdk environment variables at the end of the ~/.bash_profile file.
4. Make the file take effect, then run java -version to check whether the configuration succeeded.
       
Spark installation and configuration
1. Extract the spark package: tar -xvf spark-2.4.3-bin-hadoop2.7.tgz

2. Enter the extracted directory, then the conf directory, and look at the configuration files.

3. Modify the spark-env.sh configuration file. Note that this file does not exist by default; copy spark-env.sh.template and name the copy spark-env.sh.

4. Look up the current JAVA_HOME path from the step above.

5. Edit the spark-env.sh file and append the settings at the end of the file.

6. Back in the spark directory, find the sbin directory and start spark with the command: sbin/start-all.sh

7. Run jps to check whether it started successfully.

8. The examples/jars directory under the spark root holds the example jar files.

9. Use one of those jars to test, estimating the value of pi.

10. Back in the spark directory, run the command; the 100 in it is a configurable sample count, and choosing a larger value makes the test more precise.

The result shown:

11. Create two directories, input and output, as the file input and output directories.

12. In the input directory, create a data.txt file with some content.


13. Start the spark-shell interactive tool; the highlighted log line indicates that the variable sc is the Spark context used for operations.

14. Count word occurrences with Scala in spark:
sc.textFile reads the file, split(" ") splits each line on spaces, map((_, 1)) turns each word into a (word, 1) tuple for counting,
and reduceByKey adds up the counts for equal keys.
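The spark-shell pipeline just described can be emulated in plain Python to see what each stage does. This is an illustration of the flatMap/map/reduceByKey idea, not PySpark code; the names reduce_by_key and the sample lines are my own:

```python
from collections import defaultdict

def reduce_by_key(pairs):
    """Like Spark's reduceByKey(_ + _): sum the values for equal keys."""
    acc = defaultdict(int)
    for key, value in pairs:
        acc[key] += value
    return dict(acc)

lines = ["spark hadoop spark", "hadoop hive"]          # sc.textFile
pairs = [(w, 1) for line in lines for w in line.split()]  # flatMap + map
print(reduce_by_key(pairs))
```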

Spark cluster installation and setup
Spark 1.0.0, a new major version, was officially released on 2014-05-30. The new spark version brings new features and a richer API: Spark 1.0.0 adds the Spark SQL component, strengthens the standard libraries (ML, Streaming, GraphX), and improves Java and Python language support.
Below we first install a Spark 1.0.0 cluster, here on two servers: one acts as the master/namenode machine, the other as a slave/datanode machine; to add more slaves, just repeat the slave part.
System versions:
· master: Ubuntu 12.04
· slave: Ubuntu 12.04
· hadoop: hadoop 2.2.0
· spark: spark 1.0.0
1. Install the JDK and the hadoop cluster
For the procedure see: http://www.cnblogs.com/tec-vegetables/p/3778358.html

2. Download and install Scala
· Scala download address: http://www.scala-lang.org/download/2.11.1.html; here we download the latest version, scala 2.11.1.
· Extract scala and put it in the /usr/lib directory:
    tar -xzvf scala-2.11.1.tgz
    mv scala-2.11.1 /usr/lib
· Configure the scala environment variables: sudo vi /etc/profile
   Add the scala path at the end of the file.
   Type source /etc/profile to make the path take effect.
· Test scala: scala -version   # the scala version info appearing means it installed successfully
PS: scala needs to be configured on the slave nodes as well.
3. Download and install spark
· Spark 1.0.0 download address: http://spark.apache.org/downloads.html. Extract spark and put it in /home/hadoop:
  tar -xzvf spark-1.0.0-bin-hadoop2.tgz
· Configure the spark environment variables: sudo vi /etc/profile
  Add the spark path at the end of the file.

  Type source /etc/profile to make the path take effect.
· Configure the conf/spark-env.sh file.
  If the file does not exist, rename spark-env.sh.template, then add the scala, java, and hadoop paths and the master IP etc. to the file:
  mv spark-env.sh.template spark-env.sh
  vi spark-env.sh
  
  
  
· Add the slave nodes' hostnames to conf/slaves, one per line:
  vi slaves
  
4. Install and configure spark on the slave machines
Now distribute the spark directory on the master machine to the slave nodes. Note that the spark paths on master and slaves must be identical, because the master logs in to the slaves and runs commands assuming the slaves use the same spark path:
scp -r spark-1.0.0-bin-hadoop2 hadoop@slave:/home/hadoop
5. Start the spark cluster
On the master machine, run:
cd ~/spark-1.0.0-bin-hadoop2/sbin
./start-all.sh
Check whether the processes started: type jps

Configuration is complete.
6. Try one of spark's bundled examples:
./bin/run-example SparkPi
Implementing a spark app in Scala
See the official guide; the example counts the number of lines containing the letter 'a' and the letter 'b' in an input file. The site provides Scala, Java, and Python implementations; here we do the Scala one. This requires installing SBT (with sbt we can create, test, run, and submit jobs simply; think of SBT as the Maven of the Scala world).
Spark 1.0.0 does not bundle sbt, so we install it by hand; sudo apt-get install sbt did not work (no sbt package was found on the system), so manual installation it is. The method:
· Download: from the sbt download page; the newest version at the time was sbt 0.13.5.
· Extract sbt to the /home/hadoop directory (hadoop is the user name, i.e. HOME):
  tar -zxvf sbt-0.13.5.tgz
  cd sbt/bin
  java -jar sbt-launch.jar    # runs the sbt setup; it takes about an hour and downloads a lot, so stay online
· Once it succeeds, configure the sbt environment variables in /etc/profile:
  sudo vi /etc/profile
  
  Type source /etc/profile to make the path take effect.
With sbt installed, let's write a simple spark app:
· Create a directory: mkdir ~/SimpleApp
· Create this directory structure under the SimpleApp directory:
        
· simple.sbt file contents:
name := "Simple Project"
version := "1.0"
scalaVersion := "2.10.4"
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.0.0"
resolvers += "Akka Repository" at "http://repo.akka.io/releases/"
· SimpleApp.scala file contents:
/* SimpleApp.scala */
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
object SimpleApp {
  def main(args: Array[String]) {
    val logFile = "YOUR_SPARK_HOME/README.md" // Should be some file on your system
    val conf = new SparkConf().setAppName("Simple Application")
    val sc = new SparkContext(conf)
    val logData = sc.textFile(logFile, 2).cache()
    val numAs = logData.filter(line => line.contains("a")).count()
    val numBs = logData.filter(line => line.contains("b")).count()
    println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))
  }
}
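What SimpleApp.scala computes can be sketched in plain Python: filter the lines containing each letter and count them. This is only an illustration of the filter/count logic, not Spark code; the function name count_lines and the sample lines are my own:

```python
def count_lines(lines):
    """Count lines containing 'a' and lines containing 'b',
    like the two filter(...).count() calls in SimpleApp.scala."""
    num_as = sum(1 for line in lines if "a" in line)
    num_bs = sum(1 for line in lines if "b" in line)
    return num_as, num_bs

sample = ["apache spark", "big data", "scala"]
print("Lines with a: %s, Lines with b: %s" % count_lines(sample))
```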
PS: because of how the hadoop paths were configured during the earlier spark setup, the input path YOUR_SPARK_HOME/... here is actually a location in the HDFS file system, related to the setting in hadoop's core-site.xml (see there for details; this is an easy place to get it wrong); you need to put the README.md file into HDFS first:

· Compile:
  cd ~/SimpleApp
  sbt package     # packaging takes a while and ends with [success] XXX
  PS: on success many files are generated, e.g. target/scala-2.10/simple-project_2.10-1.0.jar
· Run:
  spark-submit --class "SimpleApp" --master local target/scala-2.10/simple-project_2.10-1.0.jar
· Result:

7. Stop the spark cluster:
cd ~/spark-1.0.0-bin-hadoop2/sbin
./stop-all.sh
Installing the JDK and Hadoop
Ubuntu 12.04 Hadoop 2.2.0 cluster setup
Let's walk through setting up a Hadoop 2.2.0 cluster on Ubuntu 12.04. Two servers are used here: one as the master/namenode machine and one as the slave/datanode machine; to add more slaves, simply repeat the slave-related steps.
System versions:
· master: Ubuntu 12.04
· slave: Ubuntu 12.04
· hadoop: Hadoop 2.2.0
· Install the ssh service: sudo apt-get install ssh
· Also update vim: sudo apt-get install vim  # on a freshly installed system the vi arrow keys may otherwise produce A/B characters
1. Install the JDK environment on both master and slave
Download the JDK; the version installed here is 1.7.0_60, from the official Java download page.
Unpack the JDK: tar xvf jdk-7u60-linux-i586.tar.gz
Create a java folder under /usr/local: mkdir /usr/local/java
Move the unpacked files into the newly created java folder: sudo mv jdk1.7.0_60 /usr/local/java
Edit the /etc/profile file: sudo vi /etc/profile
Append the JDK paths at the end of the file:

Run `source /etc/profile` to make Java take effect.
Test whether Java is fully installed: java -version   # version output indicates a successful install
2. Change the machine names of the namenode (master) and the child nodes (slaves):
sudo vi /etc/hostname


A reboot is required for the change to take effect: sudo reboot
3. Map hostnames to IPs on the namenode (master)
sudo vi /etc/hosts    # add the machine names of slave and master with their corresponding IPs

PS: master and slave are the hostnames of the namenode and datanode machines respectively.
4. On both master and slave, create the hadoop user and group and grant sudo rights
sudo addgroup hadoop
sudo adduser -ingroup hadoop hadoop   # the first "hadoop" is the user group, the second is the username
Then give the hadoop user sudo rights by editing the /etc/sudoers file:
sudo vi /etc/sudoers
Add: hadoop  ALL=(ALL:ALL) ALL

PS: this operation must be done on both the master and slave machines.
5. Set up passwordless ssh login
Log in to the system as the hadoop user: su hadoop
Generate keys to establish trust between namenode and datanodes. ssh can generate rsa or dsa keys; rsa is used by default:
In the /home/hadoop directory run: ssh-keygen -t rsa -P ""
Press Enter through the prompts; the key files are generated under /home/hadoop/.ssh:
Append id_rsa.pub to the authorized_keys authorization file: cat id_rsa.pub >> authorized_keys

Generate a key on the child node as well: ssh-keygen -t rsa -P ""
Send the master's authorized_keys to the child node:
scp ~/.ssh/authorized_keys hadoop@slave1:~/.ssh
Now test the ssh trust: ssh hadoop@slave1
If you can log in without entering a password, the ssh trust has been established successfully.
6. Install Hadoop (only the master machine needs configuring; the slaves receive a direct copy)
Download Hadoop into /usr/local (see the download page).
Unpack hadoop-2.2.0.tar.gz: sudo tar zxf hadoop-2.2.0.tar.gz
Rename the unpacked folder to hadoop: sudo mv hadoop-2.2.0 hadoop
Set the owner of the hadoop folder to the hadoop user: sudo chown -R hadoop:hadoop hadoop
(1) Configure the etc/hadoop/hadoop-env.sh file
sudo vi /usr/local/hadoop/etc/hadoop/hadoop-env.sh
Find the export JAVA_HOME line and change it to this machine's JDK path.

(2) Configure the etc/hadoop/core-site.xml file
sudo vi /usr/local/hadoop/etc/hadoop/core-site.xml
Add the following content inside the configuration section:

PS: master is the namenode's machine name as written in the /etc/hosts file.
(3) Configure the etc/hadoop/mapred-site.xml file; if it does not exist at that path, rename mapred-site.xml.template
sudo vi /usr/local/hadoop/etc/hadoop/mapred-site.xml
Add the following content inside the configuration section:

PS: master is the namenode's machine name as written in the /etc/hosts file.
(4) Configure the hdfs-site.xml file; if it does not exist at that path, rename hdfs-site.xml.template
sudo vi hdfs-site.xml
Add the following content inside the configuration section:

PS: on the slave nodes, change the contents accordingly:
· /usr/local/hadoop/datalog1 and /usr/local/hadoop/datalog2
· /usr/local/hadoop/data1 and /usr/local/hadoop/data2
· the digit 1 etc. indicates which slave node it is
(5) Add the slave machine names to the slaves file, one per line

7. Distribute the configuration files to the slave nodes
Send the configuration files to the slave child nodes by first copying them into /home/hadoop on each child node (log in on the child node as the hadoop user: su hadoop):
sudo scp /etc/hosts hadoop@slave1:/home/hadoop
scp -r /usr/local/hadoop hadoop@slave1:/home/hadoop
PS: slave1 is the child node's name; repeat the distribution for every slave node.
On the datanode machines (slave nodes), move the files to the same paths as on the master:
sudo mv /home/hadoop/hosts /etc/hosts  (run on the child node)
sudo mv /home/hadoop/hadoop /usr/local  (run on the child node)
PS: if mv complains about the folder, add the -r parameter.
Set the owner: sudo chown -R hadoop:hadoop hadoop    (run on the child node)
PS: on the child-node (datanode) machines, delete data1, data2, and logs inside the copied hadoop folder.
Configuration is now complete.
PS: it is best to write the hadoop command path into the /etc/profile file, so the hadoop and hdfs commands work without prefixing a path such as bin/hadoop every time:
sudo vi /etc/profile

Then run: source /etc/profile
8. Run the WordCount example
First go to the /usr/local/hadoop directory and restart hadoop:
cd /usr/local/hadoop/sbin
./stop-all.sh
cd /usr/local/hadoop/bin
hdfs namenode -format    # format the cluster
cd /usr/local/hadoop/sbin
./start-all.sh
Check the connections on the namenode:
hdfs dfsadmin -report      # the result should list the machines

Suppose the test files are test1.txt and test2.txt. First create the directory input:
hadoop dfs -mkdir input
Upload the test files to hadoop:
hadoop dfs -put test1.txt input
hadoop dfs -put test2.txt input
Take the child nodes out of safe mode, otherwise the input files cannot be read:
hdfs dfsadmin -safemode leave
Run the wordcount program:
hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar wordcount input output
View the result:
hadoop dfs -cat output/part-r-00000
PS: before running again you must delete the output folder: hadoop dfs -rmr output

Installing Hadoop 2.4.0 on Ubuntu 14.04 (standalone mode)
1. Create the hadoop group and hadoop user on Ubuntu
    Add a hadoop user group, add a hadoop user inside that group, and use that user for all subsequent hadoop-related operations.
(1) Create the hadoop user group

(2) Create the hadoop user
    sudo adduser -ingroup hadoop hadoop
    Press Enter and you will be prompted for a new UNIX password; this is the password of the new hadoop user. Type it and press Enter.
    If you press Enter without typing a password, you will be asked again; the password may not be empty.
    Confirm whether the information is correct; if there is no problem, type Y and press Enter.
 (3) Grant the hadoop user privileges
     Run: sudo gedit /etc/sudoers
     This opens the sudoers file.
Give the hadoop user the same privileges as the root user.




  
2. Log in to the Ubuntu system as the newly created hadoop user

3. Install ssh
sudo apt-get install openssh-server; after installation, start the service:
sudo /etc/init.d/ssh start

Check whether the service started correctly: ps -e | grep ssh

 
 
 
 
 
 
Set up passwordless login by generating a private/public key pair:
ssh-keygen -t rsa -P ""

cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys




Log in via ssh:
ssh localhost


Log out:
exit

4. Install the Java environment
sudo apt-get install openjdk-7-jdk

Check the installation with the command: java -version; version output indicates a successful install

 
 
 
 
5. Install Hadoop 2.4.0
    (1) Download from the official mirror http://mirror.bit.edu.cn/apache/hadoop/common

    (2) Install

        Unpack:
        sudo tar xzf hadoop-2.4.0.tar.gz
        Suppose hadoop is to be installed under /usr/local.
        Copy it into /usr/local, renaming the folder to hadoop:
        sudo mv hadoop-2.4.0 /usr/local/hadoop


Grant the user read/write permission on the folder:
        sudo chmod 774 /usr/local/hadoop

     
(3) Configure
        1) Configure ~/.bashrc
Before configuring this file you need to know Java's install path in order to set the JAVA_HOME environment variable; query it with the command:
        update-alternatives --config java
        Example result:

The full path is
    /usr/lib/jvm/java-7-openjdk-amd64/jre/bin/java
    so take the leading part: /usr/lib/jvm/java-7-openjdk-amd64
    Configure the .bashrc file:
    sudo gedit ~/.bashrc

    This opens the file in an editor window; append the following at the end, then save and close the window:
#HADOOP VARIABLES START
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
export HADOOP_INSTALL=/usr/local/hadoop
export PATH=$PATH:$HADOOP_INSTALL/bin
export PATH=$PATH:$HADOOP_INSTALL/sbin
export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export HADOOP_HDFS_HOME=$HADOOP_INSTALL
export YARN_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_INSTALL/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_INSTALL/lib"
#HADOOP VARIABLES END


The final result looks like:

Run the following to make the added environment variables take effect:
        source ~/.bashrc
2) Edit /usr/local/hadoop/etc/hadoop/hadoop-env.sh
         Open the file for editing with:
        sudo gedit /usr/local/hadoop/etc/hadoop/hadoop-env.sh
Find the JAVA_HOME variable and change it to:
        export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
The modified hadoop-env.sh looks like:

6. WordCount test
 The standalone installation is complete; now run hadoop's bundled WordCount example to verify the installation.
    Create an input folder under /usr/local/hadoop:
mkdir input

    Copy README.txt into input:
cp README.txt input
    Run WordCount:
    bin/hadoop jar share/hadoop/mapreduce/sources/hadoop-mapreduce-examples-2.4.0-sources.jar org.apache.hadoop.examples.WordCount input output
  Then run cat output/* to view the word-count result.




Starting the Spark cluster
Check whether the processes start, i.e. whether the Spark cluster environment was set up successfully.
1) Start the HDFS cluster

2) Check whether the processes started

3) Start the Spark cluster
cd /home/hadoop/spark-1.2.1-bin-hadoop2.4/sbin
./start-all.sh
jps (process list without the HDFS cluster started, Spark processes only)
jps (process list with the HDFS cluster and Spark processes running)
Enter http://master:8080 in a browser to view the Spark cluster's running status


6) Go to Spark's bin directory and start the spark-shell console

7) Visit http://master:4040 to see the Spark WebUI page; the Spark cluster environment has been set up successfully








8) Run tests in spark-shell
Earlier README.txt was uploaded to /user/input; now have Spark read the README.txt file from HDFS

Read the HDFS file
(example of a successful connection)

(example of a failed connection: Hadoop 2.6 seems not to recognize localhost, so use the machine's IP address)

Count the total number of lines in the README.txt file


Filter the README.txt file,
keeping the lines that include a given word



Verify with wc that README.txt contains 4 occurrences of the word "The"

Implement Hadoop's wordcount functionality
First read readmeFile and run the command

Then submit the job with the collect command


Execution as shown in the WebUI:

cd
./bin/run-example SparkPi

Stop the cluster:
sbin/stop-all.sh


OK, the Spark cluster environment test is finished. The steps in summary:
(1) cd /usr/local/hadoop/sbin
./start-all.sh
(2) cd /home/hadoop/spark-1.2.1-bin-hadoop2.4/sbin
./start-all.sh
(3) jps
(4) http://192.168.0.118:8080
(5) cd /home/hadoop/spark-1.2.1-bin-hadoop2.4/bin
spark-shell
(6) http://192.168.0.118:4040
(7) write scala programs and submit them for testing
(8) stop the cluster
Spark performance tuning
After developing a Spark job you should configure appropriate resources for it. Spark's resource parameters are basically set as arguments of the spark-submit command. Unreasonable resource settings may leave cluster resources under-used and make the job run extremely slowly; or the allocated resources may exceed what the queue can provide, causing various exceptions.

An example written in pyspark illustrates this.
To run a pyspark program you can either type pyspark in a Linux terminal and paste the code in interactively, or submit it with the spark-submit command so that, like Hive, it runs under YARN scheduling.

# -*- coding: utf-8 -*-
from pyspark.sql import HiveContext, SparkSession

# Initialize the SparkSession with Hive support enabled.
# For interactive terminal testing, widen truncated output fields to 100 characters.
spark = SparkSession.builder.appName("name").config(
    "spark.debug.maxToStringFields", "100").enableHiveSupport().getOrCreate()
# Initialize the HiveContext
hive = HiveContext(spark.sparkContext)
# Enable cross-join support in Spark SQL
spark.conf.set("spark.sql.crossJoin.enabled", "true")

# Code for reading data from a Parquet file.
# Parquet is a columnar storage format for analytical workloads, co-developed by
# Twitter and Cloudera and widely used on AWS; here the Parquet data is stored on AWS S3.
# S3 (Simple Storage Service) is AWS's data storage service.
df1 = spark.read.load(
    path='...',  # path elided in the original
    format='parquet', header=True)

# Code for reading data from a CSV file.
# CSV files here serve as the standard for manually exchanged files.
# Plain CSV is simple: numeric and string data are stored as text, and precision is preserved.
df2 = spark.read.load(
    path='...',  # path elided in the original
    format='csv', header=True)

# Code for reading data from a Hive table or view
df3 = hive.sql("""
select
    *
from <database>.<table>""")

# Cache a dataset that is read multiple times (Spark optimization tip #1),
# so that when the pyspark code references it repeatedly,
# Spark does not re-read the same file data each time.
df4 = spark.read.load(
    path='...',  # path elided in the original
    format='parquet', header=True).cache()

# Register the datasets under names so they can be queried from Spark SQL
df1.createOrReplaceTempView("DF1")

df2.createOrReplaceTempView("DF2")

df3.createOrReplaceTempView("DF3")

df4.createOrReplaceTempView("DF4")

# Code for building a dataset with Spark SQL.
# If the data volume is large and the business logic complex, persist the dataset
# to the storage service's disk, so that when later Spark SQL steps reference it,
# the computation logic is not re-run, saving compute resources (Spark optimization tip #2).
df5 = spark.sql("""
SELECT
    ...
from DF1 AS D1
LEFT JOIN DF2 AS D2
ON ...
LEFT JOIN DF4 AS D4
ON ...
WHERE ...
""").persist()
# count() is an action operator that triggers execution, making the preceding
# persist() cache take effect immediately.
# Without the count(), the persist() cache would only take effect at the next
# action operator or at the end of the program.
df5.count()
df5.createOrReplaceTempView("DF5")

# Code for building another dataset with Spark SQL
df6 = spark.sql("""
SELECT
    ...
from DF5 AS D5
LEFT JOIN DF3 AS D3
ON ...
LEFT JOIN DF4 AS D4
ON ...
WHERE ...
""")

# Write the result dataset out as Parquet files
df6.write.parquet(
    path='...',  # path elided in the original
    mode="overwrite")

# Release the cache
df5.unpersist()

# Stop the sparkContext
spark.stop()

1. Basic principles of how a Spark job runs

      The detailed principle is shown in the figure above. After we submit a Spark job with spark-submit, a corresponding Driver process is started. Depending on the deploy mode, the Driver may start locally or on a worker node in the cluster. The Driver process itself occupies a certain amount of memory and CPU cores according to the parameters we set. The first thing the Driver does is request the resources the Spark job needs from the cluster manager (a Spark Standalone cluster can manage its own resources, but clusters generally use YARN as the resource manager); the resources here are the Executor processes. The YARN cluster manager starts a certain number of Executor processes on the worker nodes according to the resource parameters set for the job, each occupying a certain amount of memory and CPU cores.
  After the resources needed for execution are acquired, the Driver starts scheduling and executing the job code we wrote. The Driver splits the Spark job code into multiple stages, each executing a piece of the code, and creates a batch of tasks for each stage; these tasks are distributed to the Executor processes for execution. A task is the smallest computing unit, responsible for executing exactly the same computing logic (i.e. a code fragment we wrote); different tasks simply process different data. After all tasks of one stage finish, the intermediate results are written to local disk files on the nodes, and the Driver then schedules the next stage. The input to the next stage's tasks is the intermediate output of the previous stage. This repeats until all of the code logic has been executed and all of the data has been computed into the result we want.
  Spark divides stages at shuffle-class operators. If the code executes a shuffle-class operator (such as reduceByKey or join), a stage boundary is drawn at that operator. Roughly speaking, the code before the shuffle operator is one stage, and the code from the shuffle operator onwards is the next stage. When a stage begins executing, each of its tasks may pull, over the network, the keys it needs to process from the nodes where the previous stage's tasks ran, and then apply the operator function we wrote to all values of the same key (for example, the function passed to reduceByKey()); this process is the shuffle.
  When we use cache/persist persistence operations in the code, the data computed by each task is saved in the Executor process's memory or on the node's disk, according to the chosen persistence level.
  Executor memory is divided into three parts: the first is used by tasks executing our code, by default 20% of total Executor memory; the second is used by tasks pulling the previous stage's task output during shuffle for aggregation and similar operations, by default also 20%; the third is used for RDD persistence, by default 60% of total Executor memory.
  A task's execution speed is directly related to the number of CPU cores per Executor process. One CPU core executes one thread at a time, and each Executor's tasks run concurrently as threads, one thread per task. If the number of cores is reasonably adequate and the number of allocated tasks is reasonable, these task threads can generally be executed quickly and efficiently.
  This explains the basic principles of how a Spark job runs; refer to the figure above to understand it. Understanding the basic principles of job execution is the essential premise for resource parameter tuning.
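The three-way Executor memory split described above can be sketched in plain Python. This is a hypothetical calculation, not Spark code; the fractions are the defaults stated in the text (20% task code, 20% shuffle, 60% RDD persistence), and the 6 GB figure is illustrative.

```python
# Hypothetical split of an executor's memory into the three regions described above.
def executor_memory_regions(total_mb, storage_fraction=0.6, shuffle_fraction=0.2):
    """Split executor memory (MB) into storage, shuffle, and task-code regions."""
    storage = total_mb * storage_fraction   # RDD persistence (default 60%)
    shuffle = total_mb * shuffle_fraction   # shuffle aggregation buffers (default 20%)
    task = total_mb - storage - shuffle     # memory left for user task code
    return {"storage": storage, "shuffle": shuffle, "task": task}

# A 6 GB executor under the default fractions
print(executor_memory_regions(6 * 1024))
```

Adjusting the two fractions is exactly what the spark.storage.memoryFraction and spark.shuffle.memoryFraction parameters discussed below do.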
2. Resource parameter tuning
      Once you understand the basics of how a Spark job runs, the resource-related parameters are easy to understand. "Spark resource parameter tuning" means optimizing resource-usage efficiency, and thereby the job's performance, by adjusting the various places in the Spark execution process where resources are used. Below are the main resource parameters in Spark; each corresponds to a part of the job execution described above, and a reference tuning value is given.
num-executors
  Description: sets the total number of Executor processes used to execute the Spark job. When the Driver requests resources from the YARN cluster manager, YARN starts the corresponding number of Executor processes on the cluster's worker nodes. This parameter is very important: if it is not set, only a small number of Executors are started by default and the job runs very slowly.
  Tuning suggestion: around 50–100 Executor processes is usually appropriate for a job; avoid setting too few or too many. Too few fails to make full use of cluster resources; too many may exceed what the queue can provide.
executor-memory
  Description: sets the memory of each Executor process. Executor memory often directly determines Spark job performance and is directly related to the common JVM OOM exceptions.
  Tuning suggestion: 4G–8G per Executor is a reasonable reference; the concrete value depends on your team's resource queue. num-executors multiplied by executor-memory is the total memory the job requests, which must not exceed the queue's maximum memory. Also, if the queue is shared within the team, the total requested memory had better not exceed 1/3 to 1/2 of the queue's total memory, to avoid your Spark job hogging the queue and blocking colleagues' jobs.
executor-cores
  Description: sets the number of CPU cores per Executor process. This determines each Executor's capacity to execute task threads in parallel, since one CPU core executes one task thread at a time. The more cores per Executor, the faster it can finish its allocated task threads.
  Tuning suggestion: 2–4 cores per Executor is usually appropriate. It likewise depends on the queue's CPU core limit and on how many Executors you set, which together determine how many cores each Executor can be given. As with memory, on a shared queue it is advisable that num-executors * executor-cores not exceed roughly 1/3 to 1/2 of the queue's total cores, to avoid affecting colleagues' jobs.
driver-memory
  Description: sets the memory of the Driver process.
  Tuning suggestion: the Driver's memory is usually left unset or set to about 1G, which should be enough. The only thing to note is that if the collect operator is used to pull all of an RDD's data to the Driver for processing, the Driver's memory must be large enough, otherwise an OOM (out-of-memory) error occurs.
spark.default.parallelism
  Description: sets the default number of tasks per stage. This parameter is extremely important; leaving it unset directly hurts the Spark job's performance.
  Tuning suggestion: a default of 500–1000 tasks per job is appropriate. A common mistake is not setting this parameter, in which case Spark derives the task count from the number of underlying HDFS blocks (one task per block by default), which is usually far too few (e.g. a few dozen tasks). If the task count is too small, all the Executor parameters set earlier are wasted: imagine dozens of Executors with ample memory and cores but only a handful of tasks, so 90% of the Executors have no task to run at all and resources are squandered. The Spark website therefore recommends setting this parameter to 2–3 times num-executors * executor-cores; e.g. with 300 total Executor CPU cores, around 1000 tasks makes full use of the cluster's resources.
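The sizing rule above is simple arithmetic; a minimal sketch (the numbers are illustrative, not prescriptions):

```python
# Recommended spark.default.parallelism = 2-3x the job's total CPU cores,
# per the tuning suggestion above.
def recommended_parallelism(num_executors, executor_cores, factor=3):
    total_cores = num_executors * executor_cores
    return total_cores * factor

# e.g. 100 executors with 3 cores each -> 300 cores -> 900 tasks at factor 3
print(recommended_parallelism(100, 3))
```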
spark.storage.memoryFraction
  Description: sets the fraction of Executor memory available for persisted RDD data; the default is 0.6, i.e. 60% of Executor memory may hold persisted RDD data. Depending on the chosen persistence strategy, when memory is insufficient the data may not be persisted, or may be written to disk.
  Tuning suggestion: if the Spark job contains many RDD persistence operations, this value can be raised appropriately so that more persisted data fits in memory, avoiding the performance loss of writing data that does not fit to disk. If the job has many shuffle operations and few persistence operations, it is better to lower it. Also, if the job runs slowly due to frequent GC (observe the job's GC time in the Spark web UI), meaning tasks lack memory for executing user code, this value should likewise be lowered.
spark.shuffle.memoryFraction
  Description: sets the fraction of Executor memory available to a task for aggregating the previous stage's output pulled during a shuffle; the default is 0.2, i.e. an Executor uses at most 20% of its memory for this. When a shuffle aggregation exceeds this 20% limit, the excess data is spilled to disk files, which severely degrades performance.
  Tuning suggestion: if the Spark job has few RDD persistence operations but many shuffle operations, it is advisable to lower the persistence memory fraction and raise the shuffle memory fraction, avoiding the performance loss of spilling to disk when shuffle data does not fit in memory. Again, if frequent GC makes the job run slowly, meaning tasks lack memory for user code, this value should be lowered.
There is no fixed value for resource parameter tuning: set the parameters above reasonably according to your actual situation (including the number of shuffle operations and RDD persistence operations in the job and the GC behavior shown in the Spark web UI), with reference to the principles and tuning suggestions given in this article.
3. Example resource parameter settings
      Here is an example spark-submit command for reference; adjust it to your actual situation:
./bin/spark-submit \
  --master yarn-cluster \
  --num-executors 100 \
  --executor-memory 6G \
  --executor-cores 4 \
  --driver-memory 1G \
  --conf spark.default.parallelism=1000 \
  --conf spark.storage.memoryFraction=0.5 \
  --conf spark.shuffle.memoryFraction=0.3 \
4. The three join strategies in Spark
Spark typically uses one of three join strategies:
1. Broadcast Hash Join (BHJ)
2. Shuffle Hash Join (SHJ)
3. Sort Merge Join (SMJ)
Broadcast Hash Join
When a small table is joined with a large table, the shuffle can be avoided by broadcasting the small table's data to every node and joining there, trading space for the elimination of the time-consuming shuffle.

1. The table to be broadcast must be smaller than the value of spark.sql.autoBroadcastJoinThreshold (10 MB by default), or an explicit broadcast join hint must be added.
2. The base table cannot be broadcast; for example, in a left outer join only the right table can be broadcast.
3. This algorithm only broadcasts small tables; otherwise the data-transfer cost would exceed that of a shuffle.
4. The broadcast goes through the driver, so broadcasting a table that is too large puts memory pressure on the driver.
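The mechanics can be sketched in plain Python (not Spark API code): the small table becomes a hash map that is conceptually copied to every partition of the large table, so each partition joins locally with no shuffle. Data and names are made up for illustration.

```python
# Pure-Python sketch of a broadcast hash join.
small = {1: "a", 2: "b"}                      # small table: id -> value ("broadcast")
large = [[(1, "x"), (3, "y")], [(2, "z")]]    # large table as two partitions

def broadcast_hash_join(large_partitions, broadcast_map):
    out = []
    for part in large_partitions:             # each partition joins independently
        for key, payload in part:
            if key in broadcast_map:          # probe the broadcast hash map
                out.append((key, payload, broadcast_map[key]))
    return out

print(broadcast_hash_join(large, small))
```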
Shuffle Hash Join
The broadcast strategy first collects the data to the driver node and then distributes it to the executor nodes; when the table is too large to broadcast, this puts pressure on both the driver and the executors.
Shuffle Hash Join reduces the pressure on the driver and executors. Its steps:
1. Both tables are reorganized (shuffled) by the join column, so that records with the same join-column value land in the same partition.
2. For each pair of co-located partitions, the smaller table's partition is built into a hash table, and the larger table's records are matched against it.
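The two steps above can be sketched in plain Python (illustrative data only, not Spark API code):

```python
# Pure-Python sketch of a shuffle hash join:
# step 1 repartitions both sides by hash of the join key;
# step 2 builds a hash table per partition and probes it.
def repartition(rows, num_partitions):
    parts = [[] for _ in range(num_partitions)]
    for key, value in rows:
        parts[hash(key) % num_partitions].append((key, value))  # shuffle by key
    return parts

def shuffle_hash_join(left, right, num_partitions=2):
    joined = []
    for lpart, rpart in zip(repartition(left, num_partitions),
                            repartition(right, num_partitions)):
        table = {}                              # build hash table from one side
        for key, value in lpart:
            table.setdefault(key, []).append(value)
        for key, value in rpart:                # probe with the other side
            for lval in table.get(key, []):
                joined.append((key, lval, value))
    return joined

left = [("a", 1), ("b", 2)]
right = [("a", 10), ("c", 30)]
print(sorted(shuffle_hash_join(left, right)))
```

Because both sides use the same hash partitioner, equal keys are guaranteed to meet in the same partition.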

Sort Merge Join
The two strategies above break down when both tables are large: hash-joining two large tables requires one of them to be fully loaded in memory, which puts heavy pressure on memory.
When both tables are large, Spark SQL uses a new join algorithm, Sort Merge Join. Instead of loading all the data and then hash-joining, this algorithm sorts the data before joining.
Both tables are reorganized by the join column so that rows with equal join-column values land in the same partition; the data inside each partition is sorted; then the corresponding records are matched.
Because the tables are partitioned and sorted, Sort Merge Join never needs to load all of one table's data into memory.
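A merge of two pre-sorted sides can be sketched in plain Python (a toy single-partition version, not Spark code):

```python
# Pure-Python sketch of a sort-merge join: sort both sides by the join key,
# then merge with two cursors, so neither side must be fully hash-resident.
def sort_merge_join(left, right):
    left, right = sorted(left), sorted(right)   # sort both sides by key
    i = j = 0
    out = []
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:                                   # equal keys: emit all matches
            jj = j
            while jj < len(right) and right[jj][0] == lk:
                out.append((lk, left[i][1], right[jj][1]))
                jj += 1
            i += 1
    return out

print(sort_merge_join([("b", 2), ("a", 1)], [("a", 10), ("b", 20), ("c", 30)]))
```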


5. The new AQE feature in Spark 3.0
In recent years, CBO has been one of the most successful Spark SQL optimization features.
CBO computes statistics over the business data to optimize queries, e.g. row counts, distinct counts, null counts, and max/min values.
With such statistics, Spark can automatically choose between BHJ and SMJ and, in multi-join scenarios, perform cost-based join reordering (see the earlier article), to optimize the execution plan.
However, these statistics must be computed in advance and cannot reflect the data at run time; in some situations they backfire and actually lower SQL execution efficiency.
AQE solves this problem by dynamically adjusting the execution plan using statistics collected during execution.
1. The framework
For AQE, the important question is when to re-compute and optimize the execution plan. Spark's tasks and operators are pipelined and executed in parallel, but a shuffle or broadcast exchange breaks the operator pipeline; such break points are called materialization points (Materialization Points), and "query stages" denote the fragments separated by the materialization points. Each query stage produces intermediate results, and downstream query stages can execute only after the stage(s) they depend on have finished. After an upstream stage completes, its partition statistics become available while the downstream stages have not yet started, which gives AQE an opportunity for re-optimization.


When a query starts, the AQE framework does not generate a complete final execution plan; it first launches the leaf stages, those with no unfinished upstream stages. As soon as a stage completes, AQE marks it complete in the physical plan and updates the logical plan with the execution data provided by the completed stages. Based on the newly produced statistics, the framework re-runs the optimizer (applying a set of rules dedicated to adaptive execution, such as partition coalescing and data-skew handling) and the ordinary physical planning, obtaining a newly optimized execution plan. It keeps the already-executed stages and launches the next stages, then repeats these steps until the whole query is done.
The AQE framework in Spark 3.0 has three major features:
· dynamically coalescing shuffle partitions
· dynamically switching join strategies
· dynamically optimizing skew joins
Let's look at each of the three features.
① Dynamically coalescing shuffle partitions
When processing very large data volumes, the shuffle is usually the operation with the largest impact on performance, because it is a very expensive operator that must move data over the network to redistribute it to downstream operators.
The number of shuffle partitions is critical. The best number depends on the data, and data sizes differ between queries and between stages, making a concrete number hard to pick:
· if there are too few partitions, each partition holds a large amount of data, and large partitions may spill to disk and slow the query down;
· if there are too many partitions, each holds little data, producing extra network overhead and more load on the Spark task scheduler, again slowing the query.
To solve this problem, set a relatively large initial number of shuffle partitions, and merge adjacent small partitions during execution based on shuffle file statistics.
For example, suppose we run SELECT max(i) FROM tbl GROUP BY j, where table tbl has 2 partitions and is quite small. The initial shuffle partition count is set to 5, so after grouping there are 5 partitions. Without AQE optimization, Spark launches 5 tasks to do the aggregation, even though 3 of the 5 partitions hold very little data.

In this situation, AQE instead generates only 3 reduce tasks.
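The coalescing idea can be sketched in plain Python: merge adjacent small shuffle partitions until each merged partition reaches a target size. The sizes (in MB) and the target are made-up illustrative numbers.

```python
# Sketch of AQE-style coalescing of adjacent small shuffle partitions.
def coalesce_partitions(sizes, target):
    merged, current = [], 0
    for size in sizes:
        current += size
        if current >= target:        # close the merged partition once big enough
            merged.append(current)
            current = 0
    if current:
        merged.append(current)       # trailing small partition, if any
    return merged

# 5 shuffle partitions, 3 of them tiny -> coalesced into 3 reduce-side partitions
print(coalesce_partitions([70, 30, 8, 10, 12], target=30))
```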

② Dynamically switching join strategies
Among the many joins Spark supports, broadcast hash join has the best performance, provided the table to be broadcast is estimated to be below the broadcast threshold. Size estimates easily go wrong when the join input has a filter (hard to estimate) or is itself a chain of operators (also hard to estimate) rather than a plain full scan of a table, which can lead to wrong strategy decisions.
AQE has accurate upstream statistics at run time and can solve this problem. In the example below, the right table's actual size is 15 MB, but in this scenario, after the filter, only 8 MB of data actually participates in the join, below the default 10 MB broadcast threshold, so it should be broadcast.

When converting to BHJ during execution, AQE can even optimize the traditional shuffle into a localized shuffle (e.g. a mapper-local, per-reducer read) to further reduce network overhead.
③ Dynamically optimizing data skew
Data skew, the uneven distribution of data across the cluster's partitions, can severely slow down joins and thus the whole query. AQE automatically detects skew from shuffle file statistics, splits the skewed partitions into smaller sub-partitions, and then joins them.
In this scenario, Table A joins Table B, and partition A0 of Table A is far larger than the other partitions.

AQE splits partition A0 into 2 sub-partitions and joins each of them independently with partition B0 of Table B.

Without this optimization, SMJ would produce 4 tasks, one of which takes far longer than the others. With the optimization, the join has 5 tasks with roughly equal execution times, giving the whole query better performance.
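The splitting step can be sketched in plain Python (a made-up row limit stands in for AQE's size-based threshold):

```python
# Sketch of AQE-style skew handling: split a skewed partition into sub-partitions,
# each of which is then joined independently against the other side's partition.
def split_skewed(rows, max_rows):
    """Split one partition's rows into chunks of at most max_rows rows."""
    return [rows[i:i + max_rows] for i in range(0, len(rows), max_rows)]

a0 = [("k", v) for v in range(6)]   # skewed partition A0 with 6 rows
subparts = split_skewed(a0, 3)      # A0 split into 2 sub-partitions
print(len(subparts))
```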
2. Usage
Set the parameter spark.sql.adaptive.enabled to true to enable AQE (the default in Spark 3.0 is false), subject to the following conditions:
· the query is not a streaming query
· the query contains at least one exchange (a join, aggregation, or window operator) or a subquery
By reducing the dependence on pre-collected static statistics, AQE successfully resolves Spark CBO's difficult trade-off (statistics-generation overhead versus query latency) and its data-precision problems. Compared with the previous, rather rigid CBO, AQE is far more flexible and does not require analyzing the data in advance.
6. General principles for data optimization in a data warehouse
· Store each kind of data only once: when two or more tables duplicate a kind of data, extract those fields together with the key into a separate new table (data warehouse design).
· Push filtering and aggregation in ETL as far upstream, toward the data source, as possible, to reduce the volume of data flowing through the whole pipeline (ETL design).
· When joining large tables, pre-sort the join columns of each table; this reduces the join's time complexity from that of a Cartesian product, with a clearly noticeable performance gain on large data. Where applicable, also add indexes on the join columns of the filtered data to further improve join performance.
7. Wide and narrow dependencies in Spark
The efficiency of RDDs in Spark is closely tied to the DAG; when the DAG scheduler divides the computation into stages, it relies on the dependencies between RDDs. With respect to transformation functions, dependencies between RDDs are classified as narrow dependencies (narrow dependency) and wide dependencies (wide dependency, also called shuffle dependency).
Overview
· A narrow dependency means each partition of the parent RDD is used by at most one partition of the child RDD; a child partition usually corresponds to a constant number of parent partitions (O(1), independent of data size).
· Correspondingly, a wide dependency means multiple child RDD partitions depend on the same parent RDD partition; a child partition usually corresponds to all parent partitions (O(n), proportional to data size).
Wide versus narrow dependencies, illustrated:

Compared with wide dependencies, narrow dependencies are favorable for optimization, mainly for two reasons:
1. A wide dependency corresponds to a shuffle: at run time a parent partition's data must be passed into multiple child partitions, possibly involving data transfer between nodes. With a narrow dependency, each parent partition feeds at most one child partition, and the transformation can usually be completed within one node.
2. When an RDD partition is lost (e.g. a node fails), Spark recomputes the data:
  1. With a narrow dependency, only the parent partitions corresponding to the lost child partition need to be recomputed, and every recomputed parent record is used by the lost child partition: the recomputed data is 100% utilized.
  2. With a wide dependency, each recomputed parent partition may feed multiple child partitions, yet only part of the recomputed parent data belongs to the lost child partition; the rest corresponds to child partitions that were not lost, which is redundant computation. More generally, under a wide dependency a child partition usually depends on multiple parent partitions, so in the extreme case all parent partitions must be recomputed.
  3. As shown in the figure, if partition b1 is lost, a1, a2, and a3 must all be recomputed, producing redundant computation (the data in a1, a2, a3 that corresponds to b2).
Detailed behavior

In the figure, the left side shows a wide dependency: the data of partition 4 of the parent RDD is divided among multiple partitions of the child RDD (one partition to many partitions), which indicates a shuffle; during the shuffle the parent partition's data is split among the child partitions by a hash partitioner (or a custom partitioner). Examples: groupByKey, reduceByKey, join, sortByKey.
The right side of the figure shows a narrow dependency: each parent partition's data goes directly into one corresponding child partition (one partition to one partition); for example, the data of partitions 1 and 5 each enters a single child partition, with no shuffle in the process. Stage division in Spark happens at shuffles (a shuffle can be understood as taking data out of its original partitions, scrambling it, and regrouping it into new partitions). Examples: map, filter.
Summary: if a parent RDD's partition is used by exactly one child RDD partition, the dependency is narrow; otherwise it is wide.
Fault tolerance of wide and narrow dependencies
Spark's lineage-based fault tolerance means that if an RDD fails, it is recomputed from its parent RDDs. If each RDD partition depends on only one parent partition (a narrow dependency), this recomputation is very cheap.

How should Spark's checkpoint (materialization) fault-tolerance mechanism be understood? In the figure, the result of a wide dependency (which has gone through a shuffle) is expensive to recompute, so Spark materializes that result to disk as a backup.
A join can fall into either category: if the join operates on co-partitioned inputs (joining within already-known partitions), it is a narrow dependency; otherwise it is a wide dependency. From the narrow-dependency case of join we can infer that a narrow dependency includes not only one-to-one dependencies but any dependency on a fixed number of parent partitions (that is, the number of parent partitions a child depends on does not change as the RDD's data size changes).
Stage division
Terms:
1. job: the action triggered on an RDD; simply put, a job is generated whenever an RDD action must be executed.
2. stage: the unit a job is composed of; a job is split into one or more stages, which then execute in a defined order.
3. task: the execution unit of a stage; generally, an RDD with as many partitions will have as many tasks, each task processing the data of one partition.
Division rules:
1. Work backwards from the last RDD: break at every wide dependency; at every narrow dependency, add the current RDD to the stage.
2. The number of tasks in a stage is determined by the partition count of the stage's last RDD.
3. The tasks of the last stage are of type ResultTask; the tasks of all earlier stages are of type ShuffleMapTask.
4. The operator representing the current stage must be that stage's last computation step.
Summary: stage division in Spark is based on shuffles, and a wide dependency necessarily involves a shuffle, so we say Spark divides stages by wide and narrow dependencies.
Spark optimization
Why narrow dependencies are favorable for optimization: logically, every RDD operator is a fork/join (not the join operator on datasets, but the barrier used to synchronize multiple parallel tasks): the computation forks to every partition, joins after each partition's computation finishes, and the next operator forks/joins again. Translating this chain of fork/joins naively into a physical implementation is quite uneconomical: first, every RDD (even intermediate results) would need to be materialized into memory or storage, which costs time and space; second, each join acts as a global barrier, which is expensive and gets dragged out by the slowest node. If the child RDD's partitions depend narrowly on the parent's, the classic fusion optimization can be applied, merging two fork/joins into one; if a whole sequence of consecutive transformation operators has narrow dependencies, many fork/joins can be merged into one. Such a merge not only removes a large number of global barriers, but also avoids materializing many intermediate RDDs, which greatly improves performance. This is what Spark calls pipeline (pipeline) optimization.
Spark's pipeline optimization:

When the transformation sequence hits a shuffle-class operation, a wide dependency occurs and the pipeline optimization terminates there. In the concrete implementation, the DAGScheduler walks backwards through the dependency graph from the current operator; when it hits a wide dependency it creates a stage to hold the operator sequence traversed so far, within which pipelining can be applied safely, and then continues backwards from the wide dependency to create the next stage.
Pipeline: in Spark a pipeline flows from one partition to the corresponding next partition; inside a stage everything is narrow dependencies. For details see the pipeline discussion.
Between stages are wide dependencies.
The distributed computation process:

The figure takes Spark's wordcount as the example: by the stage-division rules above, this job divides into 2 stages, with three rows representing the data's read, compute, and store phases.
Looking only at the code, the user cannot perceive how the data actually runs behind the scenes, but the figure shows the data distributed across partitions (roughly, machines), flowing through the RDD partitions under the flatMap, map, and reduceByKey operators (an operator, as said before, is a function computed over RDDs).
The same picture from a higher vantage point:

Spark's runtime architecture consists of a Driver (think of it as the master) and Executors (think workers or slaves). The Driver performs the DAG split of the user's code, divides it into stages, then schedules the tasks corresponding to each stage and submits them to the Executors for computation; thus the Executors execute the tasks of the same stage in parallel.
(Here the Driver and Executor processes are generally located on different machines.)
Understanding stages and tasks here, the figure shows the hierarchy into which a Spark job is divided:

An Application is the whole body of code the user submits with submit. Each action operation (action operator) in the code divides the Application into jobs; each job is divided into stages at wide dependencies; each stage is divided into many (the count determined by the partition count: one task computes one partition) functionally identical tasks; the tasks are then submitted to the Executors for computation, and the execution results are returned to the Driver for aggregation or storage.
This embodies the idea "overall planning on the Driver side, distributed computation on the Executor side, results gathered back to the Driver", the essential idea of distributed computing.
8. Spark operators
Classification
In one view, Spark operators fall broadly into two classes:
     1) Transformation operators: such a transformation does not trigger job submission; it performs the intermediate processing within a job.
   Transformation operations are lazily computed: transforming one RDD into another RDD is not executed immediately; the computation is only truly triggered when an Action operation occurs.
     2) Action operators: this class of operators triggers SparkContext to submit a Job.
    An Action operator triggers Spark to submit a job (Job) and outputs data out of the Spark system.

  In another view, Spark operators fall broadly into three classes:
  1) Value-type Transformation operators: the transformation does not trigger job submission; the data items processed are of Value type.
  2) Key-Value-type Transformation operators: the transformation does not trigger job submission; the data items processed are of Key-Value type.
  3) Action operators: this class of operators triggers SparkContext to submit a Job.
1) Value-type Transformation operators
  I. Input and output partitions one-to-one
    1. map
    2. flatMap
    3. mapPartitions
    4. glom
  II. Input and output partitions many-to-one
    5. union
    6. cartesian
  III. Input and output partitions many-to-many
    7. groupBy
  IV. Output partitions a subset of input partitions
    8. filter
    9. distinct
    10. subtract
    11. sample
    12. takeSample
  V. Cache type
    13. cache
    14. persist
2) Key-Value-type Transformation operators
  I. Input and output partitions one-to-one
    15. mapValues
  II. Aggregation over a single RDD or over two RDDs
   Single-RDD aggregation
    16. combineByKey
    17. reduceByKey
    18. partitionBy
   Two-RDD aggregation
    19. cogroup
  III. Joins
    20. join
    21. leftOuterJoin and rightOuterJoin
  For the details of each Spark operator see http://www.cnblogs.com/zlslch/p/5723979.html
 3) Action operators
  I. No output
    22. foreach
  II. HDFS
    23. saveAsTextFile
    24. saveAsObjectFile
  III. Scala collections and data types
    25. collect
    26. collectAsMap
    27. reduceByKeyLocally
    28. lookup
    29. count
    30. top
    31. reduce
    32. fold
    33. aggregate
  For the details of each Spark operator see http://www.cnblogs.com/zlslch/p/5723979.html
 
1. Transformation operators
 (1) map
  Each data item of the source RDD is mapped into a new element by the user-defined function f passed to map. In the source code, the map operator is equivalent to creating MappedRDD(this, sc.clean(f)) from the original RDD.
     In Figure 1 each box represents an RDD partition: the partitions on the left are mapped by the user-defined function f: T => U into the partitions of the new RDD on the right. Note that in practice f is only applied to the data of each stage once an Action operator triggers execution. In Figure 1, record V1 of the first partition is fed into f and transformed into record V'1 of the new partition.

      Figure 1: RDD transformation by the map operator

    (2) flatMap
     Each element of the source RDD is transformed into new elements by the function f, and the elements of the generated collections are merged into one collection. Internally this creates FlatMappedRDD(this, sc.clean(f)).
  Figure 2 shows a flatMap operation over RDD partitions; the function passed to flatMap is f: T => U, where T and U can be arbitrary data types. The data in each partition is transformed into new data by the user-defined function f. The outer large box represents an RDD partition, and the small boxes represent collections: V1, V2, V3 form one collection stored as a single data item in an array-like container; after the transformation into V'1, V'2, V'3, the original container is pulled apart, and the resulting elements become individual data items of the new RDD.

Figure 2: RDD transformation by the flatMap operator
    (3) mapPartitions
      The mapPartitions function obtains an iterator over each partition, and inside the function operates on the elements of the whole partition through that iterator. Internally it creates a MapPartitionsRDD. In Figure 3 the boxes represent RDD partitions; the user filters each partition's data with the function f: (iter) => iter.filter(_ >= 3), keeping the elements greater than or equal to 3. In the partition containing 1, 2, 3, only the element 3 remains after filtering.


    Figure 3: RDD transformation by the mapPartitions operator
  (4) glom
  The glom function turns each partition into an array; internally it returns a GlommedRDD. In Figure 4 each box represents an RDD partition; the figure shows a partition containing V1, V2, V3 being turned by glom into the array Array(V1, V2, V3).

      Figure 4: RDD transformation by the glom operator
     (5) union
      Using union requires the two RDDs to have the same element data type; the returned RDD's element data type matches theirs. Only a merge is performed, without deduplicating the elements; to deduplicate the result,
use distinct(). Spark also provides the more concise union API via the ++ symbol, equivalent to the union function.
     In Figure 5 the large boxes on the left represent two RDDs whose inner boxes are partitions; the large box on the right is the merged RDD whose inner boxes are its partitions.
  The RDD containing V1, V2, U1, U2, U3, U4 is merged with the RDD containing V1, V8, U5, U6, U7, U8; V1, V1, V2, V8 form one partition and U1, U2, U3, U4, U5, U6, U7, U8 form another.

  Figure 5: RDD transformation by the union operator

(6) cartesian
        Performs the Cartesian product of the elements of two RDDs; internally the operation returns a CartesianRDD. In Figure 6 the large boxes on the left represent the two input RDDs, with inner boxes as their partitions; the large box on the right represents the resulting RDD, with inner boxes as its partitions.
      Example: V1 is combined with W1, W2, and Q5 in the Cartesian-product computation to form (V1,W1), (V1,W2), (V1,Q5).

       Figure 6: RDD transformation by the cartesian operator
(7) groupBy
  groupBy: generates a corresponding Key for each element via a function, converting the data to Key-Value format, and then groups the elements with the same Key.
  Function implementation:
  1) preprocess the user function:
  val cleanF = sc.clean(f)
  2) map each data item with the function, then group with groupByKey:
     this.map(t => (cleanF(t), t)).groupByKey(p)
  where p determines the partition count and partition function, i.e. the degree of parallelism.
  In Figure 7 each box represents an RDD partition; elements with the same key are merged into one group. For example, V1 and V2 merge under key V with values V1 and V2, forming V, Seq(V1, V2).

  Figure 7: RDD transformation by the groupBy operator
(8) filter
    The filter function filters the elements: the function f is applied to each element, elements whose return value is true are kept in the RDD, and elements whose return value is false are filtered out. Internally this is equivalent to creating FilteredRDD(this, sc.clean(f)).
    The essence of the function is implemented by the following code:
    def filter(f: T => Boolean): RDD[T] = new FilteredRDD(this, sc.clean(f))
  In Figure 8 each box represents an RDD partition, with T an arbitrary type. The user-defined filter function f operates on each data item, keeping the items for which the result is true. For example, V2 and V3 are filtered out and V1 is kept (renamed V'1 for distinction).

  Figure 8: RDD transformation by the filter operator
     
  (9) distinct
  distinct deduplicates the elements of an RDD. In Figure 9 each box represents an RDD partition; distinct deduplicates the data, e.g. the repeated data V1, V1 keeps only one V1 after deduplication.

    Figure 9: RDD transformation by the distinct operator
(10) subtract
  subtract performs a set-difference operation: removing the intersection of RDD 1 and RDD 2 from RDD 1. In Figure 10 the large boxes on the left represent the two input RDDs, with inner boxes as their partitions; the large box on the right
represents the resulting RDD, with inner boxes as partitions. V1 appears in both RDDs, so by the difference rule it is not kept in the new RDD; V2 appears in the first RDD but not in the second, so the new RDD's elements include V2.

          Figure 10: RDD transformation by the subtract operator
(11) sample
       sample samples the elements of an RDD collection to obtain a subset of the elements. The user sets whether sampling is with replacement, the sampling percentage, and the random seed, which together determine the sampling. Internally this creates SampledRDD(withReplacement, fraction, seed).
  Function parameter settings:
‰   withReplacement=true means sampling with replacement
‰   withReplacement=false means sampling without replacement
  In Figure 11 each box is an RDD partition; the sample function samples 50% of the data, and from V1, V2, U1, U2, U3, U4 the sampled data V1, U1, U2 forms the new RDD.

       Figure 11: RDD transformation by the sample operator
  (12) takeSample
  The takeSample() function works on the same principle as sample, except that it takes the number of samples rather than a proportion, and the result is no longer an RDD: it performs a
Collect() on the sampled data, returning the result as a single-machine array.
  In Figure 12 the left box represents partitions on distributed nodes, the right box the result array on a single machine; takeSample samples the data with a sample count of 1 and returns the result V1.

  Figure 12: RDD transformation by the takeSample operator
  (13) cache
     cache caches the RDD's elements from disk into memory, functionally equivalent to persist(MEMORY_ONLY).
     In Figure 13 each box represents an RDD partition: on the left the partitions are stored on disk, and the cache operator caches the data into memory.

      Figure 13: RDD transformation by the cache operator
  (14) persist
      The persist function caches the RDD; where the data is cached is determined by the StorageLevel enumeration. There are several combinations of types (see Figure 14-1): DISK means disk, MEMORY means memory, and SER means whether the data is stored in serialized form.
  In the function definition below, StorageLevel is the enumeration type representing the storage mode, which the user selects as needed per Figure 14-1:
  persist(newLevel: StorageLevel)
  Figure 14-1 lists the cache modes persist can use; e.g. MEMORY_AND_DISK_SER means data is stored in both memory and disk and managed in serialized form.

            Figure 14-1: persist storage levels
  In Figure 14-2 each box represents an RDD partition; disk means stored on disk, mem means stored in memory. The data is initially all stored on disk; persist(MEMORY_AND_DISK) caches it into memory, but one partition can no longer fit in memory, so the RDD containing V1, V2, V3 is stored on disk while the RDD containing U1, U2 is still stored in memory.

      Figure 14-2: RDD transformation by the persist operator
(15) mapValues
      mapValues: performs a Map operation on the Value of (Key, Value)-type data without processing the Key.
    In Figure 15 each box represents an RDD partition; a => a+2 means that for Key-Value data such as (V1, 1), the operation adds 2 to the value 1, returning the result 3.

      Figure 15: RDD transformation by the mapValues operator
(16) combineByKey
  The definition of combineByKey:
  combineByKey[C](createCombiner: (V) => C,
  mergeValue: (C, V) => C,
  mergeCombiners: (C, C) => C,
  partitioner: Partitioner,
  mapSideCombine: Boolean = true,
  serializer: Serializer = null): RDD[(K, C)]
Explanation:
‰   createCombiner: V => C; when no C exists yet, e.g. create a Seq C from V.
‰   mergeValue: (C, V) => C; when C already exists, merge item V into it, e.g.
appending to the Seq C, or accumulating.
‰   mergeCombiners: (C, C) => C; merge two Cs.
‰   partitioner: Partitioner; the Partitioner needed at shuffle time.
‰   mapSideCombine: Boolean = true; to reduce the transfer volume, combine on the map
side first, i.e. pre-aggregate the values of the same key within a partition before the
shuffle.
‰   serializerClass: String = null; serialization is needed for transfer, with support for user-defined serialization classes.
  Example: turn an RDD with (Int, Int) elements into an RDD with (Int, Seq[Int])-type elements. In Figure 16 each box represents an RDD partition; combineByKey merges (V1, 2) and (V1, 1) into (V1, Seq(2, 1)).

      Figure 16: RDD transformation by the combineByKey operator
(17) reduceByKey
     reduceByKey is a simpler special case of combineByKey in which two values are merged into one value ((Int, Int V) to (Int, Int C), e.g. by simple accumulation). Its createCombiner is trivial, directly returning v, and mergeValue and mergeCombiners share the same logic with no distinction.
    Function implementation:
    def reduceByKey(partitioner: Partitioner, func: (V, V) => V): RDD[(K, V)] =
{
  combineByKey[V]((v: V) => v, func, func, partitioner)
}
  In Figure 17 each box represents an RDD partition; with the user-defined function (A, B) => (A + B), the values of data with the same key, (V1, 2) and (V1, 1), are added, yielding the result (V1, 3).

        Figure 17: RDD transformation by the reduceByKey operator
(18) partitionBy
  The partitionBy function repartitions an RDD.
  Function definition:
  partitionBy(partitioner: Partitioner)
  If the original RDD's partitioner matches the given partitioner (partitioner), no repartition is performed; if they differ, a new ShuffledRDD is generated according to the partitioner.
  In Figure 18 each box represents an RDD partition; under the new partitioning strategy, the data V1 and V2 originally in different partitions are merged into one partition.


    Figure 18: RDD transformation by the partitionBy operator
 (19) cogroup
   The cogroup function co-partitions two RDDs. The definition of cogroup:
  cogroup[W](other: RDD[(K, W)], numPartitions: Int): RDD[(K, (Iterable[V], Iterable[W]))]
  The Key-Value elements of the two RDDs are co-partitioned: within each RDD, elements with the same Key are aggregated into a collection, and for each Key an iterator over the two RDDs' corresponding element collections is returned:
  (K, (Iterable[V], Iterable[W]))
  where the Key is the key, and the Value is a tuple formed from the iterators over the two RDDs' data collections with that same Key.
  In Figure 19 the large boxes represent RDDs and the inner boxes represent the RDDs' partitions; (U1, 1) and (U1, 2) from RDD1 are merged with (U1, 2) from RDD2 into (U1, ((1, 2), (2))).

        Figure 19: RDD transformation by the cogroup operator
 (20) join
       join performs the cogroup function on the two RDDs to be connected, so that data with the same key lands in the same partition. After the cogroup operation forms the new RDD, the Cartesian product is taken over the elements under each key, the results are flattened, and the tuples corresponding to each key are gathered into one collection, returning RDD[(K, (V, W))].
   The essence of the join function, shown in the code below, is cogroup followed by flatMapValues to flatten the merged data:
       this.cogroup(other, partitioner).flatMapValues { case (vs, ws) => for (v <- vs; w <- ws) yield (v, w) }
Figure 20 illustrates the join of two RDDs: the large boxes represent RDDs and the small boxes their partitions. The function joins the elements with the same key, e.g. V1, by key, and the results are (V1, (1, 1)) and (V1, (1, 2)).

                    Figure 20: RDD transformation by the join operator
(21) leftOuterJoin and rightOuterJoin
  leftOuterJoin (left outer join) and rightOuterJoin (right outer join), beyond a plain join, first check whether one side's RDD elements are empty: if empty, fill with None; if not empty, join the data and
return the result.
The code below is part of the leftOuterJoin implementation:
if (ws.isEmpty) {
  vs.map(v => (v, None))
} else {
  for (v <- vs; w <- ws) yield (v, Some(w))
}
 
2. Action operators
  In essence, an Action operator triggers execution of the RDD DAG by submitting the job through SparkContext's runJob operation.
For example, the code of the Action operator collect; interested readers can use it as an entry point for reading the source:

/**
 * Return an array that contains all of the elements in this RDD.
 */
def collect(): Array[T] = {
  /* submit the Job */
  val results = sc.runJob(this, (iter: Iterator[T]) => iter.toArray)
  Array.concat(results: _*)
}
(22) foreach
  foreach applies the function f to each element of the RDD; instead of returning an RDD or Array, it returns Unit. Figure 22 shows the foreach operator operating on each data item through a user-defined function. In this example the custom function is println(), which prints each data item to the console.

      Figure 22: foreach operating on an RDD
  (23) saveAsTextFile
  This function stores the data output to a specified directory in HDFS.
The following is part of the internal implementation of saveAsTextFile;
  it is implemented by calling saveAsHadoopFile:
this.map(x => (NullWritable.get(), new Text(x.toString))).saveAsHadoopFile[TextOutputFormat[NullWritable, Text]](path)
Each element of the RDD is mapped to (null, x.toString) and then written to HDFS.
  In Figure 23 the boxes on the left represent RDD partitions and the boxes on the right represent HDFS Blocks; through the function, each RDD partition is stored as a Block in HDFS.

            Figure 23: saveAsTextFile writing an RDD to HDFS
  (24) saveAsObjectFile
  saveAsObjectFile takes every 10 elements of a partition as an Array, then serializes the Array, maps it to (Null, BytesWritable(Y)), and writes the elements to HDFS in SequenceFile format.
  The following code is part of the function's internal implementation:
  map(x => (NullWritable.get(), new BytesWritable(Utils.serialize(x))))
  In Figure 24 the boxes on the left represent RDD partitions and the boxes on the right represent HDFS Blocks; through the function, each RDD partition is stored as a Block in HDFS.

            Figure 24: saveAsObjectFile writing an RDD to HDFS
 
 (25) collect
  collect is equivalent to toArray, but toArray is deprecated; collect is recommended. It returns a distributed RDD as a single-machine Scala Array, on which Scala's functional operations can then be applied.
  In Figure 25 the left box represents RDD partitions and the right box the array in single-machine memory; through the function, the results are returned to the Driver program's node and stored as an array.

  Figure 25: collect gathering an RDD
(26) collectAsMap
  collectAsMap returns the data of a (K, V)-type RDD as a single-machine HashMap. For RDD elements with a duplicate K, later elements overwrite earlier ones.
  In Figure 26 the left box represents RDD partitions and the right box the single-machine result; the data is returned to the Driver program via collectAsMap, and the computed result is stored as a HashMap.


          Figure 26: collectAsMap gathering an RDD

 (27) reduceByKeyLocally
  Implements reduce first, then collectAsMap: it first performs a reduce over the whole RDD, then collects the results, returning a HashMap.
 (28) lookup
The declaration of lookup:
lookup(key: K): Seq[V]
The lookup function operates on a (Key, Value)-type RDD and returns the elements corresponding to the specified Key, forming a Seq. This function has an optimization: if the RDD has a partitioner, it processes only the partition that K maps to and returns the (K, V) values from that partition as a Seq; if the RDD has no partitioner, it must brute-force scan all elements of the entire RDD to search for the elements corresponding to the specified K.
  In Figure 28 the left box represents RDD partitions and the right box the Seq; the final result is returned to the application on the Driver node.

      Figure 28: lookup over an RDD
(29) count
  count returns the number of elements in the whole RDD.
  Internal implementation:
  def count(): Long = sc.runJob(this, Utils.getIteratorSize _).sum
  In Figure 29 the returned count of data items is 5; each box represents an RDD partition.

     Figure 29: count over an RDD
(30) top
top returns the largest k elements. Function definition:
top(num: Int)(implicit ord: Ordering[T]): Array[T]
Related function explanations:
· top returns the largest k elements.
· take returns the smallest k elements.
· takeOrdered returns the smallest k elements, preserving element order in the returned array.
· first is equivalent to top(1), returning the largest element of the whole RDD. The ordering is defined by Ordering[T];
each returns an array containing the corresponding k elements.

(31) reduce
  The reduce function is equivalent to a reduceLeft operation over the RDD's elements. Function implementation:
  Some(iter.reduceLeft(cleanF))
  reduceLeft first reduces two elements, then reduces the result with the next element taken from the iterator, until the iterator has traversed all elements, producing the result. In an RDD this proceeds by first applying reduceLeft to each partition's element collection separately, then treating each partition's result as an element and applying reduceLeft to the collection of those results.
  Example: a user-defined function
  f: (A, B) => (A._1 + "@" + B._1, A._2 + B._2)
  In Figure 31 each box represents an RDD partition; the data is reduced by the user-defined function f. The example
returns the result V1@V2@U1@U2@U3@U4, 12.


Figure 31: RDD reduction by the reduce operator
(32) fold
  fold works on the same principle as reduce, except that when reducing, the first element taken from the iterator is zeroValue.
  Figure 32 shows a fold with the user-defined function below; each box represents an RDD partition. Readers can refer to the reduce function for understanding.
  fold(("V0@", 2))((A, B) => (A._1 + "@" + B._1, A._2 + B._2))


          Figure 32: fold over an RDD
 (33) aggregate
   aggregate first aggregates the elements of each partition, then performs a fold over the partitions' results.
  aggregate differs from fold and reduce in that it aggregates in a map-reduce fashion, so the aggregation can be parallelized; with the fold and reduce functions, the aggregation within each partition is serial, and after all partitions finish, their results are aggregated serially in the same way to produce the final result.
  Function definition:
aggregate[B](z: B)(seqop: (B, A) => B, combop: (B, B) => B): B
  Figure 33 shows an aggregate operation over an RDD with user-defined functions; each box represents an RDD partition.
  rdd.aggregate(("V0@", 2))((A, B) => (A._1 + "@" + B._1, A._2 + B._2), (A, B) => (A._1 + "@" + B._1, A._2 + B._2))
  Finally, two special variables in the computation model:
  Broadcast (broadcast) variables: widely used, e.g. for broadcasting the small table in a Map Side Join. When a data collection is small enough for a single node's memory to hold it, there is no need to scatter it across nodes like an RDD.
At run time Spark sends the broadcast variable's data to each node and keeps it there for reuse in subsequent computations. Compared with Hadoop's distributed cache, the broadcast content can be shared across jobs. Broadcast's underlying implementation uses a BT-like mechanism.

        Figure 33: aggregate over an RDD
(In the figure, ② denotes V and ③ denotes U.)
accumulator variables: allow global accumulation operations; accumulator variables are widely used in scenarios such as recording
an application's current running metrics.

Detailed walkthrough
The official documentation lists 32 common operators, including 20 Transformation operations and 12 Action operations.
(Note: the screenshots are results of runs on Windows.)
Transformation:
1. map
map applies the input transformation function to every element of the RDD, whereas mapPartitions applies it to each partition; the difference is the calling granularity. With parallelize(1 to 10, 3), the map function executes 10 times while the mapPartitions function executes 3 times.

2. filter(function)
Filtering: the RDD elements for which filter's function returns true form the new dataset, e.g. filter(a => a == 1).

3. flatMap(function)
map performs the function operation on each element of the RDD one by one, mapping it into another RDD; flatMap applies the function to every element of the RDD, and the contents of all the returned iterators together constitute the new RDD.
The difference between flatMap and map: map just maps, while flatMap first maps and then flattens; map returns one object per element (one application of func), whereas flatMap additionally merges those objects into a single one.
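The map/flatMap distinction just described can be shown with plain Python list comprehensions (a sketch of the semantics, not Spark API code):

```python
# map yields one object per element; flatMap flattens the per-element iterables.
def rdd_map(data, f):
    return [f(x) for x in data]

def rdd_flat_map(data, f):
    return [y for x in data for y in f(x)]

lines = ["a b", "c"]
print(rdd_map(lines, lambda s: s.split()))       # [['a', 'b'], ['c']]
print(rdd_flat_map(lines, lambda s: s.split()))  # ['a', 'b', 'c']
```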

4. mapPartitions(function)
Distinguish it from foreachPartition (which belongs to Action and has no return value); the difference between mapPartitions and map in return values was mentioned above. It runs separately on each partition (block) of the RDD, so when running on an RDD of type T, the function must be of type Iterator<T> => Iterator<U> (taking the iterator as input).

5. mapPartitionsWithIndex(function)
Similar to mapPartitions, but the function also receives an integer value representing the partition's index as a parameter, so function must be of type (Int, Iterator<T>) => Iterator<U>.

6. sample(withReplacement, fraction, seed)
Sampling operation: takes a portion of the data out of the sample, where withReplacement is whether to sample with replacement, fraction is the sampling proportion, and seed specifies the seed of the random number generator (with/without replacement is true/false; fraction is the sampling proportion in (0, 1]; seed is an integer).

7. union(otherDataSet)
Returns the union of the source dataset and the other dataset, without deduplication.

8. intersection(otherDataSet)
Returns the intersection of the source dataset and the other dataset, deduplicated; the result is returned unordered.

9. distinct([numTasks])
Returns a new dataset with the source dataset deduplicated; locally ordered but globally unordered (for a detailed introduction see
https://blog.csdn.net/Fortuna_i/article/details/81506936)
Note: groupByKey, reduceByKey, aggregateByKey, sortByKey, join, cogroup, and other Transformation operations all take the [numTasks] task-count parameter; refer to the link above to understand it.

Note: the following operate on pair RDDs; a simple way of creating a pair RDD is shown first.

10. groupByKey([numTasks])
Called on a (k, v) pair RDD, returns (k, Iterable<v>): the values of the same key are grouped into a collection, with no guaranteed order within the sequence. groupByKey loads the full collection of values for each key into memory for storage and computation; if a key has too many corresponding values, this easily leads to memory overflow.
Building on the earlier union: pair1 and pair2 are unioned into pair3, which contains duplicate keys, and groupByKey is applied to pair3.

11. reduceByKey(function, [numTasks])
Similar to groupByKey, but given (a,1), (a,2), (b,1), (b,2): groupByKey produces the intermediate result ((a,1),(a,2)), ((b,1),(b,2)), whereas reduceByKey produces (a,3), (b,3) directly.
reduceByKey aggregates; groupByKey groups (function aggregates the values of each key).
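The reason reduceByKey is usually cheaper can be sketched in plain Python: it pre-aggregates (map-side combine) within each partition before any data would be shuffled, so far fewer records cross the network than with groupByKey. Illustrative data only.

```python
# Sketch of the map-side combine that reduceByKey performs per partition.
def map_side_combine(partition):
    combined = {}
    for key, value in partition:
        combined[key] = combined.get(key, 0) + value   # (a,1),(a,2) -> (a,3)
    return list(combined.items())

partition = [("a", 1), ("a", 2), ("b", 1), ("b", 2)]
print(sorted(map_side_combine(partition)))  # [('a', 3), ('b', 3)]
```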

12. aggregateByKey(zeroValue)(seqOp, combOp, [numTasks])
Similar to reduceByKey: aggregates over the values of the same key in a pair RDD, with an initial value that is used in seqOp but not in combOp. The return value is a pair RDD, unlike aggregate (whose return value is not an RDD).

13. sortByKey([ascending], [numTasks])
Likewise based on a pair RDD, sorts by the key value; ascending means ascending order, defaulting to true, plus numTasks.

14. join(otherDataSet, [numTasks])
A join: called on datasets of type (k, v) and (k, w), returns a (k, (v, w)) pair dataset.

15. cogroup(otherDataSet, [numTasks])
Merges two RDDs, generating a new RDD whose instances contain two Iterable values: the first represents the values in RDD1 with the same key, the second the values in RDD2 with the same key (per key value). This operation needs repartitioning through a partitioner, and therefore a shuffle (unless the two RDDs were already shuffled appropriately beforehand).

16. cartesian(otherDataSet)
Computes the Cartesian product; this operation does not perform a shuffle.

17. pipe(command, [envVars])
Pipes each partition of the RDD through a shell command; via the pipe transformation, shell commands can generate new RDDs in Spark:

18. coalesce(numPartitions)
Repartitions, reducing the RDD's partition count to numPartitions.

19. repartition(numPartitions)
repartition is a simple wrapper over the coalesce interface with shuffle=true: it reshuffles the RDD's data randomly to create balanced partitions, whether more or fewer than the original partition count, and always requires a shuffle.

20. repartitionAndSortWithinPartitions(partitioner)
This method repartitions the RDD according to the partitioner, and within each resulting partition sorts by key.
 
Action:
1. reduce(function)
reduce passes the elements of the RDD to the input function two at a time, producing a new value; the new value and the next RDD element are again passed to the input function, until only one value remains.

2. collect()
Returns all of the RDD's elements in the form of an Array (for details see:
https://blog.csdn.net/Fortuna_i/article/details/80851775)

3. count()
Returns the number of elements in the dataset, as a Long by default.

4. first()
Returns the first element of the dataset (similar to take(1)).

5. takeSample(withReplacement, num, [seed])
Randomly samples the dataset and returns an array containing num randomly sampled elements; withReplacement indicates whether sampling is with replacement, and the seed parameter specifies the random-number generator seed.
This method should only be used when the expected result array is small, since all the data is loaded into the driver's memory.

6. take(n)
Returns an array containing the first n elements of the dataset (from element 0 to element n-1), unsorted.

7. takeOrdered(n, [ordering])
Returns the first n elements of the RDD, in natural ascending order by default, or ordered by a custom comparator.

8. saveAsTextFile(path)
Writes the elements of the dataset in text-file form to the local file system, HDFS, etc. Spark calls each element's toString method to convert the data element into a line of the text file.
If saving to the local file system, the file is saved in the corresponding directory on each executor's machine.


9. saveAsSequenceFile(path) (Java and Scala)
Writes the elements of the dataset as a Hadoop SequenceFile to the local file system, HDFS, etc. (an operation on pair RDDs).


10. saveAsObjectFile(path) (Java and Scala)
Writes the elements of the dataset in ObjectFile form to the local file system, HDFS, etc.


11. countByKey()
Counts the number of each K in an RDD[K, V]; returns a hashMap of (k, int) pairs with the count of each key.

12. foreach(function)
Runs the function on each element of the dataset.

Supplement: in the Spark 2.3 official documentation, the original [numTasks] task-count parameter has been changed to [numPartitions], the partition count.
Practice
Spark RDD operators divide into two classes: Transformation and Action.
Transformation: lazy loading; a Transformation only records metadata and does not compute. The computation only truly starts when a task-triggering Action occurs.
Action: loads the data immediately and starts computing.
There are two ways to create an RDD:
1. from a file system, via sc.textFile("/root/words.txt")
2. # by parallelizing a Scala collection: val rdd1 = sc.parallelize(Array(1,2,3,4,5,6,7,8))
1. Simple operators explained
First the simpler Transformation operators.
Create an RDD by parallelizing a Scala collection:
val rdd1 = sc.parallelize(Array(1,2,3,4,5,6,7,8))
View the RDD's partition count:
rdd1.partitions.length
map: as in Scala, takes each element of the List and applies the function to it.
sortBy: sorts the data:
val rdd2 = sc.parallelize(List(5,6,4,7,3,8,2,9,1,10)).map(_*2).sortBy(x => x, true)
filter: applies the predicate to the List's data, picking out the values greater than 10:
val rdd3 = rdd2.filter(_ > 10)
collect: displays the final result.
flatMap: first performs the map operation on the data, then the flatten operation:
rdd4.flatMap(_.split(' ')).collect
Run results:


val rdd1 = sc.parallelize(List(5,6,4,7,3,8,2,9,1,10))
val rdd2 = sc.parallelize(List(5,6,4,7,3,8,2,9,1,10)).map(_*2).sortBy(x => x, true)
val rdd3 = rdd2.filter(_ > 10)
val rdd2 = sc.parallelize(List(5,6,4,7,3,8,2,9,1,10)).map(_*2).sortBy(x => x + "", true)
val rdd2 = sc.parallelize(List(5,6,4,7,3,8,2,9,1,10)).map(_*2).sortBy(x => x.toString, true)


intersection: intersection
val rdd9 = rdd6.intersection(rdd7)
val rdd1 = sc.parallelize(List(("tom", 1), ("jerry", 2), ("kitty", 3)))
val rdd2 = sc.parallelize(List(("jerry", 9), ("tom", 8), ("shuke", 7)))


join
val rdd3 = rdd1.join(rdd2)

val rdd3 = rdd1.leftOuterJoin(rdd2)

val rdd3 = rdd1.rightOuterJoin(rdd2)

union: union; note the types must match
val rdd6 = sc.parallelize(List(5,6,4,7))
val rdd7 = sc.parallelize(List(1,2,3,4))
val rdd8 = rdd6.union(rdd7)
rdd8.distinct.sortBy(x => x).collect


groupByKey
val rdd3 = rdd1 union rdd2
rdd3.groupByKey
rdd3.groupByKey.map(x => (x._1, x._2.sum))


cogroup
val rdd1 = sc.parallelize(List(("tom", 1), ("tom", 2), ("jerry", 3), ("kitty", 2)))
val rdd2 = sc.parallelize(List(("jerry", 2), ("tom", 1), ("shuke", 2)))
val rdd3 = rdd1.cogroup(rdd2)
val rdd4 = rdd3.map(t => (t._1, t._2._1.sum + t._2._2.sum))


cartesian: Cartesian product
val rdd1 = sc.parallelize(List("tom", "jerry"))
val rdd2 = sc.parallelize(List("tom", "kitty", "shuke"))
val rdd3 = rdd1.cartesian(rdd2)


Next, the simpler Action operators:
val rdd1 = sc.parallelize(List(1,2,3,4,5), 2)
#collect
rdd1.collect
#reduce
val rdd2 = rdd1.reduce(_+_)
#count
rdd1.count
#top
rdd1.top(2)
#take
rdd1.take(2)
#first (similar to take(1))
rdd1.first
#takeOrdered
rdd1.takeOrdered(3)

2. Complex operators
mapPartitionsWithIndex: emits each partition's number together with the values in it. Definition:
val func = (index: Int, iter: Iterator[(Int)]) => {
  iter.toList.map(x => "[partID:" + index + ", val: " + x + "]").iterator
}
val rdd1 = sc.parallelize(List(1,2,3,4,5,6,7,8,9), 2)
rdd1.mapPartitionsWithIndex(func).collect


aggregate
def func1(index: Int, iter: Iterator[(Int)]): Iterator[String] = {
  iter.toList.map(x => "[partID:" + index + ", val: " + x + "]").iterator
}
val rdd1 = sc.parallelize(List(1,2,3,4,5,6,7,8,9), 2)
rdd1.mapPartitionsWithIndex(func1).collect
### An action. The first argument is the initial value; the second is a pair of functions (the first aggregates within each partition, the second aggregates the per-partition results)
### 0 + (0+1+2+3+4) + (0+5+6+7+8+9)
rdd1.aggregate(0)(_+_, _+_)


rdd1.aggregate(0)(math.max(_, _), _ + _)
### 0 is compared with the elements of partition 0 and of partition 1, taking the maximum within each partition; here that gives 4 and 9, so the result is 0+4+9 = 13


### 5 vs 1,2,3,4 -> 5; 5 vs 5,6,7,8,9 -> 9; so the result is 5 + (5+9) = 19
rdd1.aggregate(5)(math.max(_, _), _ + _)

val rdd3 = sc.parallelize(List("12","23","345","4567"), 2)
rdd3.aggregate("")((x,y) => math.max(x.length, y.length).toString, (x,y) => x + y)
######### Compares string lengths within each of the two partitions: in partition 0 ("12","23") the longest string has length 2, in partition 1 ("345","4567") length 4; the per-partition results "2" and "4" are then concatenated, and since which partition finishes first is not deterministic, the result is "24" or "42"


val rdd4 = sc.parallelize(List("12","23","345",""), 2)
rdd4.aggregate("")((x,y) => math.min(x.length, y.length).toString, (x,y) => x + y)
######## In partition 0, min("".length, "12".length) = 0 gives "0", whose length 1 is then compared with "23" to give "1"; partition 1 ends at "0" because of the empty string; concatenating the per-partition results yields "10" or "01"
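The aggregate semantics traced above (seqOp folds within each partition starting from the zero value, then combOp folds the per-partition results, again seeded with the zero value) can be simulated in plain Python, with the partition layout of the examples stated explicitly:

```python
from functools import reduce

def aggregate(partitions, zero, seq_op, comb_op):
    # apply seq_op within each partition, starting from the zero value
    partial = [reduce(seq_op, part, zero) for part in partitions]
    # then combine the per-partition results, again seeded with the zero value
    return reduce(comb_op, partial, zero)

# List(1..9) split into 2 partitions, as in the examples above
parts = [[1, 2, 3, 4], [5, 6, 7, 8, 9]]

sum_all = aggregate(parts, 0, lambda a, b: a + b, lambda a, b: a + b)  # 0+(0+1+2+3+4)+(0+5..9)
max_sum = aggregate(parts, 0, max, lambda a, b: a + b)                 # 0 + 4 + 9
max_sum5 = aggregate(parts, 5, max, lambda a, b: a + b)                # 5 + (5 + 9)
```

This reproduces the three numeric results discussed above (45, 13, and 19).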


aggregateByKey
val pairRDD = sc.parallelize(List(("cat",2), ("cat",5), ("mouse",4), ("cat",12), ("dog",12), ("mouse",2)), 2)
def func2(index: Int, iter: Iterator[(String, Int)]): Iterator[String] = {
  iter.toList.map(x => "[partID:" + index + ", val: " + x + "]").iterator
}
pairRDD.mapPartitionsWithIndex(func2).collect


pairRDD.aggregateByKey(0)(math.max(_, _), _ + _).collect
########## First the data in partition 0 is processed (each value is compared with the initial value), giving (cat,5), (mouse,4); then the data in partition 1, giving (cat,12), (dog,12), (mouse,2); finally the two partitions' results are added per key to produce the final result
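The per-partition behavior just described can be sketched in plain Python: the zero value is applied once per key per partition by the sequence function, and the per-partition results are then merged by the combine function:

```python
def aggregate_by_key(partitions, zero, seq_op, comb_op):
    merged = {}
    for part in partitions:
        local = {}
        for k, v in part:
            # seq_op starts from the zero value once per key per partition
            local[k] = seq_op(local.get(k, zero), v)
        for k, v in local.items():
            merged[k] = comb_op(merged[k], v) if k in merged else v
    return merged

# the pairRDD example above, with its 2-partition layout
parts = [[("cat", 2), ("cat", 5), ("mouse", 4)],
         [("cat", 12), ("dog", 12), ("mouse", 2)]]

result = aggregate_by_key(parts, 0, max, lambda a, b: a + b)
# partition 0 yields cat->5, mouse->4; partition 1 yields cat->12, dog->12, mouse->2
```

The merged result matches the Spark output: cat 5+12=17, mouse 4+2=6, dog 12.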


coalesce
# coalesce(2, false) redistributes the data into 2 partitions without a shuffle (with true, the data is reshuffled randomly and redistributed across machines over the network)
val rdd1 = sc.parallelize(1 to 10, 10)
val rdd2 = rdd1.coalesce(2, false)
rdd2.partitions.length


repartition
repartition(x) is equivalent to coalesce(x, true)

collectAsMap: Map(b -> 2, a -> 1)
val rdd = sc.parallelize(List(("a", 1), ("b", 2)))
rdd.collectAsMap


combineByKey: same effect as reduceByKey
### The first argument x takes the first value of each key as-is; the second argument is a function for local (per-partition) aggregation; the third is a function that combines the local results
### x is the first value of each key within a partition: (hello,1)(hello,1)(good,1) -> (hello,(1,1)), (good,(1)) -> x is the first 1 for hello and the 1 for good
val rdd1 = sc.textFile("hdfs://master:9000/wordcount/input/").flatMap(_.split(" ")).map((_, 1))
val rdd2 = rdd1.combineByKey(x => x, (a: Int, b: Int) => a + b, (m: Int, n: Int) => m + n)
rdd2.collect


### When input/ holds 3 files (3 blocks, hence 3 partitions), 10 is added once per partition, i.e. three extra 10s in total
val rdd3 = rdd1.combineByKey(x => x + 10, (a: Int, b: Int) => a + b, (m: Int, n: Int) => m + n)
rdd3.collect


val rdd4 = sc.parallelize(List("dog","cat","gnu","salmon","rabbit","turkey","wolf","bear","bee"), 3)
val rdd5 = sc.parallelize(List(1,1,2,2,2,1,2,2,2), 3)
val rdd6 = rdd5.zip(rdd4)


The first argument List(_) wraps the first element into a List; the second, (x: List[String], y: String) => x :+ y, appends each further element y to the list; the third, (m: List[String], n: List[String]) => m ++ n, merges the lists from different partitions into a new List
val rdd7 = rdd6.combineByKey(List(_), (x: List[String], y: String) => x :+ y, (m: List[String], n: List[String]) => m ++ n)
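The three-function contract of combineByKey (createCombiner for the first value of a key in a partition, mergeValue locally, mergeCombiners across partitions) can be simulated in plain Python, using the zip example's data and an assumed 3-partition layout:

```python
def combine_by_key(partitions, create, merge_value, merge_combiners):
    merged = {}
    for part in partitions:
        local = {}
        for k, v in part:
            # create runs only on the first value of a key within a partition
            local[k] = merge_value(local[k], v) if k in local else create(v)
        for k, c in local.items():
            merged[k] = merge_combiners(merged[k], c) if k in merged else c
    return merged

# keys from rdd5 zipped with values from rdd4, 3 partitions of 3 pairs each
parts = [[(1, "dog"), (1, "cat"), (2, "gnu")],
         [(2, "salmon"), (2, "rabbit"), (1, "turkey")],
         [(2, "wolf"), (2, "bear"), (2, "bee")]]

grouped = combine_by_key(parts,
                         lambda v: [v],            # List(_)
                         lambda xs, v: xs + [v],   # (x: List, y) => x :+ y
                         lambda m, n: m + n)       # (m, n) => m ++ n
```

The result groups the animal names under keys 1 and 2, just as rdd7 does.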


countByKey
val rdd1 = sc.parallelize(List(("a", 1), ("b", 2), ("b", 2), ("c", 2), ("c", 1)))
rdd1.countByKey
rdd1.countByValue


filterByRange
val rdd1 = sc.parallelize(List(("e", 5), ("c", 3), ("d", 4), ("c", 2), ("a", 1)))
val rdd2 = rdd1.filterByRange("b", "d")
rdd2.collect


flatMapValues: Array((a,1), (a,2), (b,3), (b,4))
val rdd3 = sc.parallelize(List(("a", "1 2"), ("b", "3 4")))
val rdd4 = rdd3.flatMapValues(_.split(" "))
rdd4.collect


foldByKey
val rdd1 = sc.parallelize(List("dog", "wolf", "cat", "bear"), 2)
val rdd2 = rdd1.map(x => (x.length, x))
val rdd3 = rdd2.foldByKey("")(_+_)


keyBy: uses the result of the passed-in function as the key
val rdd1 = sc.parallelize(List("dog", "salmon", "salmon", "rat", "elephant"), 3)
val rdd2 = rdd1.keyBy(_.length)
rdd2.collect


keys and values
val rdd1 = sc.parallelize(List("dog", "tiger", "lion", "cat", "panther", "eagle"), 2)
val rdd2 = rdd1.map(x => (x.length, x))
rdd2.keys.collect
rdd2.values.collect


English descriptions of the methods:
map(func)
Return a new distributed dataset formed by passing each element of the source through a function func
filter(func)
Return a new dataset formed by selecting those elements of the source on which func returns true
flatMap(func) (executed internally right to left: the map first, then the flatten)
Similar to map, but each input item can be mapped to 0 or more output items (so func should return a Seq rather than a single item)
mapPartitions(func)
Similar to map, but runs separately on each partition (block) of the RDD, so func must be of type Iterator<T> => Iterator<U> when running on an RDD of type T
mapPartitionsWithIndex(func)
Similar to mapPartitions, but also provides func with an integer value representing the index of the partition, so func must be of type (Int, Iterator<T>) => Iterator<U> when running on an RDD of type T
sample(withReplacement, fraction, seed)
Sample a fraction fraction of the data, with or without replacement, using a given random number generator seed
union(otherDataset)
Return a new dataset that contains the union of the elements in the source dataset and the argument
intersection(otherDataset)
Return a new RDD that contains the intersection of elements in the source dataset and the argument
distinct([numTasks])
Return a new dataset that contains the distinct elements of the source dataset
groupByKey([numTasks])
When called on a dataset of (K, V) pairs, returns a dataset of (K, Iterable<V>) pairs
reduceByKey(func, [numTasks])
When called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function func, which must be of type (V,V) => V. Like in groupByKey, the number of reduce tasks is configurable through an optional second argument
aggregateByKey(zeroValue)(seqOp, combOp, [numTasks])
When called on a dataset of (K, V) pairs, returns a dataset of (K, U) pairs where the values for each key are aggregated using the given combine functions and a neutral "zero" value. Allows an aggregated value type that is different than the input value type, while avoiding unnecessary allocations. Like in groupByKey, the number of reduce tasks is configurable through an optional second argument
sortByKey([ascending], [numTasks])
When called on a dataset of (K, V) pairs where K implements Ordered, returns a dataset of (K, V) pairs sorted by keys in ascending or descending order, as specified in the boolean ascending argument
join(otherDataset, [numTasks])
When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key. Outer joins are supported through leftOuterJoin, rightOuterJoin, and fullOuterJoin
cogroup(otherDataset, [numTasks])
When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (Iterable<V>, Iterable<W>)) tuples. This operation is also called groupWith
cartesian(otherDataset)
When called on datasets of types T and U, returns a dataset of (T, U) pairs (all pairs of elements)
pipe(command, [envVars])
Pipe each partition of the RDD through a shell command, e.g. a Perl or bash script. RDD elements are written to the process's stdin, and lines output to its stdout are returned as an RDD of strings
coalesce(numPartitions)
Decrease the number of partitions in the RDD to numPartitions. Useful for running operations more efficiently after filtering down a large dataset
repartition(numPartitions)
Reshuffle the data in the RDD randomly to create either more or fewer partitions and balance it across them. This always shuffles all data over the network
repartitionAndSortWithinPartitions(partitioner)
Repartition the RDD according to the given partitioner and, within each resulting partition, sort records by their keys. This is more efficient than calling repartition and then sorting within each partition because it can push the sorting down into the shuffle machinery

9. Spark RDD
Over time, data analytics has reached a new level and changed how organizations operate. The expectation today is not just to analyze large volumes of data, but to do so with fast turnaround. Hadoop is the technology behind big-data analytics, but it falls short on fast processing; with the arrival of Spark, data can be processed at the speed people now expect.
When talking about Spark, the first term that comes to mind is Resilient Distributed Dataset (RDD). Spark RDD is what makes data processing faster; beyond that, a key feature of Spark is that it supports logical partitioning of datasets during computation.
This section discusses the technical aspects of Spark RDD to build a deeper understanding of its underlying details, and also gives an overview of what RDD means in Spark.
Features of Spark RDD
RDD is short for Resilient Distributed Dataset, where each term describes a property:
· Resilient: fault tolerant via the RDD lineage graph (DAG), so partitions can be recomputed when a node fails
· Distributed: the data of a Spark RDD resides on multiple nodes
· Dataset: the records of the data you work with
RDD was an answer to challenges in Hadoop's design, and the Spark RDD solution looks highly efficient; much of that comes down to lazy evaluation. RDDs in Spark do work only when it is needed, which saves a great deal of data-processing time and makes the whole process more efficient.
Spark RDD overcomes many of the shortcomings of Hadoop MapReduce through these features, which is also why Spark RDD is so popular.
Core features of Spark RDD:
· In-memory computation
· Lazy evaluation
· Fault tolerance
· Immutability
· Partitioning
· Persistence
· Coarse-grained operations
· Data locality
The sections below go through these step by step.
 
Spark RDD is a technique for representing a dataset distributed across multiple nodes, on which parallel operations can be performed. In other words, Spark RDD is Apache Spark's fault-tolerant abstraction and its fundamental data structure.
In Spark, an RDD is an immutable distributed collection of objects, and it supports two persistence methods:
· cache()
· persist()
 
Spark RDD uses in-memory caching: the datasets in a Spark RDD are logically partitioned and cached in memory. The benefit of in-memory caching is that if the data does not fit, the remaining data is either spilled to disk or recomputed; this is why it is called resilient. You can pull an RDD into Spark whenever you need it, which makes the whole data processing faster.
Spark can be up to 100x faster than Hadoop at data processing; the points below are among the factors that make Apache Spark faster.
 
 
Operations supported by Spark RDD
RDDs in Spark support two types of operations:
1 Transformations
2 Actions
 
Transformation
In a transformation, Spark RDD creates a new dataset from an existing one. To take a commonly cited example of a Spark RDD transformation: map passes each element of the dataset through a function and returns a new RDD representing the result.
Scala
val l = sc.textFile("example.txt")

val lLengths = l.map(s => s.length)

val totalLength = lLengths.reduce((a, b) => a + b)
If you want to reuse lLengths later, call persist() as shown:
lLengths.persist()
You can find the detailed list of transformations supported by Spark RDD in the API documentation at https://spark.apache.org.
 
Spark RDD supports two kinds of transformations:
1 Narrow transformation
2 Wide transformation
In a narrow transformation, each partition of the output RDD comes from a single partition of the parent RDD. In a wide transformation, each output partition may be built from many parent partitions; in other words, it is what is called a shuffle transformation.
 
Transformations on Spark RDDs are lazy: they do not compute their results right away. Instead, they just remember the transformations applied to a base dataset (for example, a file, as in the example above). The transformations are only computed when an action needs a result, which makes data processing faster and more efficient.
 
Every transformed RDD is recomputed each time you run an action on it, unless you use the persist method, with which Spark keeps the elements in the cluster so that later queries can access them much faster; persisting to disk is also supported, and Spark RDDs can be replicated across multiple nodes.
 
 
Actions
During an action, a computation is executed on the RDD's dataset and the value is returned to the driver program. For example, reduce is an action that aggregates the RDD's elements with some function and returns the final result to the program.
There are three ways to create a Spark RDD:
1 Parallelized collections
2 External datasets (external storage systems such as shared file systems, HBase, HDFS)
3 Existing Apache Spark RDDs
Let's go through each of these methods to understand how Spark RDDs are created.
 
Resilient Distributed Dataset (RDD) is a major feature of Apache Spark, and it is very important for understanding Apache Spark's role in the big-data industry.
 
Parallelized collections
You can create a parallelized collection by calling the parallelize method of the SparkContext on an existing collection in your Java, Scala, or Python driver program. In this case the elements of the collection are copied to form a distributed dataset that can be operated on in parallel.
 
An example of parallelizing a collection into a Spark RDD in Scala,
saving the numbers 2 to 6 in a parallelized collection:
val collection = Array(2, 3, 4, 5, 6)

val prData = spark.sparkContext.parallelize(collection)
 
Here we created the distributed dataset prData, which can be operated on in parallel; for example, you can call prData.reduce() to add up the elements of the array.
 
A key parameter of parallelized collections is the number of partitions the dataset is cut into. Spark runs one task per partition in the cluster; usually 2-4 partitions per CPU in the cluster is ideal. Spark sets the number of partitions automatically based on the cluster, but you can also set it manually by passing it as the second argument to parallelize.
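How a parallelized collection is cut into partitions can be sketched in plain Python; the range-based chunking below mirrors the even slicing described above (an illustrative sketch, with hypothetical function names, not Spark's internal code):

```python
def split_into_partitions(data, num_slices):
    # partition i covers the index range [i*n/num_slices, (i+1)*n/num_slices),
    # so sizes differ by at most one element
    n = len(data)
    return [data[(i * n) // num_slices:((i + 1) * n) // num_slices]
            for i in range(num_slices)]

parts = split_into_partitions(list(range(1, 11)), 3)
```

With 10 elements and 3 slices this yields partitions of sizes 3, 3, and 4, and every element lands in exactly one partition.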
 
External datasets
Apache Spark can create distributed datasets from any file storage supported by Hadoop, including:
· Local file system
· HDFS
· Cassandra
· HBase
· Amazon S3
Spark likewise supports file formats such as:
· Text files
· Sequence Files
· CSV
· JSON
· Any Hadoop Input Format
For example, Spark RDDs over text files can be created with the SparkContext's textFile method. The method takes a URL for the file (a path on the system, hdfs://, and so on) and reads the file as a collection of lines.
 
An important factor here: if you use a path on the local file system, the file must also be accessible at the same path on the worker nodes; either copy the data file to every node, or use a network-mounted shared file system.
 
You can also load an external dataset with the DataFrameReader interface and then convert the dataset to an RDD with the .rdd method.
 
Below is an example converting a text file; it returns a dataset of strings:
val exDataRDD = spark.read.textFile("path/of/textfile").rdd
 
 
Existing RDDs
RDDs are immutable and cannot be changed, but a transformation lets you create a new RDD from an existing one. Because no mutation occurs, consistency is maintained across the cluster. Some of the operations used for this purpose:
· map
· filter
· count
· distinct
· flatMap
Example:
val seasons = spark.sparkContext.parallelize(Seq("summer", "monsoon", "spring", "winter"))

val seasons1 = seasons.map(s => (s.charAt(0), s))

Relational database performance optimization: table splitting (current table / history table), table partitioning, and data-cleanup principles
Principles and goals
· As transaction volume accumulates day after day, the amount of data in the database keeps growing, which severely degrades system performance; part of the business tables' data should therefore be archived and cleaned up.
· Reducing the data volume improves request response times and thus the user experience.
Thresholds for deciding whether data needs cleanup
As a rule of thumb, when a table exceeds 5 GB on disk, or a table in an OLTP (online transaction processing) system exceeds 30 million rows, partitioning or splitting the table should be considered.
Beyond these thresholds, partitioning or splitting can also be driven by database performance metrics: when single-table optimizations (table design, index design, query design, and so on) have been fully exploited and still cannot satisfy the business requirements, consider partitioning or splitting even though the table's size or row count has not reached the thresholds above, and take the row count at that point as the threshold for that table.
Generally speaking, a row-count threshold is easier to operate with than a disk-capacity threshold, so once a threshold is reached, the row count is used as the threshold from then on. (The exact threshold depends on the table's data format, index design, and the tolerable per-transaction processing time.)
Estimating the full-load cycle
The full-load cycle is how long it takes an empty table to reach the threshold.
If the business volume is stable, the time needed to accumulate data up to the threshold is the table's full-load cycle. If the business volume is growing steadily, estimate the full-load cycle from the data's growth rate, in other words, know when the threshold will be reached and when migration must happen, so that it can be planned in advance.
Choosing the migration cycle
The migration cycle determines how often the table is migrated.
Option 1: migrate to the history table every day, i.e. a migration cycle of one day. The operation is simple and can be fully automated as a scheduled job, though it runs frequently.
Option 2: the one-third rule: migrate whenever one third of a full-load cycle's data has accumulated, i.e. the migration cycle is 1/3 of the full-load cycle. This is more plannable but slightly more complex to operate (especially when the business volume is not stable), and each migration needs development and operations planning and involvement.
Data flow through history
current table -> history table -> backup table

In the diagram, green stands for the current table, which still receives high-frequency writes and must offer very high query performance; yellow stands for the history table, which receives no more writes but still serves queries, so its query performance should remain high; grey stands for the backup table, which receives no writes and only optionally serves queries, so ordinary query performance is acceptable.
Partitioning schemes by data type
· Online transactions (real-time, write-heavy): only same-day transaction queries are served. Build an A/B table pair with a daily cutover: at the cutover, switch to the other table to serve traffic and migrate the idle one. Migrated online transactions go into the online history table, and the original table (A or B) is truncated. The history table is range-partitioned on the transaction creation time (or update/completion time), for example one partition per month depending on how fast the volume grows.
· Clearing transactions (non-real-time, read-heavy): serve online-transaction queries other than same-day, plus back-office queries and clearing operations. Split the monthly clearing transactions into one table per month; the front-end application then needs database routing for clearing queries, and cross-month query results must be merged and filtered. The clearing tables also use the current-table/history-table mechanism with periodic migration according to the migration cycle.
· Log-type data: treat as non-real-time transactions.
· User authentication data (heavily accessed, read-write): depending on the business's query needs, consider range partitioning by time (range is recommended for its extensibility; generally it keeps up unless the business volume explodes), or hash/range partitioning by user id. For choosing the partition key, see the items listed under Notes below.
History-table cleanup plan
· Transaction data (including online, clearing, and profit-sharing): purge every 5 years.
· Log data: purge every 5 years.
· User-authentication and merchant-authentication data: keep permanently.
Notes
· Log-type tables (operation logs, system logs, result records, task records, etc.): partition by time.
· Transaction-type tables (journals, details, ledgers, discrepancies, etc.): partition by creation time or update time.
· Notification-type tables: partition by time.
· Choosing the partition key: partition on the fields that are queried most often, which helps query speed. Partitioning by id is not recommended unless the business always queries by a fixed id; otherwise queries that only hit the partition index are wasted effort and can even be slower than with no partitioning.
· When choosing between range and hash partitioning, consider extensibility. For example, create 2 years of range partitions up front, and as they near expiry add another 2 years of partitions, and so on.
· Keep partition sizes fairly even, neither too large nor too small, so that the partition key can quickly narrow down the range.
· How many rows per partition is right? The principle is that the data range is small enough that even a full scan of one partition is not too slow; depending on the situation, hundreds of thousands to millions of rows.
· Related business operations (SQL) should as far as possible complete within one partition; if data must be pulled across partitions, pull from the partitions in parallel to improve speed.

Data warehouse slowly changing dimensions (SCD): implementation approaches
Definition of a slowly changing dimension
Wikipedia's definition:
Dimension is a term in data management and data warehousing that refers to logical groupings of data such as geographical location, customer information, or product information.
Slowly Changing Dimensions (SCD) are dimensions that have data that slowly changes.
That is, a dimension whose data changes slowly over time is a slowly changing dimension.
An example makes this clearer:
In a retail data warehouse, the fact table stores each salesperson's sales records. One day a salesperson is transferred from the Beijing branch to the Shanghai branch. How should this change be recorded, i.e. how should the change in the salesperson dimension be handled? First, why record the change at all? Because when total sales for the Beijing region and the Shanghai region are tallied, which region should this salesperson's records count toward? Naturally, sales before the transfer should count toward Beijing and sales after it toward Shanghai. So how do we mark which region the salesperson belongs to at a given time? That is exactly what handling slowly changing dimension data is about.
For handling slowly changing dimensions there are, in general, the following solutions:
1. Overwrite the old value with the new value

This method has a precondition: you must not care about the history of the value. For example, if a salesperson changes their English name and we do not care what the name used to be, we can simply overwrite (update) the value in the data warehouse.

2. Keep multiple records, with an added column to tell them apart
In this case a new record is inserted while the original record is kept, and a dedicated column distinguishes them:
(In the tables below, Supplier_State plays the role of the region from the example above; surrogate keys are used for clarity.)

Supplier_key Supplier_Code Supplier_Name Supplier_State Disable
001 ABC Phlogistical Supply Company CA Y
002 ABC Phlogistical Supply Company IL N
Or:

Supplier_key Supplier_Code Supplier_Name Supplier_State Version
001 ABC Phlogistical Supply Company CA 0
002 ABC Phlogistical Supply Company IL 1

Both of the variants above add version information to the data to flag which record is old and which is new.
The variant below instead adds an effective date and an expiry date to flag old versus new data:

Supplier_key Supplier_Code Supplier_Name Supplier_State Start_Date End_Date
001 ABC Phlogistical Supply Company CA 01-Jan-2000 21-Dec-2004
002 ABC Phlogistical Supply Company IL 22-Dec-2004

A NULL End_Date marks the current version of the data; a default far-future time (e.g. 12/31/9999) can be used in place of NULL, which makes the data easier to index and identify.

3. Extra columns that keep the previous value

Supplier_key Supplier_Name Original_Supplier_State Effective_Date Current_Supplier_State
001 Phlogistical Supply Company CA 22-Dec-2004 IL

This method keeps the trace of the change in additional columns. Unlike the second method, it cannot keep an arbitrary number of change records; it preserves at most the last change, so it suits dimensions that do not change more often than that.

4. A separate history table

A separate history table stores the full change history, while the dimension table itself keeps only the current data:

Supplier
Supplier_key Supplier_Name Supplier_State
001 Phlogistical Supply Company IL

Supplier_History
Supplier_key Supplier_Name Supplier_State Create_Date
001 Phlogistical Supply Company CA 22-Dec-2004

This method only records the historical trace of the changes; in practice it is not convenient for statistics and calculations.

5. Hybrid mode
This mode is a mixture of the modes above. Compared with them it is more comprehensive and better suited to complicated and changeable user requirements, so it is used fairly often:

Row_Key Supplier_key Supplier_Code Supplier_Name Supplier_State Start_Date End_Date Current Indicator
1 001 ABC001 Phlogistical Supply Company CA 22-Dec-2004 15-Jan-2007 N
2 001 ABC001 Phlogistical Supply Company IL 15-Jan-2007 1-Jan-2099 Y


This method has several advantages:
1. A simple filter condition selects the current value of the dimension.
2. It is fairly easy to join out the dimension value in effect at any historical moment of the fact data.
3. If the fact table has multiple time fields (e.g. Order Date, Shipping Date, Confirmation Date), it is easy to choose which one to use when joining the dimension data for analysis.

The Row_Key and Current Indicator columns make things even more convenient; after all, a little redundancy in a dimension table does not take much space, and it improves query efficiency.
In this design pattern the fact table should use Supplier_key as a foreign key, but that column alone no longer uniquely identifies one dimension row, so fact and dimension form a many-to-many relationship; when joining facts against the dimension, a timestamp filter (or the Indicator column) must be added.

6. Unconventional hybrid mode
The fifth implementation above has one drawback: the relationship between the fact table and the dimension table is many-to-many. Such relationships should be resolved at modeling time rather than at report run time; resolving them when modeling the BI semantic layer requires adding time filter conditions, which is rather cumbersome.
The following solution resolves the many-to-many relationship without modifying the fact table:

Supplier Dimension
Version_Number Supplier_key Supplier_Code Supplier_Name Supplier_State Start_Date End_Date
1 001 ABC001 Phlogistical Supply Company CA 22-Dec-2004 15-Jan-2007
0 001 ABC001 Phlogistical Supply Company IL 15-Jan-2007 1-Jan-2099

Fact Delivery (for clarity, surrogate keys again identify the dimension)
Delivery_Key Supplier_key Supplier_version_number Quantity Product Delivery_Date Order_Date
1 001 0 132 Bags 22-Dec-2006 15-Oct-2006
2 001 0 324 Chairs 15-Jan-2007 1-Jan-2007

In this scheme, the current row in the dimension table always has version number 0. When inserting new dimension data, first change the old version's version_number from 0 to 1 (or increment it), then insert the current data, keeping the current row's version number at 0.
When inserting data into the fact table, the dimension version number used is always 0.
This scheme completely solves the many-to-many problem between fact and dimension tables; a further advantage is that it guarantees referential integrity between them. In modeling tools such as ERwin or PowerDesigner, Version_Number and Supplier_key can be used as a composite key to link the two entities.
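The core SCD update step used by the hybrid modes (expire the current row, append a new current row with start/end dates and a current indicator) can be sketched in Python. Column names follow the tables above; the function name and the far-future sentinel date are illustrative assumptions, not a fixed API:

```python
from datetime import date

def scd2_update(rows, supplier_key, new_state, change_date,
                far_future=date(2099, 1, 1)):
    """Expire the current version of a dimension row and append the new version."""
    for row in rows:
        if row["Supplier_key"] == supplier_key and row["Current"] == "Y":
            row["End_Date"] = change_date   # close the old version
            row["Current"] = "N"
    rows.append({"Supplier_key": supplier_key, "Supplier_State": new_state,
                 "Start_Date": change_date, "End_Date": far_future,
                 "Current": "Y"})
    return rows

# the CA -> IL change from the tables above
dim = [{"Supplier_key": "001", "Supplier_State": "CA",
        "Start_Date": date(2004, 12, 22), "End_Date": date(2099, 1, 1),
        "Current": "Y"}]
dim = scd2_update(dim, "001", "IL", date(2007, 1, 15))
current = [r for r in dim if r["Current"] == "Y"]
```

A filter on Current = 'Y' then selects the dimension's current value, exactly as advantage 1 above describes.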






MySQL / Teradata / PySpark code conversion table
Operation
MySQL
Teradata SQL
PySpark
Add / drop a column
1. Add a column
alter table [`<schema>`.]`<table>` add column <column> <type>

2. Drop a column
alter table [`<schema>`.]`<table>` drop column <column>
1. Add a column
ALTER TABLE [<schema>.]<table> ADD <column> <type>

2. Drop a column
ALTER TABLE [<schema>.]<table> DROP <column>
1. Add a column

.withColumn('<column>', sum(df[col] for col in columns))

2. Drop a column
.drop('<column>')
Drop a database
DROP DATABASE IF EXISTS <db>
DELETE DATABASE <db> ALL
For Parquet files:
import subprocess
 
subprocess.check_call('rm -r <storage path>', shell=True)

For Hive tables:
from pyspark.sql import HiveContext
hive = HiveContext(spark.sparkContext)
hive.sql('drop database if exists <db> cascade')
Drop a table
DROP TABLE [`<schema>`.]`<table>`
DROP TABLE [<schema>.]<table>
For Parquet files:
import subprocess
 
subprocess.check_call('rm -r <storage path>/<table>', shell=True)
 
For Hive tables:
from pyspark.sql import HiveContext
hive = HiveContext(spark.sparkContext)
 
hive.sql('drop table if exists [`<db>`.]`<table>` purge')
Truncate a table's data
TRUNCATE TABLE [`<schema>`.]`<table>`
DELETE [<schema>.]<table> ALL
For Parquet files:
import subprocess

# capture the schema of the existing parquet file
df1 = spark.read.load(
    path='<storage path>/<table>',
    format='parquet', header=True)
_schema = df1.schema

subprocess.check_call('rm -r <storage path>/<table>', shell=True)

# write an empty dataset with the same schema back to the parquet path
df2 = spark.createDataFrame([], _schema)
df2.write.parquet(
    path='<storage path>/<table>',
    mode='overwrite')

For Hive internal tables:
from pyspark.sql import HiveContext
hive = HiveContext(spark.sparkContext)
hive.sql('truncate table [<schema>.]<table>')
For Hive external tables:
from pyspark.sql import HiveContext
hive = HiveContext(spark.sparkContext)
hive.sql('insert overwrite table [<schema>.]<table> select * from [<schema>.]<table> where 1=2')
Copy a table's structure to a new table
CREATE TABLE [`<schema2>`.]`<table2>` LIKE [`<schema1>`.]`<table1>`

Alternatively, show create table [`<schema1>`.]`<table1>` lists the old table's creation statement; copy that statement, change the table name to [`<schema2>`.]`<table2>`, and you can build an identical table.
CREATE TABLE [<schema2>.]<table2> AS [<schema1>.]<table1> WITH NO DATA

Alternatively, show table [<schema1>.]<table1> lists the old table's creation statement; copy that statement, change the table name to [<schema2>.]<table2>, and you can build an identical table.
For Parquet files:
# capture the schema of table 1's parquet file
df1 = spark.read.load(
    path='<storage path 1>/<table1>',
    format='parquet', header=True)

# write an empty dataset with the same schema to the new table's path
df2 = spark.createDataFrame([], df1.schema)
df2.write.parquet(
    path='<storage path 2>/<table2>',
    mode='overwrite')

For Hive tables:
CREATE TABLE [<schema2>.]<table2> LIKE [<schema1>.]<table1>

Alternatively, desc formatted [<schema1>.]<table1> or show create table [<schema1>.]<table1> lists the old table's creation statement; copy that statement, change the table name to [<schema2>.]<table2>, and you can build an identical table.
Create a table and insert query results
CREATE TABLE [`<schema>`.]`<table>` (
<column1> <type1>[ AUTO_INCREMENT],
<column2> <type2>[ AUTO_INCREMENT],
<column3> <type3>[ AUTO_INCREMENT],
...
<column n> <type n>[ AUTO_INCREMENT][,
PRIMARY KEY (<key column>)][,
UNIQUE (<unique column1>, <unique column2>, <unique column3>, ..., <unique column m>)]
) [ENGINE={InnoDB|MYISAM|BDB} DEFAULT CHARSET={utf8|gbk}]
 
INSERT INTO [`<schema>`.]`<table>`
<query statement>
 
or
 
CREATE TABLE [`<schema>`.]`<table>`
<query statement>
CREATE {MULTISET|SET} TABLE [<schema>.]<table>[,
<option1>,
<option2>,
<option3>,
...
<option n>]
(
<column1> <type1> [CHARACTER SET <charset1> NOT CASESPECIFIC],
<column2> <type2> [CHARACTER SET <charset2> NOT CASESPECIFIC],
<column3> <type3> [CHARACTER SET <charset3> NOT CASESPECIFIC],
...
<column n> <type n> [CHARACTER SET <charset n> NOT CASESPECIFIC]
)
[UNIQUE] [PRIMARY INDEX (<key column>)]
 
INSERT INTO [<schema>.]<table>
<query statement>
 
or
 
CREATE TABLE [<schema>.]<table> AS (
<query statement>
) WITH DATA

or
 
CREATE TABLE [<schema1>.]<table1> AS [<schema2>.]<table2> WITH DATA
For Parquet files:
spark.sql("""
<query statement>
""") \
.write.parquet(
    path='<storage path>/<table>',
    mode='overwrite')

For Hive tables:
Internal table:
from pyspark.sql import HiveContext
 
hive = HiveContext(spark.sparkContext)
hive.sql("""
CREATE TABLE [`<db>`.]`<table>`(
`<column1>` <type1>,
`<column2>` <type2>,
`<column3>` <type3>,
...
`<column n>` <type n>)
[PARTITIONED BY (
`<partition column1>` <partition type1>,
`<partition column2>` <partition type2>,
`<partition column3>` <partition type3>,
...
`<partition column n>` <partition type n>
)]
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
'hdfs://.../<table>'""")

or
hive.sql("""
CREATE TABLE [`<db>`.]`<table>`(
`<column1>` <type1>,
`<column2>` <type2>,
`<column3>` <type3>,
...
`<column n>` <type n>)
[PARTITIONED BY (
`<partition column1>` <partition type1>,
`<partition column2>` <partition type2>,
`<partition column3>` <partition type3>,
...
`<partition column n>` <partition type n>
)]
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS PARQUET""")
 
External table:
from pyspark.sql import HiveContext
 
hive = HiveContext(spark.sparkContext)
hive.sql("""
CREATE EXTERNAL TABLE [IF NOT EXISTS] [`<db>`.]`<table>`(
`<column1>` <type1>,
`<column2>` <type2>,
`<column3>` <type3>,
...
`<column n>` <type n>)
[PARTITIONED BY (
`<partition column1>` <partition type1>,
`<partition column2>` <partition type2>,
`<partition column3>` <partition type3>,
...
`<partition column n>` <partition type n>
)]
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
'hdfs://.../<table>'""")

hive.sql("INSERT OVERWRITE TABLE [`<db>`.]`<table>` <query statement>")
Insert a small amount of data
INSERT INTO [`<schema>`.]`<table>` (<column1>, <column2>, <column3>, ..., <column n>) VALUES
(<value1>, <value2>, <value3>, ..., <value n>),
(<value n+1>, <value n+2>, <value n+3>, ..., <value 2n>),
(<value 2n+1>, <value 2n+2>, <value 2n+3>, ..., <value 3n>),
...
(<value mn+1>, <value mn+2>, <value mn+3>, ..., <value (m+1)n>)
INSERT INTO [<schema>.]<table>
(<column1>, <column2>, <column3>, ..., <column n>)
SELECT *
FROM (SELECT *
FROM (SELECT <value1>, <value2>, <value3>, ..., <value n>) t
UNION ALL
SELECT *
FROM (SELECT <value n+1>, <value n+2>, <value n+3>, ..., <value 2n>) t
UNION ALL
SELECT *
FROM (SELECT <value 2n+1>, <value 2n+2>, <value 2n+3>, ..., <value 3n>) t
...
UNION ALL
SELECT *
FROM (SELECT <value mn+1>, <value mn+2>, <value mn+3>, ..., <value (m+1)n>) t
) tt
In PySpark:
<table>_df = spark.createDataFrame([(<value1>, <value2>, <value3>, ..., <value n>), (<value n+1>, <value n+2>, <value n+3>, ..., <value 2n>), (<value 2n+1>, <value 2n+2>, <value 2n+3>, ..., <value 3n>), ..., (<value mn+1>, <value mn+2>, <value mn+3>, ..., <value (m+1)n>)], ['<column1>', '<column2>', '<column3>', ..., '<column n>'])
<table>_df.write.parquet(
    path='<storage path>/<table>[/<partition column>=<partition value>]',
    mode='overwrite')
or
<table>_df = spark.sparkContext.parallelize([(<value1>, <value2>, <value3>, ..., <value n>), (<value n+1>, <value n+2>, <value n+3>, ..., <value 2n>), (<value 2n+1>, <value 2n+2>, <value 2n+3>, ..., <value 3n>), ..., (<value mn+1>, <value mn+2>, <value mn+3>, ..., <value (m+1)n>)]).toDF(['<column1>', '<column2>', '<column3>', ..., '<column n>'])
<table>_df.write.parquet(
    path='<storage path>/<table>[/<partition column>=<partition value>]',
    mode='overwrite')

For Hive tables:
INSERT INTO TABLE [<db>.]<table> [PARTITION (<partition column> = '<partition value>')]
VALUES (<value1>, <value2>, <value3>, ..., <value n>), (<value n+1>, <value n+2>, <value n+3>, ..., <value 2n>), (<value 2n+1>, <value 2n+2>, <value 2n+3>, ..., <value 3n>), ..., (<value mn+1>, <value mn+2>, <value mn+3>, ..., <value (m+1)n>)
Limit the number of rows returned by a query
SELECT <column list>
FROM [`<schema1>`.]`<table1>`
{INNER|LEFT|RIGHT|FULL} JOIN [`<schema2>`.]`<dimension table1>`
ON <join condition1>
{INNER|LEFT|RIGHT|FULL} JOIN [`<schema3>`.]`<measure table1>`
...
ON <join condition2>
[WHERE <filter condition>] LIMIT <row limit>
SELECT TOP <row limit> <column list>
FROM [<schema1>.]<table1>
{INNER|LEFT|RIGHT|FULL} JOIN [<schema2>.]<dimension table1>
ON <join condition1>
{INNER|LEFT|RIGHT|FULL} JOIN [<schema3>.]<measure table1>
...
ON <join condition2>
[WHERE <filter condition>]
spark.sql("""
SELECT * FROM [<schema1>.]<table1>
""") \
.createOrReplaceTempView("<table1>")

spark.sql("""
SELECT * FROM [<schema2>.]<dimension table1>
""").cache() \
.createOrReplaceTempView("<table2>")

spark.sql("""
SELECT * FROM [<schema3>.]<measure table1>
""") \
.createOrReplaceTempView("<table3>")

spark.sql("""
SELECT <column list>
FROM <table1>
JOIN <table2> ON <join condition1>
{INNER|LEFT|RIGHT|FULL} JOIN <table3> ON <join condition2>
[WHERE <filter condition>]""").limit(<row limit>)
Query with table joins
SELECT <column list>
FROM [`<schema1>`.]`<table1>`
{INNER|LEFT|RIGHT|FULL} JOIN [`<schema2>`.]`<dimension table1>`
ON <join condition1>
{INNER|LEFT|RIGHT|FULL} JOIN [`<schema3>`.]`<measure table1>`
...
ON <join condition2>
[WHERE <filter condition>]
SELECT <column list>
FROM [<schema1>.]<table1>
{INNER|LEFT|RIGHT|FULL} JOIN [<schema2>.]<dimension table1>
ON <join condition1>
{INNER|LEFT|RIGHT|FULL} JOIN [<schema3>.]<measure table1>
...
ON <join condition2>
[WHERE <filter condition>]
spark.sql("""
SELECT * FROM [<schema1>.]<table1>
""") \
.createOrReplaceTempView("<table1>")

spark.sql("""
SELECT * FROM [<schema2>.]<dimension table1>
""").cache() \
.createOrReplaceTempView("<table2>")

spark.sql("""
SELECT * FROM [<schema3>.]<measure table1>
""") \
.createOrReplaceTempView("<table3>")

spark.sql("""
SELECT <column list>
FROM <table1>
JOIN <table2> ON <join condition1>
{INNER|LEFT|RIGHT|FULL} JOIN <table3> ON <join condition2>
[WHERE <filter condition>]""")
Update table records with a join
CREATE TABLE [`<schema1>`.]`<table1>` (
<column1> <type1>[ AUTO_INCREMENT],
<column2> <type2>[ AUTO_INCREMENT],
<column3> <type3>[ AUTO_INCREMENT],
...
<column n> <type n>[ AUTO_INCREMENT][,
PRIMARY KEY (<key column>)][,
UNIQUE (<unique column1>, <unique column2>, <unique column3>, ..., <unique column m>)]
) [ENGINE={InnoDB|MYISAM|BDB} DEFAULT CHARSET={utf8|gbk}]
 
INSERT INTO [`<schema1>`.]`<table1>`
SELECT <key column>,
<unchanged column1>,
<unchanged column2>,
<unchanged column3>,
...
<unchanged column n>,
<changed column1>,
<changed column2>,
<changed column3>,
...
<changed column n>
FROM [`<schema2>`.]`<table2>`
WHERE <filter condition>
 
 
UPDATE <alias1>
FROM [`<schema1>`.]`<table1>` AS <alias1>, [`<schema3>`.]`<table3>` SET
<changed column1> = <new value1>,
<changed column2> = <new value2>,
<changed column3> = <new value3>,
...
<changed column n> = <new value n>
 
WHERE <join condition>
[AND <filter condition>]
CREATE [MULTISET] TABLE [<schema1>.]<table1>[,
<option1>,
<option2>,
<option3>,
...
<option n>]
(
<column1> <type1> [CHARACTER SET <charset1> NOT CASESPECIFIC],
<column2> <type2> [CHARACTER SET <charset2> NOT CASESPECIFIC],
<column3> <type3> [CHARACTER SET <charset3> NOT CASESPECIFIC],
...
<column n> <type n> [CHARACTER SET <charset n> NOT CASESPECIFIC]
)
[UNIQUE] [PRIMARY INDEX (<key column>)]
 
INSERT INTO [<schema1>.]<table1>
SELECT <key column>,
<unchanged column1>,
<unchanged column2>,
<unchanged column3>,
...
<unchanged column n>,
<changed column1>,
<changed column2>,
<changed column3>,
...
<changed column n>
FROM [<schema2>.]<table2>
WHERE <filter condition>
 
 
UPDATE <alias1>
FROM [<schema1>.]<table1> AS <alias1>, [<schema3>.]<table3> SET
<changed column1> = <new value1>,
<changed column2> = <new value2>,
<changed column3> = <new value3>,
...
<changed column n> = <new value n>
 
WHERE <join condition>
[AND <filter condition>]
spark.sql("""
SELECT * FROM <schema2>.<table2>
""") \
.createOrReplaceTempView("<table1>")

spark.sql("""
SELECT * FROM <schema3>.<table3>
""") \
.createOrReplaceTempView("<table2>")

spark.sql("""
SELECT <alias1>.<key column>,
<unchanged column1>,
<unchanged column2>,
<unchanged column3>,
...
<unchanged column n>,
if(<alias2>.<key column> is null, <alias1>.<changed column1>, <new value1>) AS <changed column1>,
if(<alias2>.<key column> is null, <alias1>.<changed column2>, <new value2>) AS <changed column2>,
if(<alias2>.<key column> is null, <alias1>.<changed column3>, <new value3>) AS <changed column3>,
...
if(<alias2>.<key column> is null, <alias1>.<changed column n>, <new value n>) AS <changed column n>
FROM <table1> AS <alias1>
INNER JOIN <table2> AS <alias2>
ON <join condition>
[WHERE <filter condition>]""") \
.write.parquet(
    path='<storage path>/<table1>',
    mode='overwrite')
Merge (upsert) data
REPLACE INTO [`<schema>`.]`<table>` (<key column>, <column1>, <column2>, <column3>, ..., <column n>) VALUES (<key value>, <value1>, <value2>, <value3>, ..., <value n>)

LOAD DATA LOCAL INFILE '<storage path>/<file>' REPLACE INTO TABLE [`<schema>`.]`<table>`

INSERT INTO [`<schema>`.]`<table>` (<key column>, <column1>, <column2>, <column3>, ..., <column n>)
VALUES (<key value1>, <value1>, <value2>, <value3>, ..., <value n>), (<key value2>, <value n+1>, <value n+2>, <value n+3>, ..., <value 2n>)
ON DUPLICATE KEY UPDATE <column1> = VALUES(<column1>), <column2> = VALUES(<column2>), <column3> = VALUES(<column3>), ..., <column n> = VALUES(<column n>)

insert into [`<schema>`.]`<table>` (<key column>, <column1>, <column2>, <column3>, ..., <column n>) select * from dupnew on duplicate key update <column1> = VALUES(<column1>), <column2> = VALUES(<column2>), <column3> = VALUES(<column3>), ..., <column n> = VALUES(<column n>)

insert ignore into [`<schema>`.]`<table>` (<key column>, <column1>, <column2>, <column3>, ..., <column n>) values (<key value1>, <value1>, <value2>, <value3>, ..., <value n>)

INSERT IGNORE INTO [`<schema>`.]`<table1>` SELECT <key column>, <column1>, <column2>, <column3>, ..., <column n> FROM [`<schema>`.]`<table2>`

SELECT <key column>, <column1>, <column2>, <column3>, ..., <column n> FROM [`<schema>`.]`<table1>` UNION DISTINCT SELECT <key column>, <column1>, <column2>, <column3>, ..., <column n> FROM [`<schema>`.]`<table2>`

Create the test tables:
drop table test_a;
create table test_a(
id VARCHAR (16),
name VARCHAR (16),
Operatime datetime
);
drop table test_b;
create table test_b(
id VARCHAR (16),
name VARCHAR (16),
Operatime datetime
);

Insert sample data:
INSERT into test_b values('1','1',now()),('2','2',now());
INSERT into test_a values('1','1',now()),('3','3',now());

Query the data:
SELECT * FROM test_b;
SELECT * FROM test_a;
delimiter //
CREATE PROCEDURE merge_a_to_b () BEGIN
-- variables holding the row from table a that is to be merged into table b
DECLARE _ID VARCHAR (16);
DECLARE _NAME VARCHAR (16);
-- end-of-data flag for the cursor loop
DECLARE done INT DEFAULT FALSE;
-- cursor positioned at the first row of the result set over table a
DECLARE cur_account CURSOR FOR SELECT ID, NAME FROM test_a;
-- when the cursor moves past the last row, set the end flag
DECLARE CONTINUE HANDLER FOR NOT FOUND SET done = TRUE;
-- open the cursor
OPEN cur_account;
-- iterate over the cursor
read_loop:
LOOP
-- fetch the values at the cursor's current position into the variables
FETCH NEXT FROM cur_account INTO _ID, _NAME;

-- if there is nothing left to fetch, leave the loop
IF done THEN LEAVE read_loop;
END IF;

-- for the current row: if it already exists in table b, update the timestamp; otherwise insert it
IF NOT EXISTS ( SELECT 1 FROM TEST_B WHERE ID = _ID AND NAME = _NAME )
THEN
INSERT INTO TEST_B (ID, NAME, operatime) VALUES (_ID, _NAME, now());
ELSE
UPDATE TEST_B set operatime = now() WHERE ID = _ID AND NAME = _NAME;
END IF;

END LOOP;
CLOSE cur_account;

END //
delimiter ;


merge into [<schema1>.]<table1> <alias1>
using [<schema2>.]<table2> <alias2>
on (<alias1>.<join column1> = <alias2>.<join column2>)
when matched then
update set <column1> = <alias2>.<column1>, <column2> = <alias2>.<column2>, <column3> = <alias2>.<column3>, ..., <column n> = <alias2>.<column n>
when not matched then
insert values(<alias2>.<join column2>, <alias2>.<column2>, <alias2>.<column3>, ..., <alias2>.<column n>)
In PySpark:
<table1>_df = spark.read.load(
    path='<storage path 1>/<table1>',
    format='parquet', header=True)

<table2>_df = spark.read.load(
    path='<storage path 2>/<table2>',
    format='parquet', header=True)

<table1>_df.createOrReplaceTempView("<table1>")
<table2>_df.createOrReplaceTempView("<table2>")

<table1>_merge_df = spark.sql("""
SELECT ifnull(ODS.<key column>, STG.<key column>) AS <key column>, ifnull(ODS.<column1>, STG.<column1>) AS <column1>, ifnull(ODS.<column2>, STG.<column2>) AS <column2>, ifnull(ODS.<column3>, STG.<column3>) AS <column3>, ..., ifnull(ODS.<column n>, STG.<column n>) AS <column n> FROM
(
SELECT <key column>, <column1>, <column2>, <column3>, ..., <column n>
FROM <table2>
) STG
FULL JOIN <table1> AS ODS ON STG.<key column> = ODS.<key column>
""")

<table1>_merge_df.write.parquet(
    path='<storage path 1>/<table1>_merge',
    mode='overwrite')

Then, in a second step:
<table1>_merge_df = spark.read.load(
    path='<storage path 1>/<table1>_merge',
    format='parquet', header=True)

<table1>_merge_df.write.parquet(
    path='<storage path 1>/<table1>',
    mode='overwrite')

For Hive tables:
CREATE TABLE [`<db>`.]`<table>` (
`<column1>` <type1>,
`<column2>` <type2>,
`<column3>` <type3>,
...
`<column n>` <type n>
)
CLUSTERED BY (<key column>) INTO <number> buckets
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS orc
TBLPROPERTIES('transactional' = 'true')

MERGE INTO [`<db>`.]`<table>` AS <alias1>
USING (
<query statement>
) AS <alias2>
ON ...
WHEN MATCHED
THEN
UPDATE
SET <column1> = <value1>,
<column2> = <value2>,
<column3> = <value3>,
...
<column n> = <value n>
Query the top-ranked row per group
SELECT <column1>,
<column2>,
<column3>,
...
<column n>
FROM (
SELECT <column1>,
<column2>,
<column3>,
...
<column n>,
ROW_NUMBER() OVER (
PARTITION BY <group column> ORDER BY <sort column> [DESC]
) AS rn
FROM [`<schema>`.]`<table>`
[WHERE <filter condition>]
) t
WHERE rn = 1
SELECT <column1>,
<column2>,
<column3>,
...
<column n>
FROM [<schema>.]<table>
[WHERE <filter condition>]
QUALIFY ROW_NUMBER() OVER(PARTITION BY <group column> ORDER BY <sort column> [DESC]) = 1
<table>_df = spark.sql("""
SELECT * FROM [<schema>.]<table>
""")
<table>_df.createOrReplaceTempView("<table>")
<table>_unique_df = spark.sql("""
SELECT <column1>,
<column2>,
<column3>,
...
<column n>
FROM (
SELECT <column1>,
<column2>,
<column3>,
...
<column n>,
ROW_NUMBER() OVER (
PARTITION BY <group column> ORDER BY <sort column> [DESC]
) AS rn
FROM <table>
[WHERE <filter condition>]
) t
WHERE rn = 1""")

For Hive tables:
from pyspark.sql import HiveContext

hive = HiveContext(spark.sparkContext)
hive.sql("""
SELECT
<column1>,
<column2>,
<column3>,
...
<column n>
FROM (
SELECT
<column1>,
<column2>,
<column3>,
...
<column n>,
ROW_NUMBER() OVER (
PARTITION BY <group column> ORDER BY <sort column>
) AS rn
FROM [`<db>`.]`<table>`
[WHERE <filter condition>]
) t
WHERE rn = 1""")
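The ROW_NUMBER ... WHERE rn = 1 pattern above is simply "keep the first row per group after sorting"; a plain-Python equivalent (illustrative helper name and sample data) makes the semantics easy to verify:

```python
from itertools import groupby

def first_per_group(rows, group_key, order_key, descending=False):
    # equivalent to ROW_NUMBER() OVER (PARTITION BY group ORDER BY order) ... WHERE rn = 1
    rows = sorted(rows, key=lambda r: r[group_key])
    out = []
    for _, grp in groupby(rows, key=lambda r: r[group_key]):
        out.append(sorted(grp, key=lambda r: r[order_key], reverse=descending)[0])
    return out

rows = [{"id": "a", "ts": 3}, {"id": "a", "ts": 5}, {"id": "b", "ts": 1}]
latest = first_per_group(rows, "id", "ts", descending=True)
```

With descending=True this keeps the latest record per id, the usual deduplication use of the SQL pattern.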
String concatenation
SELECT CONCAT(<string column/constant 1>, <string column/constant 2>)
SELECT <string column/constant 1> || <string column/constant 2>
spark.sql("""
SELECT CONCAT(<string column/constant 1>, <string column/constant 2>)""")
Aggregate values within groups
SELECT <dimension column1>,
<dimension column2>,
<dimension column3>,
...
<dimension column n>,
<aggregate function1>(<measure column1>) AS <alias1>,
<aggregate function2>(<measure column2>) AS <alias2>,
<aggregate function3>(<measure column3>) AS <alias3>,
...
<aggregate function m>(<measure column m>) AS <alias m>
FROM [<schema>.]<table>
GROUP BY 1,2,3,...,n
SELECT <dimension column1>,
<dimension column2>,
<dimension column3>,
...
<dimension column n>,
<aggregate function1>(<measure column1>) AS <alias1>,
<aggregate function2>(<measure column2>) AS <alias2>,
<aggregate function3>(<measure column3>) AS <alias3>,
...
<aggregate function m>(<measure column m>) AS <alias m>
FROM [<schema>.]<table>
GROUP BY 1,2,3,...,n
spark.sql("""
SELECT <dimension column1>,
<dimension column2>,
<dimension column3>,
...
<dimension column n>,
<aggregate function1>(<measure column1>) AS <alias1>,
<aggregate function2>(<measure column2>) AS <alias2>,
<aggregate function3>(<measure column3>) AS <alias3>,
...
<aggregate function m>(<measure column m>) AS <alias m>
FROM [<schema>.]<table>
GROUP BY <dimension column1>,
<dimension column2>,
<dimension column3>,
...
<dimension column n>""")
 
DECIMAL type conversion
SELECT CAST(<numeric column/constant 1> AS DECIMAL(38,2)), CAST(<numeric column/constant 2> AS DECIMAL(38,2))
SELECT CAST(<numeric column/constant 1> AS DECIMAL(38,2)), CAST(<numeric column/constant 2> AS DECIMAL(38,2))
<variable> = spark.sql("""
SELECT <numeric column/constant 1> * 100, <numeric column/constant 2> * 100""")
NULL-value replacement
IFNULL(exp1, exp2)

or COALESCE(exp1, exp2)
Returns exp1 when exp1 is not NULL, otherwise exp2.
NULLIF(exp1, exp2)

or COALESCE(exp1, exp2)
NULLIF returns NULL when exp1 equals exp2 and exp1 otherwise; COALESCE returns the first non-NULL argument.
>>> df = spark.createDataFrame([(1,), (2,), (3,), (None,)], ['col'])
>>> df.show()
+----+
| col|
+----+
|   1|
|   2|
|   3|
|null|
+----+
>>> df = df.fillna({'col': 4})
>>> df.show()
or df.fillna({'col': 4}).show()
+---+
|col|
+---+
|  1|
|  2|
|  3|
|  4|
+---+

spark.sql("""
SELECT IFNULL(exp1, exp2) ...
""")
Get year/month/day and today's date in the China time zone
SELECT YEAR(CURRENT_DATE()), MONTH(CURRENT_DATE()), DAY(CURRENT_DATE()), CONVERT_TZ(create_time, @@session.time_zone, '+8:00')
or
SET time_zone = 'Asia/Shanghai';
select now();
SELECT EXTRACT(YEAR FROM CURRENT_DATE), EXTRACT(MONTH FROM CURRENT_DATE), EXTRACT(DAY FROM CURRENT_DATE), CAST(CONVERT_TIMEZONE('Asia/Shanghai', CAST(GETDATE() AS TIMESTAMP)) AS DATE)
<variable> = spark.sql("""
SELECT YEAR(CURRENT_DATE), MONTH(CURRENT_DATE), DAY(CURRENT_DATE), CAST(CONVERT_TIMEZONE('Asia/Shanghai', CAST(GETDATE() AS TIMESTAMP)) AS DATE)""")
Compute the number of days between two timestamps
SELECT TIMESTAMPDIFF(DAY, <start timestamp>, <end timestamp>)
SELECT EXTRACT(DAY FROM ((<end timestamp> - <start timestamp>) DAY(4) TO SECOND))
from pyspark.sql.functions import col, to_timestamp, current_timestamp

dates = [("1", "2019-07-01 12:01:19.111"),
         ("2", "2019-06-24 12:01:19.222"),
         ("3", "2019-11-16 16:44:55.406"),
         ("4", "2019-11-16 16:50:59.406")
]

df = spark.createDataFrame(data=dates, schema=["id", "input_timestamp"])

# calculate the time difference in days
<variable> = df.withColumn('from_timestamp', to_timestamp(col('input_timestamp'))) \
    .withColumn('end_timestamp', current_timestamp()) \
    .withColumn('DiffInDays', (col('end_timestamp').cast("long") - col('from_timestamp').cast("long")) / (24*60*60))
<variable>.show(truncate=False)
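The seconds-to-days arithmetic used in the PySpark column expression (cast both timestamps to epoch seconds, subtract, divide by 24*60*60) can be checked with plain datetime objects:

```python
from datetime import datetime

def diff_in_days(start, end):
    # same arithmetic as the column expression above:
    # (end.cast("long") - start.cast("long")) / (24*60*60)
    return (end - start).total_seconds() / (24 * 60 * 60)

d = diff_in_days(datetime(2019, 7, 1, 12, 0, 0), datetime(2019, 7, 4, 0, 0, 0))
```

Two and a half days between the two instants yields 2.5, including the fractional part that whole-day functions like TIMESTAMPDIFF(DAY, ...) would drop.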
List a table's column information
Query the system table:
SELECT
ColumnId,                 -- column position (matches the order in the table definition)
DataBaseName,             -- database the table belongs to
TableName,                -- table name
DefaultValue,             -- default value
ColumnName,               -- column name
ColumnTitle,              -- column title
ColumnType,               -- column type
ColumnLength,             -- column length
DecimalTotalDigits,       -- precision
DecimalFractionalDigits,  -- scale
ColumnFormat              -- format
FROM
DBC.Columns
WHERE
DatabaseName = '<db>'
AND TableName = '<table>'
ORDER BY 1
Or inspect the table definition:
SHOW TABLE [<db>.]<table>
Column type mapping
 
Column type code
Mapped type
Formatting rule
CF
CHAR
a. ASCII (LATIN) encoding: CHAR(length)
b. UNICODE encoding: CHAR(length/2)
CV
VARCHAR
a. ASCII (LATIN) encoding: VARCHAR(length)
b. UNICODE encoding: VARCHAR(length/2)
D
DECIMAL
DECIMAL(precision, scale)
DA
DATE
DATE FORMAT '<format>'
I
INTEGER
INTEGER
I8
BIGINT
BIGINT

SELECT

COLUMN_NAME AS `column name`,

DATA_TYPE AS `data type`,

CHARACTER_MAXIMUM_LENGTH AS `character length`,

NUMERIC_PRECISION AS `numeric precision`,

NUMERIC_SCALE AS `numeric scale`,

IS_NULLABLE AS `nullable`,

CASE WHEN EXTRA = 'auto_increment' THEN 1 ELSE 0 END AS `auto increment`,

COLUMN_DEFAULT AS `default value`,

COLUMN_COMMENT AS `comment`

FROM information_schema.COLUMNS WHERE TABLE_SCHEMA = '<db>' AND TABLE_NAME = '<table>'
df = ...

df.schema

df.printSchema()

for name, dtype in df.dtypes:
    print(name, dtype)
Partition operations
1. Check whether MySQL supports partitioning
(1) MySQL 5.6 and earlier:
show variables like 'partition'

(2) MySQL 5.7:
show plugins

 
2. Partition table types and limitations
(1) Partition table types
RANGE partitioning: assigns rows to partitions based on column values falling within given contiguous ranges.
 
LIST partitioning: similar to RANGE partitioning, except that LIST partitioning chooses the partition by matching the column value against a discrete set of values.
 
HASH partitioning: chooses the partition based on the return value of a user-defined expression computed over the column values of the rows to be inserted; the function can be any expression valid in MySQL that yields a non-negative integer value.
 
KEY partitioning: similar to HASH partitioning, except that KEY partitioning does not support user-defined expressions; the hashing function is supplied by the MySQL server itself, and the column or columns used may contain non-integer values.
 
Composite partitioning: as of MySQL 5.6, sub-partitioning of RANGE and LIST partitions is supported, and the sub-partition type can be HASH or KEY.
 
(2) Partition table limitations
1) The partition key must be included in the table's primary key or unique keys.
 
2) MySQL only prunes partitions when the filter compares against the partition column itself; filtering on an expression over the column (even one equal to the partitioning function) does not prune partitions.
 
3) Maximum number of partitions: for a table not using the NDB storage engine, the maximum is 8192 partitions (including sub-partitions). If the partition count is well below 8192 and you still get the error "Got error ... from storage engine: Out of resources when opening file", increasing the open_files_limit system variable may solve the problem; note, however, that the number of open files may also be limited by the operating system.
 
4) No query-cache support: partitioned tables do not support the query cache; it is automatically disabled for queries involving partitioned tables and cannot be enabled for such queries.
 
5) Partitioned InnoDB tables do not support foreign keys.
 
6) The server SQL_mode affects replication of partitioned tables. Different SQL modes on source and replica can cause the same statement to distribute data to different partitions, or even to insert successfully on the source while failing on the replica. For best results, always run the same server SQL mode on source and replica.
 
7) ALTER TABLE ... ORDER BY: running an ALTER TABLE ... ORDER BY <column> statement on a partitioned table only sorts rows within each partition.
 
8) Full-text indexes: partitioned tables do not support full-text indexes, not even partitioned tables using the InnoDB or MyISAM storage engine.
9) Foreign-key constraints cannot be used on partitioned tables.
10) Spatial columns: columns with spatial data types (POINT, GEOMETRY) cannot be used in partitioned tables.
11) Temporary tables: temporary tables cannot be partitioned.
12) Sub-partitions: sub-partitions must use HASH or KEY partitioning; only RANGE and LIST partitions may have sub-partitions, and the sub-partitions themselves must be HASH or KEY.
13) Partitioned tables do not support mysqlcheck, myisamchk, or myisampack.
III. Creating partitioned tables
1. RANGE partitioning
Rows are placed into partitions based on column values falling within given contiguous ranges.
CREATE TABLE `test_11` (
  `id` int(11) NOT NULL,
  `t` date NOT NULL,
  PRIMARY KEY (`id`,`t`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
PARTITION BY RANGE (to_days(t))
(PARTITION p20170801 VALUES LESS THAN (736907) ENGINE = InnoDB,
 PARTITION p20170901 VALUES LESS THAN (736938) ENGINE = InnoDB,
 PARTITION pmax VALUES LESS THAN maxvalue ENGINE = InnoDB);
Then insert 4 rows:
insert into test_11 values (1,'2017-07-22'),(2,'2017-08-22'),(3,'2017-08-23'),(4,'2017-08-24');
Then check the partition statistics in information_schema.partitions:
select PARTITION_NAME as pname, TABLE_ROWS as cnt from information_schema.partitions where table_schema='mysql_test' and table_name='test_11';
+-----------+------+
| pname     | cnt  |
+-----------+------+
| p20170801 |    1 |
| p20170901 |    3 |
+-----------+------+
2 rows in set (0.00 sec)
As shown, 1 row went into partition p20170801 and 3 rows went into p20170901.
You can use functions such as year(), to_days(), or unix_timestamp() to convert the relevant time column and then partition on the result.
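The partition boundaries 736907 and 736938 above are TO_DAYS() values. They can be sanity-checked in plain Python (a sketch assuming only the standard library; the +365 offset between MySQL's day-zero calendar and Python's ordinals is the point being illustrated):

```python
from datetime import date

def to_days(d: date) -> int:
    # MySQL TO_DAYS counts days from year 0; Python ordinals start at
    # 0001-01-01 == 1, so the two scales differ by 365 days.
    return d.toordinal() + 365

# The partition boundaries used in the example above:
print(to_days(date(2017, 8, 1)))  # 736907
print(to_days(date(2017, 9, 1)))  # 736938
```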
2. LIST partitioning
Like RANGE partitioning, but aimed at discrete values.
mysql> CREATE TABLE h2 (
    ->   c1 INT,
    ->   c2 INT
    -> )
    -> PARTITION BY LIST(c1) (
    ->   PARTITION p0 VALUES IN (1, 4, 7),
    ->   PARTITION p1 VALUES IN (2, 5, 8)
    -> );
Query OK, 0 rows affected (0.11 sec)
Unlike RANGE partitioning, there is no catch-all such as MAXVALUE; all expected values of the partitioning expression should be covered by PARTITION ... VALUES IN (...) clauses. An INSERT statement containing an unmatched partitioning column value fails with an error, as shown:
mysql> INSERT INTO h2 VALUES (3, 5);
ERROR 1525 (HY000): Table has no partition for value 3
3. HASH partitioning
Partitions based on the return value of a user-defined expression; the return value must not be negative.
CREATE TABLE t1 (col1 INT, col2 CHAR(5), col3 DATE)
PARTITION BY HASH( YEAR(col3) )
PARTITIONS 4;
If you insert a row whose col3 value is '2005-09-15', the target partition is chosen by:
MOD(YEAR('2005-09-15'), 4)
= MOD(2005, 4)
= 1
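The MOD arithmetic above is easy to sanity-check in plain Python; this hypothetical helper simply mirrors MOD(YEAR(col3), 4):

```python
def hash_partition(year: int, num_partitions: int = 4) -> int:
    # PARTITION BY HASH(YEAR(col3)) PARTITIONS 4 routes a row to MOD(year, 4)
    return year % num_partitions

print(hash_partition(2005))  # 1 -> partition p1
```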
4. KEY partitioning
Partitions using a hash function supplied by the MySQL server.
CREATE TABLE k1 (
  id INT NOT NULL,
  name VARCHAR(20),
  UNIQUE KEY (id)
)
PARTITION BY KEY()
PARTITIONS 2;
KEY takes a list of zero or more column names. Any columns used as the partitioning key must comprise part or all of the table's primary key, if the table has one. If no column name is listed as the partitioning key, the table's primary key is used (if any); if there is no primary key but there is a unique key, the unique key is used as the partitioning key. If the unique key columns are not defined as NOT NULL, the statement fails.
Unlike the other partition types, KEY partitioning is not restricted to integer or NULL values. For example, the following CREATE TABLE statement is valid:
CREATE TABLE tm1 (
  s1 CHAR(32) PRIMARY KEY
)
PARTITION BY KEY(s1)
PARTITIONS 10;
Note: you cannot run ALTER TABLE DROP PRIMARY KEY on a KEY-partitioned table; doing so raises "ERROR 1466 (HY000): Field in list of fields for partition function not found in table".
5. COLUMNS partitioning
COLUMNS partitioning was introduced in MySQL 5.5. It comes in two flavors, RANGE COLUMNS and LIST COLUMNS, supports integer, date, and string types, and is otherwise very similar to RANGE and LIST partitioning.
Differences between COLUMNS and RANGE/LIST partitioning:
1) For date columns, no conversion function is needed; for example, a date column does not need to be wrapped in a YEAR() expression.
2) COLUMNS partitioning supports multiple columns as the partitioning key, but does not support expressions as the partitioning key.
Data types supported by COLUMNS:
1) Integer types; float and decimal are not supported.
2) Date types: date and datetime; other date types are not supported.
3) String types: CHAR, VARCHAR, BINARY, and VARBINARY; blob and text are not supported.
Single-column RANGE COLUMNS partitioning: mysql> show create table list_c;
CREATE TABLE `list_c` (
  `c1` int(11) DEFAULT NULL,
  `c2` int(11) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1
/*!50500 PARTITION BY RANGE COLUMNS(c1)
(PARTITION p0 VALUES LESS THAN (5) ENGINE = InnoDB,
 PARTITION p1 VALUES LESS THAN (10) ENGINE = InnoDB) */
Multi-column RANGE COLUMNS partitioning: mysql> show create table list_c;
CREATE TABLE `list_c` (
  `c1` int(11) DEFAULT NULL,
  `c2` int(11) DEFAULT NULL,
  `c3` char(20) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1
/*!50500 PARTITION BY RANGE COLUMNS(c1,c3)
(PARTITION p0 VALUES LESS THAN (5,'aaa') ENGINE = InnoDB,
 PARTITION p1 VALUES LESS THAN (10,'bbb') ENGINE = InnoDB) */
Single-column LIST COLUMNS partitioning: mysql> show create table list_c;
CREATE TABLE `list_c` (
  `c1` int(11) DEFAULT NULL,
  `c2` int(11) DEFAULT NULL,
  `c3` char(20) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1
/*!50500 PARTITION BY LIST COLUMNS(c3)
(PARTITION p0 VALUES IN ('aaa') ENGINE = InnoDB,
 PARTITION p1 VALUES IN ('bbb') ENGINE = InnoDB) */
6. Subpartitions (composite partitioning)
Further partitioning an existing partition yields composite partitioning.
MySQL allows HASH or KEY subpartitioning of RANGE and LIST partitions, for example:
CREATE TABLE ts (id INT, purchased DATE)
    PARTITION BY RANGE( YEAR(purchased) )
    SUBPARTITION BY HASH( TO_DAYS(purchased) )
    SUBPARTITIONS 2 (
        PARTITION p0 VALUES LESS THAN (1990),
        PARTITION p1 VALUES LESS THAN (2000),
        PARTITION p2 VALUES LESS THAN MAXVALUE
    );
[root@mycat3 ~]# ll /data/mysql_data_3306/mysql_test/ts*
-rw-r----- 1 mysql mysql  8596 Aug  8 13:54 /data/mysql_data_3306/mysql_test/ts.frm
-rw-r----- 1 mysql mysql 98304 Aug  8 13:54 /data/mysql_data_3306/mysql_test/ts#P#p0#SP#p0sp0.ibd
-rw-r----- 1 mysql mysql 98304 Aug  8 13:54 /data/mysql_data_3306/mysql_test/ts#P#p0#SP#p0sp1.ibd
-rw-r----- 1 mysql mysql 98304 Aug  8 13:54 /data/mysql_data_3306/mysql_test/ts#P#p1#SP#p1sp0.ibd
-rw-r----- 1 mysql mysql 98304 Aug  8 13:54 /data/mysql_data_3306/mysql_test/ts#P#p1#SP#p1sp1.ibd
-rw-r----- 1 mysql mysql 98304 Aug  8 13:54 /data/mysql_data_3306/mysql_test/ts#P#p2#SP#p2sp0.ibd
-rw-r----- 1 mysql mysql 98304 Aug  8 13:54 /data/mysql_data_3306/mysql_test/ts#P#p2#SP#p2sp1.ibd
Table ts is RANGE-partitioned on purchased and then HASH-subpartitioned, giving 3*2 partitions, as the physical files confirm. You can also explicitly name the subpartitions with the SUBPARTITION syntax.
Note: the number of subpartitions must be the same in every partition. If any partition of the table names its subpartitions with SUBPARTITION, all partitions must do so, every SUBPARTITION clause must include a subpartition name, and subpartition names must be unique across the table.
Additionally, MyISAM tables can use INDEX DIRECTORY and DATA DIRECTORY to place each partition's data and index in separate directories; for InnoDB tables, the storage engine manages data and index automatically via tablespaces and ignores the INDEX/DATA directory syntax.


IV. Converting an ordinary table into a partitioned table
1. Use an alter table table_name partition by statement to rebuild the table as a partitioned table:

alter table jxfp_data_bak PARTITION BY KEY(SH) PARTITIONS 8;
 
V. Partitioned table operations
CREATE TABLE t1 (
  id INT,
  year_col INT
)
PARTITION BY RANGE (year_col) (
  PARTITION p0 VALUES LESS THAN (1991),
  PARTITION p1 VALUES LESS THAN (1995),
  PARTITION p2 VALUES LESS THAN (1999)
);
 
1. ADD PARTITION (add a partition)
ALTER TABLE t1 ADD PARTITION (PARTITION p3 VALUES LESS THAN (2002));

2. DROP PARTITION (drop partitions)
ALTER TABLE t1 DROP PARTITION p0, p1;

3. TRUNCATE PARTITION (truncate partitions)
ALTER TABLE t1 TRUNCATE PARTITION p0;

ALTER TABLE t1 TRUNCATE PARTITION p1, p3;
 
4. COALESCE PARTITION (merge partitions)
CREATE TABLE t2 (
  name VARCHAR (30),
  started DATE
)
PARTITION BY HASH( YEAR(started) )
PARTITIONS 6;

ALTER TABLE t2 COALESCE PARTITION 2;
 
5. REORGANIZE PARTITION (split/reorganize partitions)
1) Splitting a partition

ALTER TABLE table ALGORITHM=INPLACE, REORGANIZE PARTITION;

ALTER TABLE employees ADD PARTITION (
  PARTITION p5 VALUES LESS THAN (2010),
  PARTITION p6 VALUES LESS THAN MAXVALUE
);

2) Reorganizing partitions

ALTER TABLE members REORGANIZE PARTITION s0,s1 INTO (
  PARTITION p0 VALUES LESS THAN (1970)
);
ALTER TABLE tbl_name
  REORGANIZE PARTITION partition_list
  INTO (partition_definitions);
ALTER TABLE members REORGANIZE PARTITION p0,p1,p2,p3 INTO (
  PARTITION m0 VALUES LESS THAN (1980),
  PARTITION m1 VALUES LESS THAN (2000)
);
ALTER TABLE tt ADD PARTITION (PARTITION np VALUES IN (4, 8));
ALTER TABLE tt REORGANIZE PARTITION p1,np INTO (
  PARTITION p1 VALUES IN (6, 18),
  PARTITION np VALUES in (4, 8, 12)
);
 
6. ANALYZE / CHECK PARTITION (analyze and check partitions)
1) ANALYZE reads and stores the value distribution of the partitions:

ALTER TABLE t1 ANALYZE PARTITION p1, ANALYZE PARTITION p2;

ALTER TABLE t1 ANALYZE PARTITION p1, p2;

2) CHECK verifies that the partitions are free of errors:

ALTER TABLE t1 ANALYZE PARTITION p1, CHECK PARTITION p2;

7. REPAIR PARTITION
Repairs corrupted partitions:

ALTER TABLE t1 REPAIR PARTITION p0, p1;

8. OPTIMIZE
Reclaims unused space and defragments the partitions. Running this command is equivalent to running CHECK PARTITION, ANALYZE PARTITION, and REPAIR PARTITION once per partition.

ALTER TABLE t1 OPTIMIZE PARTITION p0, p1;

9. REBUILD PARTITION
Rebuilds the partition, equivalent to deleting all rows in the partition and reinserting them; can be used for defragmentation:

ALTER TABLE t1 REBUILD PARTITION p0, p1;
 
10. EXCHANGE PARTITION (partition exchange)
Partition exchange syntax:

ALTER TABLE pt EXCHANGE PARTITION p WITH TABLE nt;

where pt is the partitioned table, p is a partition of pt (note: a partition, not a subpartition), and nt is the target table.

Partition exchange actually carries quite a few restrictions:

1) nt must not be a partitioned table;

2) nt must not be a temporary table;

3) nt and pt must have identical structures;

4) nt must have no foreign key constraints, nor may any foreign keys reference it;

5) no data in nt may lie outside the range of partition p.

See the official MySQL documentation for details.
 
11. Migrating partitions (DISCARD / IMPORT)
ALTER TABLE t1 DISCARD PARTITION p2, p3 TABLESPACE;

ALTER TABLE t1 IMPORT PARTITION p2, p3 TABLESPACE;
 
Test environment (MySQL 5.7):
Source: 192.168.2.200, MySQL 5.7.16, partitioned table zhangdb.emp_2
Target: 192.168.2.100, MySQL 5.7.18, test (import zhangdb.emp_2 into the test schema)
Step 1: create the test partitioned table emp_2 on the source database, then load data.
MySQL [zhangdb]> CREATE TABLE emp_2(
  id BIGINT unsigned NOT NULL AUTO_INCREMENT,
  x VARCHAR(500) NOT NULL,
  y VARCHAR(500) NOT NULL,
  PRIMARY KEY(id)
)
PARTITION BY RANGE COLUMNS(id)
(
  PARTITION p1 VALUES LESS THAN (1000),
  PARTITION p2 VALUES LESS THAN (2000),
  PARTITION p3 VALUES LESS THAN (3000)
);
(Then create a stored procedure to load the test data.)
DELIMITER $$
CREATE PROCEDURE insert_batch()
begin
  DECLARE num INT;
  SET num = 1;
  WHILE num < 3000 DO
    IF (num % 100000 = 0) THEN
      COMMIT;
    END IF;
    INSERT INTO emp_2 VALUES(NULL, REPEAT('X', 500), REPEAT('Y', 500));
    SET num = num + 1;
  END WHILE;
  COMMIT;
END $$
DELIMITER ;
mysql> select TABLE_NAME, PARTITION_NAME from information_schema.partitions where table_schema='zhangdb';
+------------+----------------+
| TABLE_NAME | PARTITION_NAME |
+------------+----------------+
| emp        | NULL           |
| emp_2      | p1             |
| emp_2      | p2             |
| emp_2      | p3             |
+------------+----------------+
4 rows in set (0.00 sec)
mysql> select count(*) from emp_2 partition (p1);
+----------+
| count(*) |
+----------+
|      999 |
+----------+
1 row in set (0.00 sec)
mysql> select count(*) from emp_2 partition (p2);
+----------+
| count(*) |
+----------+
|     1000 |
+----------+
1 row in set (0.00 sec)
mysql> select count(*) from emp_2 partition (p3);
+----------+
| count(*) |
+----------+
|     1000 |
+----------+
1 row in set (0.00 sec)
As shown above, the partitioned table emp_2 has been created with 3 partitions, each holding data.
Step 2: on the target database, create the emp_2 table structure without data (use show create table emp_2\G on the source to get the DDL that created the table).
MySQL [test]> CREATE TABLE `emp_2` (
  `id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
  `x` varchar(500) NOT NULL,
  `y` varchar(500) NOT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=3000 DEFAULT CHARSET=utf8mb4
/*!50500 PARTITION BY RANGE COLUMNS(id)
(PARTITION p1 VALUES LESS THAN (1000) ENGINE = InnoDB,
 PARTITION p2 VALUES LESS THAN (2000) ENGINE = InnoDB,
 PARTITION p3 VALUES LESS THAN (3000) ENGINE = InnoDB) */
[root@localhost test]# ll
-rw-r----- 1 mysql mysql 98304 May 25 15:58 emp_2#P#p1.ibd
-rw-r----- 1 mysql mysql 98304 May 25 15:58 emp_2#P#p2.ibd
-rw-r----- 1 mysql mysql 98304 May 25 15:58 emp_2#P#p3.ibd
Note:
※ Constraints, character set, and so on must match; it is recommended to get the DDL with show create table t1, otherwise importing the tablespace on the new server raises error 1808.
Step 3: on the target database, discard the partitioned table's tablespace.
MySQL [test]> alter table emp_2 discard tablespace;
Query OK, 0 rows affected (0.12 sec)
[root@localhost test]# ll    # at this point the 3 partitions' .ibd files are gone
-rw-r----- 1 mysql mysql 8604 May 25 04:14 emp_2.frm
Step 4: on the source database, run FLUSH TABLES ... FOR EXPORT to lock the table and generate the .cfg metadata files, then transfer the .cfg and .ibd files to the target database.
mysql> flush tables emp_2 for export;
Query OK, 0 rows affected (0.00 sec)
[root@localhost zhangdb]# scp emp_2* root@192.168.2.100:/mysqldata/test    # copy the files to the target database
mysql> unlock tables;    # release the table lock
Step 5: on the target database, change the files' ownership, import the tablespace, and check that the data is complete.
[root@localhost test]# chown mysql:mysql emp_2#*
MySQL [test]> alter table emp_2 import tablespace;
Query OK, 0 rows affected (0.96 sec)
MySQL [test]> select count(*) from emp_2;
+----------+
| count(*) |
+----------+
|     2999 |
+----------+
1 row in set (0.63 sec)
The query above shows the partitioned table has been imported into the target database.
It is also possible to import only some of the partitions into the target database (often you do not need the whole partitioned table, only certain partitions).
To import a subset of partitions:
1) When creating the target table, create only the partitions to be imported, e.g. only the two partitions p2 and p3:
CREATE TABLE `emp_2` (
  `id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
  `x` varchar(500) NOT NULL,
  `y` varchar(500) NOT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=3000 DEFAULT CHARSET=utf8mb4
/*!50500 PARTITION BY RANGE COLUMNS(id)
(
 PARTITION p2 VALUES LESS THAN (2000) ENGINE = InnoDB,
 PARTITION p3 VALUES LESS THAN (3000) ENGINE = InnoDB) */
2) Copy only the files of the two needed partitions from the source library to the target library;
3) The remaining steps are the same as above.
 
VI. Getting partition information
1. Use the SHOW CREATE TABLE statement to view the partition clauses of a partitioned table,
   e.g. mysql> show create table e\G

2. Use the SHOW TABLE STATUS statement to check whether a table is partitioned (see the Create_options field):

mysql> show table status\G

*************************** 1. row ***************************

Name: e  Engine: InnoDB  Version: 10  Row_format: Compact  Rows: 6  Avg_row_length: 10922  Data_length: 65536  Max_data_length: 0  Index_length: 0  Data_free: 0  Auto_increment: NULL  Create_time: 2015-12-07 22:26:06  Update_time: NULL  Check_time: NULL  Collation: latin1_swedish_ci  Checksum: NULL  Create_options: partitioned  Comment:

3. Query the INFORMATION_SCHEMA.PARTITIONS table.
4. Use an EXPLAIN PARTITIONS SELECT statement to see which partitions a specific SELECT statement accesses.

VII. Partitioned table improvements in MySQL 5.7
HANDLER statements: partitioned tables support HANDLER statements starting with MySQL 5.7.1.
Index condition pushdown: partitioned tables support ICP starting with MySQL 5.7.3.
LOAD DATA: starting with MySQL 5.7, buffering is used to improve performance; each partition uses a 130 KB buffer to achieve this.
Per-partition index caching: starting with MySQL 5.7, the CACHE INDEX and LOAD INDEX INTO CACHE statements support index caching for partitions of MyISAM tables.
FOR EXPORT option (FLUSH TABLES): partitioned InnoDB tables support the FOR EXPORT option of FLUSH TABLES starting with MySQL 5.7.4.
Starting with MySQL 5.7.2, subpartitions support the ANALYZE, CHECK, OPTIMIZE, REPAIR, and TRUNCATE operations.
In Teradata, time-based partitioning is common. For example, appending the following to the end of a CREATE TABLE statement creates one partition per day for all of 2013 (to save effort, you can partition 5 or 10 years at once):

PARTITION BY RANGE_N(

Rcd_Dt BETWEEN DATE '2013-01-01' AND DATE '2013-12-31'

EACH INTERVAL '1' DAY, NO RANGE

)

Another common (and easy to master) scheme partitions on discrete string values. Instead of the RANGE_N keyword used in the time-based partitioning above, value-based partitioning uses the CASE_N keyword, for example:

PARTITION BY CASE_N(

(CASE WHEN (my_field='A') THEN (1) ELSE (0) END)=1,

(CASE WHEN (my_field='B') THEN (2) ELSE (0) END)=2,

(CASE WHEN (my_field='C') THEN (3) ELSE (0) END)=3,

NO CASE OR UNKNOWN)

Going one step further, a syntax element such as:

my_field='A'

can be modified into a form like:

SUBSTR(my_field,1,1) IN ('E','F','G')

In practice, turning full-table scans into partition scans can yield a 10x to 100x speedup at certain steps, and complex, long-running jobs can usually cut their total runtime at least in half.
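The CASE_N routing above can be sketched in plain Python; this hypothetical case_n_partition helper only illustrates how rows fall into buckets 1 through 3 or the NO CASE bucket:

```python
def case_n_partition(my_field: str) -> int:
    """Emulate the CASE_N expression above: a row goes to the first
    matching CASE branch; unmatched values fall into NO CASE (bucket 0)."""
    for bucket, value in enumerate(('A', 'B', 'C'), start=1):
        if my_field == value:
            return bucket
    return 0  # NO CASE OR UNKNOWN

print([case_n_partition(v) for v in ('A', 'B', 'C', 'X')])  # [1, 2, 3, 0]
```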
 
1. Data partitioning

In distributed programs, communication is expensive. Controlling how a dataset is partitioned across nodes can reduce network transfer and improve overall performance. If a given RDD is scanned only once, there is no point in pre-partitioning it; partitioning helps only when a dataset is reused multiple times in key-based operations such as joins.

Spark cannot explicitly control which worker node each key lands on, but it ensures that the same key always appears on the same node.

Take a join as an example: if the RDDs have not been re-partitioned by the keys, by default the join computes the hash of the keys in both datasets, sends records with equal hash values across the network to the same machine, and then joins records with the same key on that machine.
 
2. The partitionBy() operator
from pyspark import SparkContext, SparkConf

if __name__ == '__main__':
    conf = SparkConf().setMaster('local').setAppName('word_count')
    sc = SparkContext(conf=conf)
    pair_rdd = sc.parallelize([('a', 1), ('b', 10), ('c', 4), ('d', 7), ('e', 9), ('f', 10)])
    rdd_1 = pair_rdd.partitionBy(2, partitionFunc=lambda x: ord(x) % 2).persist()
    print(rdd_1.glom().collect())

Result:
rdd_1:
[[('b', 10), ('d', 7), ('f', 10)], [('a', 1), ('c', 4), ('e', 9)]]
partitionBy() applies to pair RDDs; each key of the pair RDD is passed into partitionFunc. Note that the result of partitionBy() should be persisted, otherwise every later use of the RDD repeats the partitioning, which would cancel out the benefit of partitionBy(). Defining a custom partitioner in Python is far less cumbersome than in Scala (Spark itself provides HashPartitioner and RangePartitioner).
 
3. Operations that affect partitioning

The following operators produce result RDDs with a defined partitioning:

cogroup(), groupWith(), join(), leftOuterJoin(), rightOuterJoin(), groupByKey(), reduceByKey(), combineByKey(), partitionBy(), sort(), mapValues() (if the parent RDD is partitioned), flatMapValues() (if the parent RDD is partitioned), filter() (if the parent RDD is partitioned).
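The effect of the partitionFunc used in the partitionBy() example above can be reproduced without a Spark cluster; this plain-Python sketch applies the same ord(key) % 2 routing to the same pairs:

```python
def partition_func(key: str) -> int:
    # the same function passed as partitionFunc above
    return ord(key) % 2

pairs = [('a', 1), ('b', 10), ('c', 4), ('d', 7), ('e', 9), ('f', 10)]
parts = [[], []]  # two partitions, as in partitionBy(2, ...)
for k, v in pairs:
    parts[partition_func(k)].append((k, v))
print(parts)  # matches the glom().collect() output shown earlier
```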
 
Writing to a partition
<var> = spark.sql("""
<query>
""")
<var>.write.parquet(
    path='<path>/<table>/{par_col}={par_val}'.format(par_col='<partition_column>', par_val='<partition_value>'),
    mode='overwrite')
 
Dropping a partition
For Parquet files:
import subprocess

subprocess.check_call('rm -r <path>/<table>/{par_col}={par_val}'.format(par_col='<partition_column>', par_val='<partition_value>'), shell=True)

For Hive tables:
from pyspark.sql import HiveContext

hive = HiveContext(spark.sparkContext)
hive.sql("alter table [`<db>`.]`<table>` drop [if exists] partition(<partition_column>='<partition_value>')")
 
Listing partitions:
For Hive tables:
from pyspark.sql import HiveContext

hive = HiveContext(spark.sparkContext)
hive.sql('show partitions [`<db>`.]`<table>`')
 
Showing a partition's HDFS storage path
For Parquet files:
import subprocess

subprocess.check_call('hdfs dfs -ls hdfs://<path>/<table>', shell=True)

For Hive tables:
from pyspark.sql import HiveContext

hive = HiveContext(spark.sparkContext)
hive.sql(
    "DESC FORMATTED [`<db>`.]`<table>` partition (<partition_column>='<partition_value>')")

hive.sql(
    "DESC EXTENDED [`<db>`.]`<table>` partition (<partition_column>='<partition_value>')")

hive.sql("USE `<db>`")
hive.sql(
    "SHOW TABLE EXTENDED LIKE `<table>` PARTITION(<partition_column>='<partition_value>')")
 
Adding a partition and loading partition data:
from pyspark.sql import HiveContext

hive = HiveContext(spark.sparkContext)
hive.sql(
    "alter table [`<db>`.]`<table>` add partition (<partition_column>='<partition_value>') location 'hdfs://<path>/<table>'")  # changes where the source data is stored
hive.sql(
    "load data inpath '<path>/<table>/[<partition_column>=<partition_value>]' [overwrite] into table [`<db>`.]`<table>` partition(<partition_column>='<partition_value>')")  # moves the source data into the path specified by the Hive table

Renaming a table partition
ALTER TABLE [`<db>`.]`<table>` PARTITION (<partition_column>='<partition_value_1>') RENAME TO PARTITION (<partition_column>='<partition_value_2>');
 
Repairing Hive table metadata partitions (usually placed after the table-creation and partition-manipulation statements):
from pyspark.sql import HiveContext

hive = HiveContext(spark.sparkContext)
hive.sql(
    "MSCK REPAIR TABLE [`<db>`.]`<table>`")

Submitting scripts from the shell
vim [<path>/]<file>.sql

mysql [-h <host>] [-u <user>] [-p<password>] [<database>] < [<path>/]<file>.sql

mysql [-h <host>] [-u <user>] [-p<password>]
mysql> use <database>;
mysql> source [<path>/]<file>.sql
mysql> QUIT
File content:
#!/bin/bash
/usr/bin/bteq <<EOF
.LOGON <user>,<password>
<query or data-manipulation statement 1>;
<query or data-manipulation statement 2>;
<query or data-manipulation statement 3>;
...
<query or data-manipulation statement n>;
.IF ERRORCODE <> 0 THEN .QUIT ERRORCODE
.LOGOFF;
.QUIT;
EOF

How to run it:
sh [<path>/]<file>.sh
nohup sh <file>.sh &   # run in the background; keeps running after the login session disconnects
sh <file>.sh           # run in the foreground; interrupted when the login session disconnects



File content:
vim [<path>/]<file>.sh
script_path=<script_dir>
current_date=$(date +%Y%m%d)
current_time=$(date +%H%M%S)
scripts_sql=${script_path}/sql
sql_file_content=`cat ${scripts_sql}/$1`
log_path=${scripts_sql}/log
log_file=${log_path}/$1_${current_time}.log
tdserver=<server>
dbuser=<user>
dbpass=<password>
dbinfo=${tdserver}/${dbuser},${dbpass}

bteq << END >> $log_file 2>&1
.LOGON $dbinfo
.MAXERROR 1

$sql_file_content

.LOGOFF
END

if [[ $? -ne 0 ]]; then
  echo -e "Bteq execution of $1 failed \n Please check the log in \n ${log_file}"
fi

RETCODE=$?
if [ ${RETCODE} != 0 ]; then
  echo "Please check the error log file $log_file"
  exit 1
else
  echo "Query executed successfully"
fi

cat $log_file

How to run it:
nohup sh <file>.sh <file>.sql &   # run in the background; keeps running after the login session disconnects
sh <file>.sh <file>.sql           # run in the foreground; interrupted when the login session disconnects
spark-env.sh
vim $HADOOP_HOME/conf/spark-env.sh

vim [<path>/]<file>.sh
SCRIPT_PATH=`dirname $0`

SCRIPT_FILENAME=`basename $0`
SCRIPT_PATH=`dirname $0`
SCRIPT_NAME=${SCRIPT_FILENAME%.*}
SCRIPT_FULL_PATH=$(readlink -f $0)
SCRIPT_ROOT_DIR=$(dirname ${SCRIPT_FULL_PATH})

SCALA_VERSION=<version>
SCALA_PATH=<path>
SCALA_HOME=${SCALA_PATH}/scala-${SCALA_VERSION}
SPARK_HOME=${SPARK_HOME}
SPARK_CONF_DIR=<path>
HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
HADOOP_VERSION=<version>
HBASE_CONF_DIR=<path>
STAGE_DIR=hdfs://<path>
QUEUE=${QUEUE}
JAR_DIR=${JAR_DIR:-$SCRIPT_ROOT_DIR/lib}
LOG_DIR=$SCRIPT_ROOT_DIR/logs
CONF_DIR=${CONF_DIR:-$SCRIPT_ROOT_DIR/conf}
EMAIL_SERVER=${EMAIL_SERVER:-<mail server domain or IP>}
EMAIL_FROM=${EMAIL_FROM:-<sender address>}
EMAIL_TO=${EMAIL_TO:-<recipient address>}
TODAY=`date +%Y%m%d`
THIS_HOUR=`date +%H%M`
YEAR=`date -d ${TODAY} +%Y`
MONTH=`date -d ${TODAY} +%m`
DAY=`date -d ${TODAY} +%d`
ALERT_EMAILS=<alert recipients>
SPARK_DRIVER_CORE=${SPARK_DRIVER_CORE:-2}
SPARK_DRIVER_MEMORY=${SPARK_DRIVER_MEMORY:-8G}
SPARK_EXECUTOR_MEMORY=${SPARK_EXECUTOR_MEMORY:-16G}
SPARK_EXECUTOR_CORE=${SPARK_EXECUTOR_CORE:-2}
SPARK_DEFAULT_PARALLELISM=${SPARK_DEFAULT_PARALLELISM:-150}
SPARK_YARN_TAGS=${SPARK_YARN_TAGS:-LLAMA,SLA=true,project_name=galaxi}
EXECUTOR_MEMORY_OVERHEAD=${EXECUTOR_MEMORY_OVERHEAD:-8192}
DRIVER_MEMORY_OVERHEAD=${DRIVER_MEMORY_OVERHEAD:-1024}
BROADCAST_JOIN_THRESHOLD=${BROADCAST_JOIN_THRESHOLD:-104857600}
SHUFFLE_PARTITIONS=${SHUFFLE_PARTITIONS:-6001}
SPARK_DYNAMICALLOCATION_ENABLED=${SPARK_DYNAMICALLOCATION_ENABLED:-true}
SPARK_DYNAMICALLOCATION_MINEXECUTORS=${SPARK_DYNAMICALLOCATION_MINEXECUTORS:-10}
SPARK_DYNAMICALLOCATION_MAXEXECUTORS=${SPARK_DYNAMICALLOCATION_MAXEXECUTORS:-500}
NUM_EXECUTORS=${NUM_EXECUTORS:-${SPARK_DYNAMICALLOCATION_MINEXECUTORS}}
SPARK_MEMORY_FRACTION=${SPARK_MEMORY_FRACTION:-0.6}
SPARK_SHUFFLE_SORT_BYPASSMERGETHRESHOLD=${SPARK_SHUFFLE_SORT_BYPASSMERGETHRESHOLD:-200}
SQL_OUTPUT_PARTITIONS=${SQL_OUTPUT_PARTITIONS:-200}
SPARK_SHUFFLE_SERVICE_ENABLED=${SPARK_DYNAMICALLOCATION_ENABLED}

umask 000

input=

while [ $# -ge 1 ]; do
  if [ "$1" = "-i" ]; then
    shift
    # readlink: resolve where a symbolic link points on Linux
    file=$(readlink -f $1)
    export full_name=$(dirname ${file})
    export files=`find ${full_name} -regex ".*\.\(py\)" | paste -sd, -`
    PY_NAME=`basename $1`
    input=$1
  fi
  shift
done

PY_NAME=`echo $1 | grep -o '^[^.]*'`
CUR_PATH=$(cd $(dirname $0); pwd)
echo $CUR_PATH

DEPLOY_MODE=<cluster|client>
JOB_NAME=PY_${PY_NAME}_${USER}_`date +%s`

${SPARK_HOME}/bin/spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --name ${JOB_NAME} \
  --queue ${QUEUE} \
  --executor-memory ${SPARK_EXECUTOR_MEMORY} \
  --executor-cores ${SPARK_EXECUTOR_CORE} \
  --conf spark.driver.extraJavaOptions="-Dhdp.version=${HADOOP_VERSION} -Dhadoop=${HADOOP_CONF_DIR} -Dlog4j.configuration=log4j.properties -DLOG_DIR=${LOG_DIR} -DJOB_NAME=${JOB_NAME} -DEMAIL_SERVER=${EMAIL_SERVER} -DEMAIL_FROM=${EMAIL_FROM} -DEMAIL_TO=${EMAIL_TO}" \
  --conf spark.executor.extraJavaOptions="-Dhdp.version=${HADOOP_VERSION} -XX:+PrintGCDateStamps -XX:+PrintFlagsFinal -XX:+PrintGCDetails -XX:+PrintGC -XX:+PrintGCTimeStamps" \
  --conf spark.yarn.am.extraJavaOptions="-Dhdp.version=${HADOOP_VERSION}" \
  --conf spark.sql.autoBroadcastJoinThreshold=${BROADCAST_JOIN_THRESHOLD} \
  --conf spark.driver.memory=${SPARK_DRIVER_MEMORY} \
  --conf spark.driver.cores=${SPARK_DRIVER_CORE} \
  --conf spark.driver.extraClassPath=/usr/hdp/current/hadoop-client/lib/snappy*.jar \
  --conf spark.driver.extraLibraryPath=/usr/hdp/current/hadoop-client/lib/native \
  --conf spark.executor.extraLibraryPath=/usr/hdp/current/hadoop-client/lib/native \
  --conf spark.yarn.driver.memoryOverhead=${DRIVER_MEMORY_OVERHEAD} \
  --conf spark.yarn.executor.memoryOverhead=${EXECUTOR_MEMORY_OVERHEAD} \
  --conf spark.yarn.maxAppAttempts=1 \
  --conf spark.shuffle.io.preferDirectBufs=false \
  --conf spark.driver.maxResultSize=${SPARK_DRIVER_MEMORY} \
  --conf spark.task.maxFailures=10 \
  --conf spark.network.timeout=600s \
  --conf spark.sql.shuffle.partitions=${SHUFFLE_PARTITIONS} \
  --conf spark.yarn.stagingDir=${STAGE_DIR} \
  --conf spark.hadoop.yarn.timeline-service.enabled=false \
  --conf spark.dynamicAllocation.enabled=${SPARK_DYNAMICALLOCATION_ENABLED} \
  --conf spark.dynamicAllocation.minExecutors=${SPARK_DYNAMICALLOCATION_MINEXECUTORS} \
  --conf spark.dynamicAllocation.maxExecutors=${SPARK_DYNAMICALLOCATION_MAXEXECUTORS} \
  --conf spark.dynamicAllocation.executorIdleTimeout=3600s \
  --conf spark.dynamicAllocation.schedulerBacklogTimeout=600s \
  --conf spark.yarn.tags=LLAMA,SLA=true,project_name=<project name> \
  --num-executors ${NUM_EXECUTORS} \
  --conf spark.memory.fraction=${SPARK_MEMORY_FRACTION} \
  --conf spark.shuffle.sort.bypassMergeThreshold=${SPARK_SHUFFLE_SORT_BYPASSMERGETHRESHOLD} \
  --conf sql.output.partitions=${SQL_OUTPUT_PARTITIONS} \
  --conf spark.shuffle.service.enabled=${SPARK_SHUFFLE_SERVICE_ENABLED} \
  --conf spark.yarn.access.hadoopFileSystems=hdfs://<namenode> \
  --files ${files},${HADOOP_CONF_DIR}/hdfs-site.xml,${SPARK_CONF_DIR}/hive-site.xml,${CONF_DIR}/zookeeper.properties,${CONF_DIR}/dragon.keytab,${CONF_DIR}/graph.json,${CONF_DIR}/table.json \
  [--principal <account>@<KDC realm> \
   --keytab <path>/<keytab file>.keytab \]
  --archives <path>/<packaged Python virtualenv>.zip \
  --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=/usr/bin/python3 \
  $input

Running the shell:
sh [<path>/]<file>.sh -i <file>.py

Shell script content:
vim [<path>/]<file>.sh

for i in "$@"
do
case $i in
    -i=*|--input=*)
    INPUT="${i#*=}"
    shift # past argument=value
    ;;
    *)
    ;;
esac
done

echo "input sql path=${INPUT}"

command="sh [<path>/]<file>.sh -i ${INPUT}"
echo ${command}
eval $command 2>&1
res=$?
if [ "${res}X" = "0X" ]; then
  echo "INFO: Ran PySpark file successfully"
else
  echo "ERROR: Running PySpark failed; please check the log for details"
fi
return $res

Running the shell:
sh [<path>/]<file>.sh -i=<file>.py

readlink:

readlink finds where a symbolic link points on Linux.
Example 1:
readlink -f /usr/bin/awk
Result:
/usr/bin/gawk   # /usr/bin/awk is a soft link pointing to gawk
Example 2:
readlink -f /home/software/log
/home/software/log   # if there is no link, the file's own absolute path is shown

Getting the current script's path:
path.sh:
#!/bin/bash
path=$(cd `dirname $0`; pwd)
echo $path
path2=$(dirname $0)
echo $path2
With the script stored in /home/software:
sh path.sh
/home/software
.

Explanation:
dirname $0 gets the script's path relative to the current directory;
cd `dirname $0`; pwd first cds into that path, then pwd prints it as an absolute path.

Method 2:
path.sh:
#!/bin/bash
path=$(dirname $0)
path2=$(readlink -f $path)
echo $path2
sh path.sh
/home/software
Explanation:
readlink -f $path: if $path contains no link, its own absolute path is shown.

Comparing the path-getting methods:
path.sh:
#!/bin/bash
PATH1=$(dirname $0)
PATH2=$(cd `dirname $0`; pwd)
PATH3=$(readlink -f $PATH1)
echo $PATH1
echo $PATH2
echo $PATH3
With the script stored in /home/software:
sh path.sh
.
/home/software
/home/software
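The same path resolution can be done from Python, which is handy when a launcher needs to locate its own directory; a minimal sketch using os.path on a POSIX system (the script path below is hypothetical):

```python
import os

script = "/home/software/path.sh"        # hypothetical script location
rel_dir = os.path.dirname(script)        # like `dirname $0`
abs_dir = os.path.realpath(rel_dir)      # like `readlink -f`, resolves symlinks
print(rel_dir)  # /home/software
```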
Running code interactively from the shell command line
mysql [-h <host>] [-u <user>] [-p<password>] [<database>]
mysql> <query or data-manipulation statement 1>;

mysql> <query or data-manipulation statement 2>;

mysql> <query or data-manipulation statement 3>;

...

mysql> <query or data-manipulation statement n>;
mysql> QUIT
/usr/bin/bteq -c UTF8
.LOGON <user>,<password>

<query or data-manipulation statement 1>;

<query or data-manipulation statement 2>;

<query or data-manipulation statement 3>;

...

<query or data-manipulation statement n>;

.IF ERRORCODE <> 0 THEN .QUIT ERRORCODE

.LOGOFF;
.QUIT;



BTEQ (Basic Teradata Query) is the front-end tool shipped with Teradata for submitting SQL queries. BTEQ commands must begin with a period and need not end with a semicolon.

Common BTEQ report formatting settings:
.SET DEFAULTS: reset the output format to the defaults.
.SET ECHOREQ ON/OFF: whether to echo SQL requests and BTEQ commands into the report output.
.SET FOLDLINE ON 1: display field 1 on the first line and the remaining field values on the second line.
.SET FOOTING [NULL|'string']: define a page footer; may contain &DATE, &TIME, &PAGE, &n.
.SET FORMAT ON/OFF: when set to OFF, BTEQ ignores the FOOTING, FORMCHAR, RTITLE, HEADING, PAGEBREAK and related settings.
.SET HEADING [NULL|'string']: define a page header, analogous to FOOTING.
.SET NULL AS 'string': change the default NULL display value from a question mark.
.SET OMIT ON/OFF [n|ALL]: exclude the specified fields from the report, including headers and footers.
.SET PAGEBREAK ON/OFF [n|ALL]: insert a page break and start a new page whenever the specified field's value changes.
.SET PAGELENGTH n: define the page length; the default is 55 lines.
.SET RTITLE ['string']: define a title at the top of the page; the date and page number are included automatically.
.SET SEPARATOR ['string'|n]: define the separator between fields; n means n spaces.
.SET SUPPRESS ON/OFF [n|ALL]: replace consecutive repeated values of the specified fields with blanks.
.SET SKIPLINE ON/OFF [n|ALL]: insert a blank line when the specified field's value changes.
.SET SKIPDOUBLE ON/OFF [n|ALL]: insert two blank lines when the specified field's value changes.
.SET UNDERLINE ON/OFF [n|ALL]: underline the specified fields in each output line.
.SET WIDTH n: set the report width; the default is 75.


Suppose you log on to the Teradata database DEMO (the name must be defined in the HOSTS file) with the user name SQL01: type .logon demo/sql01. The command to exit Teradata is .logoff; to exit BTEQ, use .quit. To run a UNIX command from within BTEQ, run .os xxxx.

BTEQ can run interactively or in batch mode. BTEQ output can be saved to a file and standard output restored afterwards: .export file=xxxx and .export reset.

When writing a BTEQ script you can insert multi-line comments or use single-line comments:
.SET SESSION TRANSACTION ANSI;
-- in a script file, the user and password must be supplied in the logon command
.LOGON sql01,sql01;
SELECT ... FROM ... WHERE ...;
.QUIT;
Save the script as testScript and run it with .run file=testScript.
bin/pyspark
>>>

>>>

>>>



>>>
>>> exit()

hive
hive>

hive>

hive>


hive>

hive> exit

Importing and exporting CSV files
Import:
LOAD DATA INFILE '<path>/<file>.csv' INTO TABLE [`<db>`.]`<table>` FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"' LINES TERMINATED BY '\n';
Common parameters:
FIELDS TERMINATED BY ',': specifies the field separator.
OPTIONALLY ENCLOSED BY '"': treats double-quoted content as a single field; when converting Excel to CSV, fields containing special characters (commas and the like) are automatically wrapped in double quotes.
LINES TERMINATED BY '\n': specifies the line separator; note that files created on Windows may use '\r\n' as the separator.

Export:
mysql> SELECT <col1>, <col2>, <col3>, ..., <coln> INTO OUTFILE '<path>/<file>.csv'
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n'
FROM [`<db>`.]`<table>`
[WHERE <conditions>];
Import:
Edit the file:
vim <file>.in
.SET width 64000;
.SET session transaction btet;
.logmech ldap;
.logon <user>,<password>;

DATABASE <db>;

.PACK 1000
.IMPORT VARTEXT ',' FILE=<path>/<file>.csv
.REPEAT *
USING(<col1> <type1>,
<col2> <type2>,
<col3> <type3>,
...
<coln> <typen>)

insert into <db>.<table> (
<col1>,
<col2>,
<col3>,
...
<coln>
)
values
( :<col1>,
:<col2>,
:<col3>,
...
:<coln>
);
.LOGOFF;
.EXIT;

Execute the file:
bin/bteq < <file>.in

Export:
vim <file>.out
.SET SESSION TRANSACTION BTET;
.LOGON <user>,<password>;
.EXPORT FILE <path>/<file>.csv;
.SET SEPARATOR ',';
DATABASE <db>;
SELECT * FROM <table>
[WHERE <conditions>];

.LOGOFF;
.EXIT;

Execute the file:
bin/bteq < <file>.out
Import:
df = spark.read.load(
    path='<path>',
    format='csv', header=True)


Export:
df = …
df.repartition(1).write.csv(path='<path>/<table>[<partition_col>=<partition_val>].csv', header=True, sep=',', mode='overwrite')

df.write.format('com.databricks.spark.csv').save('<path>/<table>[<partition_col>=<partition_val>].csv')

df.toPandas().to_csv('<path>/<table>[<partition_col>=<partition_val>].csv')

index: whether to write the index; header: whether to write column names (True if needed):
outputpath = '<path>/<table>[<partition_col>=<partition_val>].csv'
df.to_csv(outputpath, sep=',', index=False, header=False)


# Method 1

df = spark.read.csv(r'<path>/<table>[<partition_col>=<partition_val>].csv', encoding='gbk', header=True, inferSchema=True)  # header: whether the first row of the data holds the column names; inferSchema: infer the schema automatically when no schema is specified

or:

df = spark.read.csv(r'<path>/<table>[<partition_col>=<partition_val>].csv', encoding='gbk', header=True, schema=schema)  # specify the schema

# Method 2

df = spark.read.format('csv').option('header', True).option('encoding', 'gbk').load(r'<path>/<table>[<partition_col>=<partition_val>].csv')

or:

df = spark.read.format('csv').option('encoding', 'gbk').option('header', True).load(r'<path>/<table>[<partition_col>=<partition_val>].csv', schema=schema)

# Writing CSV, for example appending data:

df.write.mode('append').option(...).option(...).format(...).save(...)

# Note: when building a CSV whose first row is data rather than column names, drop the header option.
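For small files, the same header handling can be illustrated with Python's built-in csv module, independent of Spark or pandas; a minimal sketch (the sample rows are illustrative):

```python
import csv
import io

rows = [["id", "name"], ["1", "Alice"], ["2", "Bob"]]

# Write CSV with '\n' line terminator (mirroring LINES TERMINATED BY '\n')
buf = io.StringIO()
csv.writer(buf, lineterminator="\n").writerows(rows)

# Read it back; consuming the first row mimics header=True
reader = csv.reader(io.StringIO(buf.getvalue()))
header = next(reader)
data = list(reader)
print(header, data)
```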
Permission control
Querying user permissions:
-- Global level privileges:
SELECT CONCAT(user, '@', host), delete_priv, drop_priv FROM mysql.user;

-- Table level privileges:
select CONCAT(user, '@', host), user, table_name from mysql.tables_priv;

SHOW GRANTS;
Querying user permissions (Teradata):
SELECT
    UserName,
    DatabaseName,
    TableName,
    ColumnName,
    AccessRight,
    GrantAuthority,
    GrantorName,
    AllnessFlag,
    CreatorName,
    CreateTimeStamp
FROM dbc.allrights
WHERE username = '<user_id>'
  AND databasename = '<db_name>';

Querying user permissions via a macro:
execute <db_name>.AllUserRights ('<user_name>');
UDF / macro definition
create macro <db_name>.AllUserRights (UserName char(128)) as (
locking row for access select
UserName (varchar(128)),
AccessType (varchar(128)),
RoleName (varchar(128)),
DatabaseName (varchar(128)),
TableName (varchar(128)),
ColumnName (varchar(128)),
AccessRight,
case
when accessright='AE' then 'ALTER EXTERNAL PROCEDURE'
when accessright='AF' then 'ALTER FUNCTION'
when accessright='AP' then 'ALTER PROCEDURE'
when accessright='AS' then 'ABORT SESSION'
when accessright='CA' then 'CREATE AUTHORIZATION'
when accessright='CD' then 'CREATE DATABASE'
when accessright='CE' then 'CREATE EXTERNAL PROCEDURE'
when accessright='CF' then 'CREATE FUNCTION'
when accessright='CG' then 'CREATE TRIGGER'
when accessright='CM' then 'CREATE MACRO'
when accessright='CO' then 'CREATE PROFILE'
when accessright='CP' then 'CHECKPOINT'
when accessright='CR' then 'CREATE ROLE'
when accessright='CS' then 'CREATE SERVER'
when accessright='CT' then 'CREATE TABLE'
when accessright='CU' then 'CREATE USER'
when accessright='CV' then 'CREATE VIEW'
when accessright='CZ' then 'CREATE ZONE'
when accessright='C1' then 'CREATE DATASET SCHEMA'
when accessright='D' then 'DELETE'
when accessright='DA' then 'DROP AUTHORIZATION'
when accessright='DD' then 'DROP DATABASE'
when accessright='DF' then 'DROP FUNCTION'
when accessright='DG' then 'DROP TRIGGER'
when accessright='DM' then 'DROP MACRO'
when accessright='DO' then 'DROP PROFILE'
when accessright='DP' then 'DUMP'
when accessright='DR' then 'DROP ROLE'
when accessright='DS' then 'DROP SERVER'
when accessright='DT' then 'DROP TABLE'
when accessright='DU' then 'DROP USER'
when accessright='DV' then 'DROP VIEW'
when accessright='DZ' then 'DROP ZONE'
when accessright='D1' then 'DROP DATASET SCHEMA'
when accessright='E' then 'EXECUTE'
when accessright='EF' then 'EXECUTE FUNCTION'
when accessright='GC' then 'CREATE GLOP'
when accessright='GD' then 'DROP GLOP'
when accessright='GM' then 'GLOP MEMBER'
when accessright='I' then 'INSERT'
when accessright='IX' then 'INDEX'
when accessright='MC' then 'CREATE MAP'
when accessright='MD' then 'DROP MAP'
when accessright='MR' then 'MONITOR RESOURCE'
when accessright='MS' then 'MONITOR SESSION'
when accessright='NT' then 'NONTEMPORAL'
when accessright='OD' then 'OVERRIDE DELETE POLICY'
when accessright='OI' then 'OVERRIDE INSERT POLICY'
when accessright='OP' then 'CREATE OWNER PROCEDURE'
when accessright='OS' then 'OVERRIDE SELECT POLICY'
when accessright='OU' then 'OVERRIDE UPDATE POLICY'
when accessright='PC' then 'CREATE PROCEDURE'
when accessright='PD' then 'DROP PROCEDURE'
when accessright='PE' then 'EXECUTE PROCEDURE'
when accessright='R' then 'RETRIEVE/SELECT'
when accessright='RF' then 'REFERENCES'
when accessright='RS' then 'RESTORE'
when accessright='SA' then 'SECURITY CONSTRAINT ASSIGNMENT'
when accessright='SD' then 'SECURITY CONSTRAINT DEFINITION'
when accessright='ST' then 'STATISTICS'
when accessright='SS' then 'SET SESSION RATE'
when accessright='SR' then 'SET RESOURCE RATE'
when accessright='TH' then 'CTCONTROL'
when accessright='U' then 'UPDATE'
when accessright='UU' then 'UDT Usage'
when accessright='UT' then 'UDT Type'
when accessright='UM' then 'UDT Method'
when accessright='W1' then 'WITH DATASET SCHEMA'
when accessright='ZO' then 'ZONE OVERRIDE'
else ''
end (varchar(26)) as AccessRightDesc,
GrantAuthority,
GrantorName (varchar(128)),
AllnessFlag,
CreatorName (varchar(128)),
CreateTimeStamp
from
(
select
UserName,
'User' (varchar(128)) as AccessType,
'' (varchar(128)) as RoleName,
DatabaseName,
TableName,
ColumnName,
AccessRight,
GrantAuthority,
GrantorName,
AllnessFlag,
CreatorName,
CreateTimeStamp
from dbc.allrights
where UserName = :username
and CreatorName <> :username
union all
select
Grantee as UserName,
'Member' as UR,
r.RoleName,
DatabaseName,
TableName,
ColumnName,
AccessRight,
null (char(1)) as GrantAuthority,
GrantorName,
null (char(1)) as AllnessFlag,
null (char(1)) as CreatorName,
CreateTimeStamp
from dbc.allrolerights r
join dbc.rolemembers m
on m.RoleName = r.RoleName
where Grantee = :username
union all
select
User as UserName,
m.Grantee as UR,
r.RoleName,
DatabaseName,
TableName,
ColumnName,
AccessRight,
null (char(1)) as GrantAuthority,
GrantorName,
null (char(1)) as AllnessFlag,
null (char(1)) as CreatorName,
CreateTimeStamp
from dbc.allrolerights r
join dbc.rolemembers m
on m.RoleName = r.RoleName
where m.grantee in (select rolename from dbc.rolemembers where grantee = :username)
) AllRights
order by 4,5,6,7 );
 
By querying the UserName, DatabaseName, TableName, and AccessRight columns of the dbc.allrights table, you can find out which operations a given user may perform on a given table in a given database. Before executing a SQL statement, you can check whether the current user has the privileges required by that statement; when privileges are missing, you can take measures such as attempting automatic granting (not very safe; you should revoke after execution).
Mapping of AccessRight column abbreviations (about 40 in total):
 
AccessRight  Meaning
AF           ALTER FUNCTION
AP           ALTER PROCEDURE
AS           ABORT SESSION
CD           CREATE DATABASE
CF           CREATE FUNCTION
CG           CREATE TRIGGER
CM           CREATE MACRO
CO           CREATE PROFILE
CP           CHECKPOINT
CR           CREATE ROLE
CT           CREATE TABLE
CU           CREATE USER
CV           CREATE VIEW
D            DELETE
DD           DROP DATABASE
DF           DROP FUNCTION
DG           DROP TRIGGER
DM           DROP MACRO
DO           DROP PROFILE
DP           DUMP
DR           DROP ROLE
DT           DROP TABLE
DU           DROP USER
DV           DROP VIEW
E            EXECUTE
EF           EXECUTE FUNCTION
I            INSERT
IX           INDEX
MR           MONITOR RESOURCE
MS           MONITOR SESSION
PC           CREATE PROCEDURE
PD           DROP PROCEDURE
PE           EXECUTE PROCEDURE
RO           REPLICATION OVERRIDE
R            RETRIEVE/SELECT
RF           REFERENCE
RS           RESTORE
SS           SET SESSION RATE
SR           SET RESOURCE RATE
U            UPDATE
Example SQL statement:

select username, databasename, tablename, accessright from dbc.allrights
where databasename='systemfe' and username='dbc' and tablename='opt_ras_table';
Execution result of the statement above:
*** Query completed. 12 rows found. 4 columns returned.
*** Total elapsed time was 1 second.

UserName   DatabaseName   TableName       AccessRight

DBC        SystemFe       opt_ras_table   DT
DBC        SystemFe       opt_ras_table   U
DBC        SystemFe       opt_ras_table   DG
DBC        SystemFe       opt_ras_table   RF
DBC        SystemFe       opt_ras_table   RS
DBC        SystemFe       opt_ras_table   R
DBC        SystemFe       opt_ras_table   I
DBC        SystemFe       opt_ras_table   CG
DBC        SystemFe       opt_ras_table   ST
DBC        SystemFe       opt_ras_table   DP
DBC        SystemFe       opt_ras_table   D
DBC        SystemFe       opt_ras_table   IX
The following SQL statement automatically builds the GRANT statements that grant the corresponding privileges:

SEL
TRIM(username),
TRIM(databasename),
TRIM(tablename),
'GRANT '|| CASE
WHEN AccessRight = 'AF ' THEN 'ALTER FUNCTION'
WHEN AccessRight = 'AP ' THEN 'ALTER PROCEDURE'
WHEN AccessRight = 'AS ' THEN 'ABORT SESSION'
WHEN AccessRight = 'CD ' THEN 'CREATE DATABASE'
WHEN AccessRight = 'CF ' THEN 'CREATE FUNCTION'
WHEN AccessRight = 'CG ' THEN 'CREATE TRIGGER'
WHEN AccessRight = 'CM ' THEN 'CREATE MACRO'
WHEN AccessRight = 'CO ' THEN 'CREATE PROFILE'
WHEN AccessRight = 'CP ' THEN 'CHECKPOINT'
WHEN AccessRight = 'CR ' THEN 'CREATE ROLE'
WHEN AccessRight = 'CT ' THEN 'CREATE TABLE'
WHEN AccessRight = 'CU ' THEN 'CREATE USER'
WHEN AccessRight = 'CV ' THEN 'CREATE VIEW'
WHEN AccessRight = 'D ' THEN 'DELETE'
WHEN AccessRight = 'DD ' THEN 'DROP DATABASE'
WHEN AccessRight = 'DF ' THEN 'DROP FUNCTION'
WHEN AccessRight = 'DG ' THEN 'DROP TRIGGER'
WHEN AccessRight = 'DM ' THEN 'DROP MACRO'
WHEN AccessRight = 'DO ' THEN 'DROP PROFILE'
WHEN AccessRight = 'DP ' THEN 'DUMP'
WHEN AccessRight = 'DR ' THEN 'DROP ROLE'
WHEN AccessRight = 'DT ' THEN 'DROP TABLE'
WHEN AccessRight = 'DU ' THEN 'DROP USER'
WHEN AccessRight = 'DV ' THEN 'DROP VIEW'
WHEN AccessRight = 'E ' THEN 'EXECUTE'
WHEN AccessRight = 'EF ' THEN 'EXECUTE FUNCTION'
WHEN AccessRight = 'I ' THEN 'INSERT'
WHEN AccessRight = 'IX ' THEN 'INDEX'
WHEN AccessRight = 'MR ' THEN 'MONITOR RESOURCE'
WHEN AccessRight = 'MS ' THEN 'MONITOR SESSION'
WHEN AccessRight = 'PC ' THEN 'CREATE PROCEDURE'
WHEN AccessRight = 'PD ' THEN 'DROP PROCEDURE'
WHEN AccessRight = 'PE ' THEN 'EXECUTE PROCEDURE'
WHEN AccessRight = 'RO ' THEN 'REPLICATION OVERRIDE'
WHEN AccessRight = 'R ' THEN 'RETRIEVE/SELECT'
WHEN AccessRight = 'RF ' THEN 'REFERENCE'
WHEN AccessRight = 'RS ' THEN 'RESTORE'
WHEN AccessRight = 'SS ' THEN 'SET SESSION RATE'
WHEN AccessRight = 'SR ' THEN 'SET RESOURCE RATE'
WHEN AccessRight = 'U ' THEN 'UPDATE'
END || ' ON '||TRIM(databasename)||'.'||TRIM(tablename)||' to '||TRIM(username)||';' AS Permission
FROM dbc.AllRights
WHERE DatabaseName = '<db_name>' and USERNAME = '<user_name>' AND TABLENAME = '<table_name>';
In Hive:
Authentication: verifying that a user is who they claim to be.
Authorization: verifying whether the authenticated user has permission for a given operation.
As of Hive 0.12.0, Hive only supports a simple permission-management model, which is disabled by default, so every user has the same rights: everyone is effectively a super administrator with the power to query and modify any table, which does not meet the usual security requirements of a data warehouse. Hive offers both metadata-based permission management and storage-file-level permission management; this section covers metadata permission management, i.e. enabling Hive's authorization checks through configuration:

Configuration
1. Enable authorization; once enabled, a user must be granted privileges before operating on an object:
hive.security.authorization.enabled = true

2. Privileges automatically granted to the owning user/roles when a table is created:
hive.security.authorization.createtable.owner.grants = ALL
hive.security.authorization.createtable.role.grants = admin_role:ALL
hive.security.authorization.createtable.user.grants = user1,user2:select;user3:create

3. If you hit the error "Error while compiling statement: FAILED: SemanticException The current builtin authorization in Hive is incomplete and disabled", also set:
hive.security.authorization.task.factory = org.apache.hadoop.hive.ql.parse.authorization.HiveAuthorizationTaskFactoryImpl

Role management
Create / drop a role:
create role role_name;
drop role role_name;
List roles:
show roles;
Grant privileges to a role:
grant select on database db_name to role role_name;
grant select on [table] t_name to role role_name;
Show a role's privileges:
show grant role role_name on database db_name;
show grant role role_name on [table] t_name;
Assign a role to a user:
grant role role_name to user user_name;
Revoke privileges from a role:
revoke select on database db_name from role role_name;
revoke select on [table] t_name from role role_name;
Show a user's roles:
show role grant user user_name;

Super administrator

Hive's permission system still has gaps; the most important one is the lack of a super administrator.
Out of the box, any Hive user can run GRANT/REVOKE. To add a real super administrator you must set hive.semantic.analyzer.hook to a class that implements the permission check:

hive.semantic.analyzer.hook = com.mycompany.AuthHook

Compile the code below (it depends on antlr-runtime-3.4.jar and hive-exec-0.12.0-cdh5.1.2.jar),
package it as a jar and put it on the Hive classpath (for the hive shell machine, point the HIVE_AUX_JARS_PATH environment variable in hive-env.sh at the jar's path; this configuration only takes effect for the hive shell).
Alternatively add the hive.aux.jars.path parameter in hive-site.xml (currently only local paths are supported), e.g. file:///usr/lib/hive/lib/HiveAuthHook.jar (this configuration only takes effect for HiveServer), then restart HiveServer.


package com.newland;

import org.apache.hadoop.hive.ql.parse.ASTNode;
import org.apache.hadoop.hive.ql.parse.AbstractSemanticAnalyzerHook;
import org.apache.hadoop.hive.ql.parse.HiveParser;
import org.apache.hadoop.hive.ql.parse.HiveSemanticAnalyzerHookContext;
import org.apache.hadoop.hive.ql.parse.SemanticException;
import org.apache.hadoop.hive.ql.session.SessionState;

public class AuthHook extends AbstractSemanticAnalyzerHook {
    private static String[] admin = { "root", "hadoop" };

    @Override
    public ASTNode preAnalyze(HiveSemanticAnalyzerHookContext context,
            ASTNode ast) throws SemanticException {
        switch (ast.getToken().getType()) {
        case HiveParser.TOK_CREATEDATABASE:
        case HiveParser.TOK_DROPDATABASE:
        case HiveParser.TOK_CREATEROLE:
        case HiveParser.TOK_DROPROLE:
        case HiveParser.TOK_GRANT:
        case HiveParser.TOK_REVOKE:
        case HiveParser.TOK_GRANT_ROLE:
        case HiveParser.TOK_REVOKE_ROLE:
            String userName = null;
            if (SessionState.get() != null
                    && SessionState.get().getAuthenticator() != null) {
                userName = SessionState.get().getAuthenticator().getUserName();
            }
            if (!admin[0].equalsIgnoreCase(userName)
                    && !admin[1].equalsIgnoreCase(userName)) {
                throw new SemanticException(userName
                        + " can't use ADMIN options, except " + admin[0] + ","
                        + admin[1] + ".");
            }
            break;
        default:
            break;
        }
        return ast;
    }

    public static void main(String[] args) throws SemanticException {
        String[] admin = { "admin", "root" };
        String userName = "root";
        for (String tmp : admin) {
            System.out.println(tmp);
            if (!tmp.equalsIgnoreCase(userName)) {
                throw new SemanticException(userName
                        + " can't use ADMIN options, except " + admin[0] + ","
                        + admin[1] + ".");
            }
        }
    }
}

Privileges supported by Hive:

Privilege      Meaning
ALL            All privileges
ALTER          Modify an object's metadata (table definition)
UPDATE         Modify an object's physical data (the actual rows)
CREATE         Create objects
DROP           Drop objects
INDEX          Create an index (not implemented yet)
LOCK           LOCK/UNLOCK a table when concurrency is enabled
SELECT         Query data
SHOW_DATABASE  View the list of databases
Appendix:
Looking into the Hive metastore database, you will find these tables:
Db_privs: user/role privileges on databases
Tbl_privs: user/role privileges on tables
Tbl_col_privs: user/role privileges on table columns
Roles: all created roles
Role_map: the user-to-role mapping
Row/column transposition
Pivot (long to wide):
SELECT <维度字段1>
<维度字段2>
<维度字段3>

<维度字段n>
SUM(CASE <分类字段> WHEN <分类值1> THEN <度量字段> ELSE 0 END) AS <分类值名1>
SUM(CASE <分类字段> WHEN <分类值2> THEN <度量字段> ELSE 0 END) AS <分类值名2>
SUM(CASE <分类字段> WHEN <分类值3> THEN <度量字段> ELSE 0 END) AS <分类值名3>

SUM(CASE <分类字段> WHEN <分类值n> THEN <度量字段> ELSE 0 END) AS <分类值名n>
FROM [`<架构名称>`] `<表名>`
GROUP BY <维度字段1>
<维度字段2>
<维度字段3>

<维度字段n>

SELECT <维度字段1>
<维度字段2>
<维度字段3>

<维度字段n>
<分类字段><度量字段1> <度量字段2> <度量字段3>… <度量字段n>
FROM [`<架构名称>`] `<表名>`
PIVOT(SUM(<度量字段1>) AS <度量字段1>SUM(<度量字段2>) AS <度量字段2>SUM(<度量字段3>) AS <度量字段3>…SUM(<度量字段n>) AS <度量字段n> FOR <分类字段> IN (<分类值1><分类值2><分类值3>…<分类值n>))

SET @sql NULL
SELECT
GROUP_CONCAT(DISTINCT
CONCAT(
'SUM(CASE <分类字段> WHEN '''
<分类字段>
''' THEN IFNULL(<度量字段>0) ELSE 0 END) AS `'
<分类字段> '`'
)
) INTO @sql
FROM
(
SELECT DISTINCT <分类字段>
FROM [`<架构名称>`] `<表名>`
ORDER BY <分类字段>
) T

SET @sql
CONCAT('SELECT <维度字段1>
<维度字段2>
<维度字段3>

<维度字段n>

@sql
' FROM [`<架构名称>`] `<表名>`
GROUP BY <维度字段1>
<维度字段2>
<维度字段3>

<维度字段n>')

PREPARE stmt FROM @sql
EXECUTE stmt
DEALLOCATE PREPARE stmt
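The MySQL block above builds the pivot SQL dynamically (GROUP_CONCAT over the distinct category values, then a prepared statement). The string-building step can be sketched in plain Python; the column names `dt`, `city`, `amt` and table `sales` below are hypothetical, for illustration only:

```python
# Build a dynamic pivot query: one SUM(CASE ...) column per distinct category value,
# mirroring the GROUP_CONCAT/PREPARE pattern above.
def build_pivot_sql(table, dim, cat, measure, cat_values):
    cases = ",\n  ".join(
        "SUM(CASE {c} WHEN '{v}' THEN IFNULL({m},0) ELSE 0 END) AS `{v}`"
        .format(c=cat, v=v, m=measure)
        for v in sorted(cat_values)  # deterministic column order
    )
    return "SELECT {d},\n  {case}\nFROM {t}\nGROUP BY {d}".format(
        d=dim, case=cases, t=table)

sql = build_pivot_sql("sales", "dt", "city", "amt", {"bj", "sh"})
print(sql)
```

In practice the distinct category values would come from a `SELECT DISTINCT` query, exactly as the inner subquery does in the SQL version.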

Pivot with category values concatenated into strings:
SELECT <维度字段1>
<维度字段2>
<维度字段3>

<维度字段n>
GROUP_CONCAT(TRIM(<分类字段>)) AS <分类字段>, GROUP_CONCAT(CAST(<度量字段> AS CHAR(30))) AS <度量字段>
FROM [`<架构名称>`] `<表名>`
GROUP BY <维度字段1>
<维度字段2>
<维度字段3>

<维度字段n>

Unpivot (wide to long):
SELECT <维度字段1>
<维度字段2>
<维度字段3>

<维度字段n>
<分类字段><度量字段> from [`<架构名称>`] `<表名>`
UNPIVOT
(<度量字段> FOR <分类字段> IN (<分类值1><分类值2><分类值3>…<分类值n>))

Split comma-separated values into rows:
SELECT <维度字段1>
<维度字段2>
<维度字段3>

<维度字段n>
substring_index(substring_index(a.<带逗号数字段>, ',', b.help_topic_id + 1), ',', -1)
FROM [`<架构名称>`.] `<表名>` a
JOIN mysql.help_topic b
ON b.help_topic_id < (length(a.<带逗号数字段>) - length(replace(a.<带逗号数字段>, ',', '')) + 1)
ORDER BY <维度字段1>
<维度字段2>
<维度字段3>

<维度字段n>
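The join against mysql.help_topic above is a workaround for MySQL's lack of a split-to-rows function. The same result is trivial outside SQL, which is handy for validating the query; a minimal sketch with made-up rows:

```python
# Split each row's comma-separated field into multiple rows,
# mirroring what the substring_index(substring_index(...)) trick computes.
rows = [("a", "1,2,3"), ("b", "4")]  # (key, comma-separated field)

exploded = [
    (key, part)
    for key, csv_field in rows
    for part in csv_field.split(",")
]
print(exploded)  # [('a', '1'), ('a', '2'), ('a', '3'), ('b', '4')]
```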
Pivot (long to wide):
SELECT <维度字段1>
<维度字段2>
<维度字段3>

<维度字段n>
<分类字段><度量字段1> <度量字段2> <度量字段3>… <度量字段n>
FROM [<架构名称>] <表名>
PIVOT(SUM(<度量字段1>) AS <度量字段1>SUM(<度量字段2>) AS <度量字段2>SUM(<度量字段3>) AS <度量字段3>…SUM(<度量字段n>) AS <度量字段n> for <分类字段> IN (<分类值1><分类值2><分类值3>…<分类值n>))

Pivot with category values concatenated into strings:
SELECT <维度字段1>
<维度字段2>
<维度字段3>

<维度字段n>
CAST(tdstats.udfconcat(TRIM(<分类字段>)) AS VARCHAR(500)) AS <分类字段>, tdstats.udfconcat(CAST(<度量字段> AS VARCHAR(500))) AS <度量字段>
FROM [<架构名称>] <表名>
GROUP BY 123…n

SELECT <维度字段1>
<维度字段2>
<维度字段3>

<维度字段n>
TRIM(TRAILING ',' FROM (XMLAGG(<分类字段> || ',') (VARCHAR(500)))) AS <分类字段>, TRIM(TRAILING ',' FROM (XMLAGG(CAST(<度量字段> AS VARCHAR(500)) || ','))) AS <度量字段>
FROM [<架构名称>] <表名>
GROUP BY 123…n

Unpivot (wide to long):
SELECT <维度字段1>
<维度字段2>
<维度字段3>

<维度字段n>
<分类字段><度量字段1> <度量字段2> <度量字段3>… <度量字段n>
FROM [<架构名称>] <表名> UNPIVOT [{INCLUDE|EXCLUDE} NULLS] (
(<度量字段1> <度量字段2> <度量字段3>… <度量字段n>)
FOR <分类字段> IN (
(<分类值1> <分类值2> <分类值3>…<分类值n>) AS '<分类值123…n名>'
(<分类值n+1> <分类值n+2> <分类值n+3>…<分类值2n>) AS '<分类值n+1n+2n+3…2n名>'
(<分类值2n+1> <分类值2n+2> <分类值2n+3>…<分类值3n>) AS '<分类值2n+12n+22n+3…3n名>'

(<分类值mn+1> <分类值mn+2> <分类值mn+3>…<分类值(m+1)n>) AS '<分类值mn+1mn+2mn+3…(m+1)n名>'
)
) T

Split comma-separated values into rows:
USE [<架构名称>]

SELECT A* FROM TABLE (strtok_split_to_table( <表名><维度字段1>
<表名><维度字段2>
<表名><维度字段3>

<表名><维度字段n>
<表名><带逗号数字段> '')
RETURNS (<维度字段名1> <维度字段类型1>
<维度字段名2> <维度字段类型2>
<维度字段名3> <维度字段类型3>

<维度字段名n> <维度字段类型n>
<带逗号数字段名>_num integer <带逗号数字段名> varchar(100) character set unicode) ) AS A
ORDER BY 123…n
In PySpark:
Pivot (long to wide):
import pyspark.sql.functions as func

<表名>_df = …
<表名>_df.groupBy('<分类字段>') \
    .pivot('项目', ['<度量字段名1>', '<度量字段名2>', '<度量字段名3>', …, '<度量字段名m>']) \
    .agg(func.sum('<度量字段>')) \
    .fillna(0)

Pivot with category values collected into lists:
from pyspark.sql import functions as func
df.groupby('<分类字段>').agg(func.collect_set('<度量字段1>'), func.collect_list('<度量字段2>'), func.collect_list('<度量字段3>'), …, func.collect_list('<度量字段m>'))

Unpivot (wide to long):
spark.sql('''
SELECT <维度字段1>,
<维度字段2>,
<维度字段3>,
…
<维度字段n>,
stack(m, '<度量字段名1>', <度量字段1>, '<度量字段名2>', <度量字段2>, '<度量字段名3>', <度量字段3>, …, '<度量字段名m>', <度量字段m>) AS (
<分类字段>,
<度量字段名>
)
FROM <表名>
[WHERE <筛选条件>]
[ORDER BY <排序字段>]
''')

<表名>_df = …
<表名>_df.selectExpr("`<分类字段>`",
    "stack(m, '<度量字段名1>', `<度量字段1>`, '<度量字段名2>', `<度量字段2>`, '<度量字段名3>', `<度量字段3>`, …, '<度量字段名m>', `<度量字段m>`) AS (`<分类字段>`, `<度量字段名>`)") \
    [.filter(<筛选条件>) \]
    [.orderBy(['年月', '项目'])]

Split comma-separated values into rows:
from pyspark.sql.functions import split, explode
<表名>_df = …
<表名>_df.withColumn('<带逗号数字段名>', explode(split('<带逗号数字段>', ',')))

In Hive:
Pivot (long to wide):
SELECT <维度字段1>
<维度字段2>
<维度字段3>

<维度字段n>
SUM(CASE <分类字段> WHEN <分类值1> THEN <度量字段> ELSE 0 END) AS <分类值名1>
SUM(CASE <分类字段> WHEN <分类值2> THEN <度量字段> ELSE 0 END) AS <分类值名2>
SUM(CASE <分类字段> WHEN <分类值3> THEN <度量字段> ELSE 0 END) AS <分类值名3>

SUM(CASE <分类字段> WHEN <分类值n> THEN <度量字段> ELSE 0 END) AS <分类值名n>
FROM [<架构名称>] <表名>
[WHERE <筛选条件>]
GROUP BY <维度字段1>
<维度字段2>
<维度字段3>

<维度字段n>

Pivot with category values concatenated into strings:
SELECT <维度字段1>
<维度字段2>
<维度字段3>

<维度字段n>
concat_ws(',', collect_set(TRIM(<分类字段>))) AS <分类字段>, concat_ws(',', collect_set(CAST(<度量字段> AS STRING))) AS <度量字段>
FROM [<架构名称>] <表名>
GROUP BY <维度字段1>
<维度字段2>
<维度字段3>

<维度字段n>

Unpivot (wide to long):
SELECT <维度字段1>
<维度字段2>
<维度字段3>

<维度字段n>
<别名>.<字段名>
FROM [<架构名称>.] <表名>
LATERAL VIEW explode(<度量字段>) <别名> AS <字段名>

Split comma-separated values into rows:
SELECT <维度字段1>
<维度字段2>
<维度字段3>

<维度字段n>
<别名>.<带逗号数字段名>
FROM [<架构名称>.] <表名>
LATERAL VIEW explode(split(<带逗号数字段>, ',')) <别名> AS <带逗号数字段名>
Data sampling
SELECT AVG(<度量字段>) FROM [`<架构名称>`.] `<表名>` GROUP BY <键字段> DIV 10000
This samples <键字段> in blocks of 10000 and takes each block's average; AVG() can be swapped for MAX(), MIN(), SUM(), COUNT() or any other aggregate.
<表名>_dfsample(withReplacementboolean fractiondoubleseedlong)

The sample operator draws a random sample and takes 3 parameters:

withReplacement: whether sampled rows are put back; true means sampling with replacement, so the same row may be drawn more than once.

fraction: how much to draw, a double between 0 and 1; e.g. 0.3 draws about 30% of the rows.

seed: the random seed used for the draw. Normally only the first two parameters are passed; the seed is mostly for debugging, when you need a reproducible sample to tell whether the program or the data is at fault.
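The withReplacement distinction is easy to see outside Spark with Python's random module: random.sample draws without replacement, random.choices draws with replacement. A minimal sketch:

```python
import random

random.seed(0)  # fixed seed, same role as the sample operator's seed parameter
data = list(range(10))

without_repl = random.sample(data, 3)   # no duplicates possible
with_repl = random.choices(data, k=3)   # duplicates possible

print(without_repl, with_repl)
```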

Randomly sampling rows whose column value meets a condition: PySpark's DataFrame.sample method cannot select rows based on a column's value. Suppose the DataFrame is:

+---+----+------+-------------+------+
| id|code|   amt|flag_outliers|result|
+---+----+------+-------------+------+
|  1|   a|  10.9|            0|   0.0|
|  2|   b|  20.7|            0|   0.0|
|  3|   c|  30.4|            0|   1.0|
|  4|   d| 40.98|            0|   1.0|
|  5|   e| 50.21|            0|   2.0|
|  6|   f|  60.7|            0|   2.0|
|  7|   g|  70.8|            0|   2.0|
|  8|   h| 80.43|            0|   3.0|
|  9|   i| 90.12|            0|   3.0|
| 10|   j|100.65|            0|   3.0|
+---+----+------+-------------+------+
The goal is to sample 1 row (a fixed number) for each of the values 0, 1, 2, 3 in the result column, ending with:

+---+----+-----+-------------+------+
| id|code|  amt|flag_outliers|result|
+---+----+-----+-------------+------+
|  1|   a| 10.9|            0|   0.0|
|  3|   c| 30.4|            0|   1.0|
|  5|   e|50.21|            0|   2.0|
|  8|   h|80.43|            0|   3.0|
+---+----+-----+-------------+------+
Is there a good programmatic way to achieve this, i.e. take the same number of rows for each distinct value of some column? Any help is much appreciated.


Solution
Use sampleBy(), which returns a stratified sample; note that it takes a fraction per stratum rather than an absolute count, so the desired counts have to be converted into fractions.

>>> from pyspark.sql.functions import col
>>> dataset = sqlContext.range(0, 100).select((col("id") % 3).alias("result"))
>>> sampled = dataset.sampleBy("result", fractions={0: 0.1, 1: 0.2}, seed=0)
>>> sampled.groupBy("result").count().orderBy("result").show()

+------+-----+
|result|count|
+------+-----+
|     0|    5|
|     1|    9|
+------+-----+
SELECT * FROM [<架构名称>.] <表名> SAMPLE 1000   -- sample 1000 rows
SELECT * FROM [<架构名称>.] <表名> SAMPLE 0.25   -- sample 25% of the rows
Temporary tables
CREATE TEMPORARY TABLE `<表名>` (
<字段名1> <字段类型1> [NOT NULL] [DEFAULT <默认值1>]
<字段名2> <字段类型2> [NOT NULL] [DEFAULT <默认值2>]
<字段名3> <字段类型3> [NOT NULL] [DEFAULT <默认值3>]

<字段名n> <字段类型n> [NOT NULL] [DEFAULT <默认值4>]
)

INSERT INTO `<表名>`
<查询语句>

CREATE TEMPORARY TABLE `<表名>` AS
(
<查询语句>
)
CREATE {MULTISET|SET} [GLOBAL] TEMPORARY TABLE <表名> AS
(
<查询语句>
)
WITH NO DATA
ON COMMIT PRESERVE ROWS

CREATE VOLATILE TABLE <表名> LOG
(
<字段名1> <字段类型1>
<字段名2> <字段类型2>
<字段名3> <字段类型3>

<字段名n> <字段类型n>
)
ON COMMIT PRESERVE ROWS

INSERT INTO <表名>
<查询语句>
<表名>_df = spark.sql("""
<查询语句>
""")

<表名>_df.registerTempTable("<表名>")

<表名>_df = spark.read.load(
    path='<存储路径>/<表名>',
    format='parquet', header=True)

<表名>_df.registerTempTable("<表名>")

from pyspark.sql import HiveContext, SparkSession

# Enable Hive support when initializing the SparkContext
# For command-line testing, raise the debug-output field limit to 100 fields
spark = SparkSession.builder.appName("name").config(
    "spark.debug.maxToStringFields", 100).enableHiveSupport().getOrCreate()
hive = HiveContext(spark.sparkContext)
<表名>_df = hive.sql("""
<查询语句>""")

<表名>_df.registerTempTable("<表名>")
Date/time conversion
1. Format conversion
Convert a datetime such as '2019-01-22 15:45:06' into a unix timestamp:
SELECT UNIX_TIMESTAMP(<时间日期字段>) FROM [`<架构名称>`.] `<表名>`

String to date (with format):
SELECT str_to_date('<年><月><日>','%Y%m%d') FROM [`<架构名称>`.] `<表名>`

Date to string (with format):
SELECT DATE_FORMAT(<时间日期字段>,'%Y-%m-%d %H:%i:%s') FROM [`<架构名称>`.] `<表名>`

2. Time zone conversion
Add N hours:
SELECT DATE_ADD(<时间日期字段>, INTERVAL N HOUR) from [`<架构名称>`.] `<表名>`

Subtract N hours:
SELECT DATE_SUB(<时间日期字段>, INTERVAL N HOUR) from [`<架构名称>`.] `<表名>`

Convert between time zones:
SELECT CONVERT_TZ(<时间日期字段>,'+08:00','+01:00') from [`<架构名称>`.] `<表名>`

1.1 Current date + time: now()
Besides now(), MySQL has these equivalent functions:
current_timestamp(), current_timestamp
localtime(), localtime
localtimestamp(), localtimestamp
All of these date/time functions are equivalent to now(); since now() is short and easy to remember, prefer it over the functions listed above.
 
1.2 Current date + time: sysdate()
sysdate() is similar to now(), with one difference: now() is fixed at the moment execution starts, while sysdate() is evaluated dynamically each time it runs.
 
2. Current date: curdate()
Two date functions equivalent to curdate(): current_date(), current_date.
 
3. Current time: curtime()
Two time functions equivalent to curtime(): current_time(), current_time.
 
4. Current UTC date/time: utc_date(), utc_time(), utc_timestamp()
China is in time zone UTC+8, so local time = UTC time + 8 hours. The UTC functions are very useful when the business spans multiple countries and time zones.
 
II. MySQL date/time Extract functions
1. Selecting parts of a datetime: date, time, year, quarter, month, week, day, hour, minute, second, microsecond
set @dt = '2008-09-10 07:15:30.123456';
 
select date(@dt);        -- 2008-09-10
select time(@dt);        -- 07:15:30.123456
select year(@dt);        -- 2008
select quarter(@dt);     -- 3
select month(@dt);       -- 9
select week(@dt);        -- 36
select day(@dt);         -- 10
select hour(@dt);        -- 7
select minute(@dt);      -- 15
select second(@dt);      -- 30
select microsecond(@dt); -- 123456
 
2. The MySQL extract() function implements the same features:
set @dt = '2008-09-10 07:15:30.123456';
 
select extract(year from @dt);               -- 2008
select extract(quarter from @dt);            -- 3
select extract(month from @dt);              -- 9
select extract(week from @dt);               -- 36
select extract(day from @dt);                -- 10
select extract(hour from @dt);               -- 7
select extract(minute from @dt);             -- 15
select extract(second from @dt);             -- 30
select extract(microsecond from @dt);        -- 123456
select extract(year_month from @dt);         -- 200809
select extract(day_hour from @dt);           -- 1007
select extract(day_minute from @dt);         -- 100715
select extract(day_second from @dt);         -- 10071530
select extract(day_microsecond from @dt);    -- 10071530123456
select extract(hour_minute from @dt);        -- 715
select extract(hour_second from @dt);        -- 71530
select extract(hour_microsecond from @dt);   -- 71530123456
select extract(minute_second from @dt);      -- 1530
select extract(minute_microsecond from @dt); -- 1530123456
select extract(second_microsecond from @dt); -- 30123456
extract() covers everything date() and time() can do, and additionally supports composite selectors such as day_microsecond
(note that this selects everything from the day part down to microseconds, not just day and microsecond).
Its only drawback compared with the shortcut functions: a few more keystrokes.
 
3. MySQL dayof... functions: dayofweek(), dayofmonth(), dayofyear()
These return the position of the date within its week, month, and year respectively.
set @dt = '2008-08-08';
select dayofweek(@dt);  -- 6
select dayofmonth(@dt); -- 8
select dayofyear(@dt);  -- 221
The date '2008-08-08' is day 6 of its week (1 = Sunday, 2 = Monday, …, 7 = Saturday), day 8 of its month, and day 221 of its year.
 
4. MySQL week... functions: week(), weekofyear(), dayofweek(), weekday(), yearweek()
set @dt = '2008-08-08';
select week(@dt);       -- 31
select week(@dt,3);     -- 32
select weekofyear(@dt); -- 32
select dayofweek(@dt);  -- 6
select weekday(@dt);    -- 4
select yearweek(@dt);   -- 200831
week() can take a second argument (see the manual for the modes). weekofyear(), like week(), computes which week of the year a day falls in; weekofyear(@dt) is equivalent to week(@dt,3).
weekday() is similar to dayofweek() in returning the position of a day within its week; the difference is the reference convention:
weekday: (0 = Monday, 1 = Tuesday, …, 6 = Sunday); dayofweek: (1 = Sunday,
2 = Monday, …, 7 = Saturday)
yearweek() returns the year (2008) concatenated with the week position (31).
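Python has the same pitfall: datetime.weekday() counts Monday=0 like MySQL's weekday(), isoweekday() counts Monday=1, and neither matches dayofweek()'s Sunday=1 convention. A sketch for 2008-08-08 (a Friday):

```python
from datetime import date

d = date(2008, 8, 8)            # a Friday
print(d.weekday())              # 4 -> like MySQL weekday(): 0 = Monday
print(d.isoweekday())           # 5 -> ISO convention: 1 = Monday
mysql_dayofweek = d.isoweekday() % 7 + 1
print(mysql_dayofweek)          # 6 -> like MySQL dayofweek(): 1 = Sunday
```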
 
5. Weekday/month name functions: dayname(), monthname()
set @dt = '2008-08-08';
select dayname(@dt);   -- Friday
select monthname(@dt); -- August
 
6. last_day(): the last day of a date's month
select last_day('2008-02-01'); -- 2008-02-29
select last_day('2008-08-08'); -- 2008-08-31
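Python has no direct last_day(), but calendar.monthrange gives the same answer; a small sketch reproducing the two examples above:

```python
import calendar
from datetime import date

def last_day(d):
    # monthrange returns (weekday of first day, number of days in the month)
    return d.replace(day=calendar.monthrange(d.year, d.month)[1])

print(last_day(date(2008, 2, 1)))   # 2008-02-29 (leap year)
print(last_day(date(2008, 8, 8)))   # 2008-08-31
```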
 
 
III. MySQL date/time arithmetic functions
1. Adding an interval to a date: date_add()
set @dt = now();
select date_add(@dt, interval 1 day);         -- add 1 day
select date_add(@dt, interval 1 hour);        -- add 1 hour
select date_add(@dt, interval 1 minute);
select date_add(@dt, interval 1 second);
select date_add(@dt, interval 1 microsecond);
select date_add(@dt, interval 1 week);
select date_add(@dt, interval 1 month);
select date_add(@dt, interval 1 quarter);
select date_add(@dt, interval 1 year);
select date_add(@dt, interval -1 day);        -- sub 1 day
 
MySQL's adddate() and addtime() can both be replaced by date_add(). Example of date_add() implementing addtime():
mysql> set @dt = '2008-08-09 12:12:33';
mysql> select date_add(@dt, interval '01:15:30' hour_second);
+------------------------------------------------+
| date_add(@dt, interval '01:15:30' hour_second) |
+------------------------------------------------+
| 2008-08-09 13:28:03                            |
+------------------------------------------------+
mysql> select date_add(@dt, interval '1 01:15:30' day_second);
+-------------------------------------------------+
| date_add(@dt, interval '1 01:15:30' day_second) |
+-------------------------------------------------+
| 2008-08-10 13:28:03                             |
+-------------------------------------------------+
date_add() adds 1 hour 15 minutes 30 seconds, then 1 day 1 hour 15 minutes 30 seconds, to @dt.
Suggestion: always use date_add() in place of adddate() and addtime().


 
2. Subtracting an interval from a date: date_sub()
date_sub() works the same way as date_add(), so it is not repeated here. MySQL also has subdate() and subtime(); prefer date_sub() over both.
 
3. Period functions: period_add(P,N), period_diff(P1,P2)
The P arguments are periods in YYYYMM or YYMM format; the second argument N is the number of months to add or subtract.
period_add(P,N): add or subtract N months from a period.
 
4. Date/time difference functions: datediff(date1,date2), timediff(time1,time2)
datediff(date1,date2): date1 minus date2, returning a number of days.
select datediff('2008-08-08', '2008-08-01'); -- 7
select datediff('2008-08-01', '2008-08-08'); -- -7
timediff(time1,time2): time1 minus time2, returning a time difference.
select timediff('2008-08-08 08:08:08', '2008-08-08 00:00:00'); -- 08:08:08
select timediff('08:08:08', '00:00:00');                       -- 08:08:08
Note: the two arguments of timediff(time1,time2) must have the same type.
 
IV. MySQL date and time conversion functions
1. Time ↔ seconds: time_to_sec(time), sec_to_time(seconds)
select time_to_sec('01:00:05'); -- 3605
select sec_to_time(3605);       -- '01:00:05'
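The same time ↔ seconds conversion, sketched in plain Python for cross-checking:

```python
def time_to_sec(hhmmss):
    # '01:00:05' -> 3605
    h, m, s = (int(x) for x in hhmmss.split(":"))
    return h * 3600 + m * 60 + s

def sec_to_time(seconds):
    # 3605 -> '01:00:05'
    return "%02d:%02d:%02d" % (seconds // 3600, seconds % 3600 // 60, seconds % 60)

print(time_to_sec("01:00:05"))  # 3605
print(sec_to_time(3605))        # 01:00:05
```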
 
2. Date ↔ days: to_days(date), from_days(days)
select to_days('0000-00-00'); -- 0
select to_days('2008-08-08'); -- 733627
select from_days(0);          -- '0000-00-00'
select from_days(733627);     -- '2008-08-08'
 
3. String to date: str_to_date(str, format)
select str_to_date('08/09/2008', '%m/%d/%Y');                   -- 2008-08-09
select str_to_date('08/09/08', '%m/%d/%y');                     -- 2008-08-09
select str_to_date('08.09.2008', '%m.%d.%Y');                   -- 2008-08-09
select str_to_date('08:09:30', '%h:%i:%s');                     -- 08:09:30
select str_to_date('08.09.2008 08:09:30', '%m.%d.%Y %h:%i:%s'); -- 2008-08-09 08:09:30
str_to_date(str,format) can turn a messy string into a date, or a time, depending on format; see the MySQL manual for the format specifiers.
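Python's datetime.strptime is the direct counterpart of str_to_date; the first and third MySQL examples above translate as:

```python
from datetime import datetime

# '%m/%d/%Y' plays the same role as MySQL's str_to_date format string
print(datetime.strptime('08/09/2008', '%m/%d/%Y').date())  # 2008-08-09
print(datetime.strptime('08.09.2008', '%m.%d.%Y').date())  # 2008-08-09
```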
 
4. Date/time to string: date_format(date,format), time_format(time,format)
These convert a date/time value into a string of almost any shape; they are the inverse of str_to_date(str,format).
 
5. Locale-specific format strings: get_format()
Syntax:
get_format(date|time|datetime, 'eur'|'usa'|'jis'|'iso'|'internal')
All get_format() usages:
select get_format(date,'usa');          -- '%m.%d.%Y'
select get_format(date,'jis');          -- '%Y-%m-%d'
select get_format(date,'iso');          -- '%Y-%m-%d'
select get_format(date,'eur');          -- '%d.%m.%Y'
select get_format(date,'internal');     -- '%Y%m%d'
select get_format(datetime,'usa');      -- '%Y-%m-%d %H.%i.%s'
select get_format(datetime,'jis');      -- '%Y-%m-%d %H:%i:%s'
select get_format(datetime,'iso');      -- '%Y-%m-%d %H:%i:%s'
select get_format(datetime,'eur');      -- '%Y-%m-%d %H.%i.%s'
select get_format(datetime,'internal'); -- '%Y%m%d%H%i%s'
select get_format(time,'usa');          -- '%h:%i:%s %p'
select get_format(time,'jis');          -- '%H:%i:%s'
select get_format(time,'iso');          -- '%H:%i:%s'
select get_format(time,'eur');          -- '%H.%i.%s'
select get_format(time,'internal');     -- '%H%i%s'
get_format() is rarely needed in practice.
 
6. Assembling dates/times: makedate(year,dayofyear), maketime(hour,minute,second)
select makedate(2001,31);  -- '2001-01-31'
select makedate(2001,32);  -- '2001-02-01'
select maketime(12,15,30); -- '12:15:30'
 
V. MySQL timestamp functions
1. Current timestamp: current_timestamp, current_timestamp()
2. Unix timestamp ↔ date conversion:
unix_timestamp()
unix_timestamp(date)
from_unixtime(unix_timestamp)
from_unixtime(unix_timestamp, format)
 
3. Timestamp conversion and arithmetic:
timestamp(date)                                     -- date to timestamp
timestamp(dt, time)                                 -- dt + time
timestampadd(unit, interval, datetime_expr)
timestampdiff(unit, datetime_expr1, datetime_expr2)
timestampdiff() is more powerful than datediff(), which can only compute the difference between two dates in days.
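In Python both granularities fall out of one subtraction: a timedelta carries days and seconds, covering what datediff() and timestampdiff() report; a sketch:

```python
from datetime import datetime

d1 = datetime(2008, 8, 8, 8, 8, 8)
d2 = datetime(2008, 8, 1, 0, 0, 0)
delta = d1 - d2

print(delta.days)                          # 7   -> like datediff()
print(int(delta.total_seconds() // 3600))  # 176 -> like timestampdiff(hour, ...)
```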
 
VI. MySQL time zone conversion: convert_tz(dt, from_tz, to_tz)
select convert_tz('2008-08-08 12:00:00', '+08:00', '+00:00'); -- 2008-08-08 04:00:00
Time zone conversion can also be done with date_add, date_sub, or timestampadd:
select date_add('2008-08-08 12:00:00', interval -8 hour); -- 2008-08-08 04:00:00
select date_sub('2008-08-08 12:00:00', interval 8 hour);  -- 2008-08-08 04:00:00
select timestampadd(hour, -8, '2008-08-08 12:00:00');     -- 2008-08-08 04:00:00
1. Format conversion (Teradata)
String to date:
SELECT CAST('<年><月><日>' AS DATE FORMAT 'YYYYMMDD')

Milliseconds to timestamp:
SELECT CAST(to_timestamp(CAST(1525314063000/1000 AS BIGINT)) AS DATE)  -- 2018-05-03 02:21:03.000000

Date/time to string (with format):
SELECT CAST((<时间戳字段> (FORMAT 'YYYY-MM-DDBHH:MI:SS.S(6)')) AS VARCHAR(26))

SELECT CAST((current_timestamp(6) (FORMAT 'YYYYMMDDHHMISS')) AS VARCHAR(19))  -- 20180615164201

SELECT CAST((<时间日期字段> (FORMAT 'YYYY-MM-DD')) AS VARCHAR(10))


Converting between string layouts
e.g. 9:12 / 14:45 into 09:12:00 / 14:45:00:
SELECT CASE WHEN INDEX(dt_time, ':') = 2 THEN '0'||dt_time||':00' ELSE dt_time||':00' END

e.g. 20160120 into 2016-01-20:
SELECT SUBSTRING(dt_date FROM 1 FOR 4)||'-'||SUBSTRING(dt_date FROM 5 FOR 2)||'-'||SUBSTRING(dt_date FROM 7 FOR 2)


Other Teradata date helpers:
select to_month_end(date);
select extract(year|month|day from date);
select last_day(date_or_timestamp);
select (date1 - date2) day(3) | month(3) | year(3);
select months_between(date1, date2);
select (time1 - time2) hour(3) | minute(3) | second(3);
select current_timestamp + interval '2' hour|minute|second;
select current_date + interval '2' year|month|day;
select next_day(date, 'friday');  -- 'fri' also works
select numtoyminterval(20, 'month');  -- 'year' also works
select numtoyminterval(20'month''year')
from datetime import datetime
from datetime import timedelta

# Format dates as strings
NOW = datetime.now()
TODAY = NOW.strftime("%Y%m%d")
YESTERDAY = (NOW - timedelta(days=1)).strftime("%Y%m%d")

# Parse strings into dates
d1 = str(20180301)
d2 = str(20180226)

print(type(d1))

d1 = datetime.strptime(d1, '%Y%m%d')
d2 = datetime.strptime(d2, '%Y%m%d')

# Difference in days between d1 and d2
print((d1 - d2).days)

from pyspark.sql.functions import unix_timestamp, from_unixtime

1. Get the current date

from pyspark.sql.functions import current_date

spark.range(3).withColumn('date', current_date()).show()
# +---+----------+
# | id|      date|
# +---+----------+
# |  0|2018-03-23|
# |  1|2018-03-23|
# |  2|2018-03-23|
# +---+----------+
2. Get the current date and time
from pyspark.sql.functions import current_timestamp

spark.range(3).withColumn('date', current_timestamp()).show()
# +---+--------------------+
# | id|                date|
# +---+--------------------+
# |  0|2018-03-23 17:40:...|
# |  1|2018-03-23 17:40:...|
# |  2|2018-03-23 17:40:...|
# +---+--------------------+
3. Date formatting

from pyspark.sql.functions import date_format

df = spark.createDataFrame([('2015-04-08',)], ['a'])

df.select(date_format('a', 'MM/dd/yyy').alias('date')).show()

4. String to date

from pyspark.sql.functions import to_date, to_timestamp

# 1. to date
df = spark.createDataFrame([('1997-02-28 10:30:00',)], ['t'])
df.select(to_date(df.t).alias('date')).collect()
# [Row(date=datetime.date(1997, 2, 28))]


# 2. date with time

df = spark.createDataFrame([('1997-02-28 10:30:00',)], ['t'])
df.select(to_timestamp(df.t).alias('dt')).collect()
# [Row(dt=datetime.datetime(1997, 2, 28, 10, 30))]

# with an explicit date format
df = spark.createDataFrame([('1997-02-28 10:30:00',)], ['t'])
df.select(to_timestamp(df.t, 'yyyy-MM-dd HH:mm:ss').alias('dt')).collect()
# [Row(dt=datetime.datetime(1997, 2, 28, 10, 30))]
5. Extract year, month, day from a date

from pyspark.sql.functions import year, month, dayofmonth

df = spark.createDataFrame([('2015-04-08',)], ['a'])
df.select(year('a').alias('year'),
          month('a').alias('month'),
          dayofmonth('a').alias('day')
         ).show()
6. Extract hour, minute, second

from pyspark.sql.functions import hour, minute, second
df = spark.createDataFrame([('2015-04-08 13:08:15',)], ['a'])
df.select(hour('a').alias('hour'),
          minute('a').alias('minute'),
          second('a').alias('second')
         ).show()
7. Quarter of a date

from pyspark.sql.functions import quarter

df = spark.createDataFrame([('2015-04-08',)], ['a'])
df.select(quarter('a').alias('quarter')).show()
8. Add/subtract days

from pyspark.sql.functions import date_add, date_sub
df = spark.createDataFrame([('2015-04-08',)], ['d'])
df.select(date_add(df.d, 1).alias('d_add'),
          date_sub(df.d, 1).alias('d_sub')
         ).show()
9. Add/subtract months

from pyspark.sql.functions import add_months
df = spark.createDataFrame([('2015-04-08',)], ['d'])

df.select(add_months(df.d, 1).alias('d')).show()
10. Day and month differences

from pyspark.sql.functions import datediff, months_between

# 1. difference in days
df = spark.createDataFrame([('2015-04-08','2015-05-10')], ['d1', 'd2'])
df.select(datediff(df.d2, df.d1).alias('diff')).show()

# 2. difference in months
df = spark.createDataFrame([('1997-02-28 10:30:00', '1996-10-30')], ['t', 'd'])
df.select(months_between(df.t, df.d).alias('months')).show()
11. Next given weekday

Computes the date of the next given day of the week (Mon..Sun) after a date.

from pyspark.sql.functions import next_day

# Mon, Tue, Wed, Thu, Fri, Sat, Sun
df = spark.createDataFrame([('2015-07-27',)], ['d'])
df.select(next_day(df.d, 'Sun').alias('date')).show()
12. Last day of the month

from pyspark.sql.functions import last_day

df = spark.createDataFrame([('1997-02-10',)], ['d'])
df.select(last_day(df.d).alias('date')).show()

Common Hive date format conversions
Fixed date to unix timestamp:
SELECT unix_timestamp('2016-08-16','yyyy-MM-dd')  -- 1471276800
SELECT unix_timestamp('20160816','yyyyMMdd')  -- 1471276800
SELECT unix_timestamp('2016-08-16T10:02:41Z', "yyyy-MM-dd'T'HH:mm:ss'Z'")  -- 1471312961

Convert 16/Mar/2017:12:25:01 +0800 into a normal format (yyyy-MM-dd hh:mm:ss):
SELECT from_unixtime(to_unix_timestamp('16/Mar/2017:12:25:01 +0800', 'dd/MMM/yyy:HH:mm:ss Z'))

Unix timestamp to fixed date:
SELECT from_unixtime(1471276800,'yyyy-MM-dd')  -- 2016-08-16
SELECT from_unixtime(1471276800,'yyyyMMdd')  -- 20160816
SELECT from_unixtime(1471312961)  -- 2016-08-16 10:02:41
SELECT from_unixtime(unix_timestamp('20160816','yyyyMMdd'),'yyyy-MM-dd')  -- 2016-08-16
SELECT date_format('2016-08-16','yyyyMMdd')  -- 20160816

Date part of a datetime field:
SELECT to_date('2016-08-16 10:03:01')  -- 2016-08-16
Current time:
SELECT from_unixtime(unix_timestamp(),'yyyy-MM-dd HH:mm:ss')
SELECT from_unixtime(unix_timestamp(),'yyyy-MM-dd')
Year of a date:
SELECT year('2016-08-16 10:03:01')  -- 2016
Month of a date:
SELECT month('2016-08-16 10:03:01')  -- 8
Day of a date:
SELECT day('2016-08-16 10:03:01')  -- 16
Hour:
SELECT hour('2016-08-16 10:03:01')  -- 10
Minute:
SELECT minute('2016-08-16 10:03:01')  -- 3
Second:
SELECT second('2016-08-16 10:03:01')  -- 1

Week of year for a date:
SELECT weekofyear('2016-08-16 10:03:01')  -- 33

Days between end date and start date:
SELECT datediff('2016-08-16','2016-08-11')  -- 5

Start date plus days:
SELECT date_add('2016-08-16',10)

Start date minus days:
SELECT date_sub('2016-08-16',10)

Three ways to get today:
SELECT CURRENT_DATE
-- 2017-06-15
SELECT CURRENT_TIMESTAMP  -- also returns hour/minute/second
-- 2017-06-15 19:54:44
SELECT from_unixtime(unix_timestamp())
-- 2017-06-15 19:55:04
Current timestamp:
SELECT current_timestamp  -- 2018-06-18 10:37:53.278

First day of the month:
SELECT trunc('2016-08-16','MM')  -- 2016-08-01
First day of the year:
SELECT trunc('2016-08-16','YEAR')  -- 2016-01-01

df = spark.createDataFrame(
    [("11/25/1991",), ("11/24/1991",), ("11/30/1991",)],
    ['date_str']
)

df2 = df.select(
    'date_str',
    from_unixtime(unix_timestamp('date_str', 'MM/dd/yyy')).alias('date')
)

print(df2)
#DataFrame[date_str: string, date: timestamp]

df2.show(truncate=False)
#+----------+-------------------+
#|date_str  |date               |
#+----------+-------------------+
#|11/25/1991|1991-11-25 00:00:00|
#|11/24/1991|1991-11-24 00:00:00|
#|11/30/1991|1991-11-30 00:00:00|
#+----------+-------------------+

Update (1/10/2018):

Since Spark 2.2+ the best way is to use to_date or to_timestamp, both of which support a format argument. From the docs:

>>> df = spark.createDataFrame([('1997-02-28 10:30:00',)], ['t'])
>>> df.select(to_timestamp(df.t, 'yyyy-MM-dd HH:mm:ss').alias('dt')).collect()
[Row(dt=datetime.datetime(1997, 2, 28, 10, 30))]

santon answered 2020-01-02T13:17:32Z

37 votes

from datetime import datetime
from pyspark.sql.functions import col, udf
from pyspark.sql.types import DateType

# Creation of a dummy dataframe
df1 = sqlContext.createDataFrame([("11/25/1991","11/24/1991","11/30/1991"),
        ("11/25/1391","11/24/1992","11/30/1992")], schema=['first', 'second', 'third'])

# Setting a user defined function:
# This function converts the string cell into a date
func = udf(lambda x: datetime.strptime(x, '%m/%d/%Y'), DateType())

df = df1.withColumn('test', func(col('first')))

df.show()

df.printSchema()

Output:

+----------+----------+----------+----------+
|     first|    second|     third|      test|
+----------+----------+----------+----------+
|11/25/1991|11/24/1991|11/30/1991|1991-01-25|
|11/25/1391|11/24/1992|11/30/1992|1391-01-17|
+----------+----------+----------+----------+

root
 |-- first: string (nullable = true)
 |-- second: string (nullable = true)
 |-- third: string (nullable = true)
 |-- test: date (nullable = true)

Hugo Reyes answered 2020-01-02T13:17:54Z

22 votes

The strptime() approach works, but a cleaner solution is a cast:

from pyspark.sql.types import DateType

spark_df1 = spark_df.withColumn("record_date", spark_df['order_submitted_date'].cast(DateType()))

#below is the result

spark_df1.select('order_submitted_date','record_date').show(10, False)

+---------------------+-----------+
|order_submitted_date |record_date|
+---------------------+-----------+
|2015-08-19 12:54:16.0|2015-08-19 |
|2016-04-14 13:55:50.0|2016-04-14 |
|2013-10-11 18:23:36.0|2013-10-11 |
|2015-08-19 20:18:55.0|2015-08-19 |
|2015-08-20 12:07:40.0|2015-08-20 |
|2013-10-11 21:24:12.0|2013-10-11 |
|2013-10-11 23:29:28.0|2013-10-11 |
|2015-08-20 16:59:35.0|2015-08-20 |
|2015-08-20 17:32:03.0|2015-08-20 |
|2016-04-13 16:56:21.0|2016-04-13 |
+---------------------+-----------+

Frank answered 2020-01-02T13:18:14Z

7 votes

In the update of the accepted answer there is no example using the to_date function, so another solution with it would be:

from pyspark.sql import functions as F

df = df.withColumn(
    'new_date',
    F.to_date(
        F.unix_timestamp('STRINGCOLUMN', 'MM-dd-yyyy').cast('timestamp')))

Manrique answered 2020-01-02T13:18:35Z

1 votes

Try this:

df = spark.createDataFrame([('2018-07-27 10:30:00',)], ['Date_col'])

df.select(from_unixtime(unix_timestamp(df.Date_col, 'yyyy-MM-dd HH:mm:ss')).alias('dt_col'))

df.show()

+-------------------+
|           Date_col|
+-------------------+
|2018-07-27 10:30:00|
+-------------------+

Vishwajeet Pol answered 2020-01-02T13:18:55Z

1 votes

Possibly not the best answer, but I want to share my code, which may help someone.

from pyspark.sql import SparkSession
from pyspark.sql.functions import to_date

spark = SparkSession.builder.appName("Python Spark SQL basic example")\
    .config("spark.some.config.option", "some-value").getOrCreate()

df = spark.createDataFrame([('2019-06-22',)], ['t'])

df1 = df.select(to_date(df.t, 'yyyy-MM-dd').alias('dt'))

print(df1)

print(df1.show())

Output

DataFrame[dt: date]

+----------+
|        dt|
+----------+
|2019-06-22|
+----------+

To get a datetime instead of a date, replace to_date with to_timestamp in the code above.
PySpark code skeleton
# -*- coding: utf-8 -*-
from pyspark.sql import HiveContext, SparkSession

# Enable Hive support when initializing the SparkContext
# For command-line testing, raise the debug-output field limit to 100 fields
spark = SparkSession.builder.appName("name").config(
    "spark.debug.maxToStringFields", 100).enableHiveSupport().getOrCreate()
# Initialize the HiveContext
hive = HiveContext(spark.sparkContext)
# Enable cross joins in SparkSQL
spark.conf.set("spark.sql.crossJoin.enabled", "true")

# Read parquet data
# Parquet is a columnar storage format for analytical workloads, developed jointly by Twitter and Cloudera;
# on AWS the parquet files are stored on S3
# (S3 = Simple Storage Service, AWS's object storage service)
df1 = spark.read.load(
    path='<存储路径>',
    format='parquet', header=True)

# Read CSV data
# CSV is used here as the standard for manually exchanged files:
# the format is simple, and since numbers are stored as strings, precision is preserved
df2 = spark.read.load(
    path='<存储路径>',
    format='csv', header=True)

# Read data from a Hive table or view
df3 = hive.sql("""
select
*
from <数库名>.<表名>""")

# Cache a dataset that will be used more than once (Spark optimization rule #1),
# so that repeated references later in the PySpark code do not re-read the same files
df4 = spark.read.load(
    path='<存储路径>',
    format='parquet', header=True).cache()

# Register the datasets just loaded so they can be referenced in SparkSQL queries
df1.createOrReplaceTempView("DF1")

df2.createOrReplaceTempView("DF2")

df3.createOrReplaceTempView("DF3")

df4.createOrReplaceTempView("DF4")

# Build an intermediate SparkSQL dataset
# If the data volume is large and the business logic complex, persist the intermediate data
# to disk so that later SparkSQL statements referencing this dataset do not rerun its
# computation, saving compute resources (Spark optimization rule #2)
df5 = spark.sql("""
SELECT
...
from DF1 AS D1
LEFT JOIN DF2 AS D2
ON ...
LEFT JOIN DF4 AS D4
ON ...
WHERE ...
""").persist()
# count is an action operator and triggers a spark-submit event, so the preceding
# persist() takes effect immediately; without the count(), persist() would only take
# effect at a later action or at the end of the program
df5.count()
df5.createOrReplaceTempView("DF5")

# Build the final SparkSQL dataset
df6 = spark.sql("""
SELECT
...
from DF5 AS D5
LEFT JOIN DF3 AS D3
ON ...
LEFT JOIN DF4 AS D4
ON ...
WHERE ...
""")

# Write the result dataset to parquet files
df6.write.parquet(
    path='<存储路径>',
    mode="overwrite")

# Release the disk cache
df5.unpersist()

# Stop the SparkContext
spark.stop()


PySpark: export data from MySQL to parquet files
from pyspark.sql import SparkSession

spark = SparkSession.Builder().getOrCreate()
url = "jdbc:mysql://<ip>:[<端口号>]?useTimezone=false&serverTimezone=UTC"
mysql_df = spark.read.jdbc(url=url, table="<查询语句>", properties={"user": "<户名>", "password": "<密码>", "database": "<选择数库>"})
mysql_df.write.parquet(
    path='<存储路径>/<表名>',
    mode="overwrite")

spark.stop()

PySpark: export data from Teradata to parquet files
from pyspark.sql import SparkSession

spark = SparkSession.Builder().getOrCreate()
url = "jdbc:teradata://{ip}/" \
      "DATABASE={database}," \
      "DBS_PORT={dbs_port}," \
      "LOGMECH=LDAP," \
      "CHARSET={ASCII|UTF8}," \
      "COLUMN_NAME=ON," \
      "MAYBENULL=ON".format(ip=<ip>, database=<database>, dbs_port=<端口号>)
teradata_df = spark.read.jdbc(url=url,
    table="<查询语句>",
    properties={
        "user": "<户名>",
        "password": "<密码>",
        "driver": "com.teradata.jdbc.TeraDriver"
    })
teradata_df.write.parquet(
    path='<存储路径>/<表名>',
    mode="overwrite")

spark.stop()

PySpark: write parquet data into a Hive table
from pyspark.sql import SparkSession
# Enable dynamic partitioning
spark.sql("set hive.exec.dynamic.partition.mode = nonstrict")
spark.sql("set hive.exec.dynamic.partition = true")

# Write into a partitioned table with ordinary HiveSQL
spark.sql("""
insert overwrite table aida_aipurchase_dailysale_hive
partition (saledate)
select productid, propertyid, processcenterid, saleplatform, sku, poa, salecount, saledate
from szy_aipurchase_tmp_szy_dailysale distribute by saledate
""")

# Or rebuild the partitioned table each time
jdbcDF.write.mode("overwrite").partitionBy("saledate").insertInto("aida_aipurchase_dailysale_hive")
jdbcDF.write.saveAsTable("aida_aipurchase_dailysale_hive", None, "append", partitionBy='saledate')

# Without partitions, simply import into the Hive table
jdbcDF.write.saveAsTable("aida_aipurchase_dailysale_for_ema_predict", None, "overwrite", None)

PySpark: read a HiveSQL query and write the result to parquet files
from pyspark.sql import HiveContext, SparkSession

spark = SparkSession.builder.appName("<配置名称>").config("spark.debug.maxToStringFields", 100).enableHiveSupport().getOrCreate()

hive = HiveContext(spark.sparkContext)
df = hive.sql("""
...""")
df.write.parquet("<存储路径>/<文件名>",
    mode='overwrite')

PySpark: sample a DataFrame and save it as a CSV file
from pyspark.sql import SparkSession

spark = SparkSession.Builder().getOrCreate()

df = …

df.sample(False, 0.1, 22345).repartition(1).write.csv(path='<存储路径>/<文件名>.csv', header=True, sep=",", mode='overwrite')

PySpark: connect to MySQL and insert data
<表名>_df = spark.createDataFrame([(<值1>,<值2>,<值3>,…,<值n>),(<值n+1>,<值n+2>,<值n+3>,…,<值2n>),(<值2n+1>,<值2n+2>,<值2n+3>,…,<值3n>),…,(<值mn+1>,<值mn+2>,<值mn+3>,…,<值(m+1)n>)], ['<字段名1>','<字段名2>','<字段名3>',…,'<字段名n>'])

<表名>_df.write.jdbc(url=url, table="<表名>", mode='append', properties={"user": "<户名>", "password": "<密码>", "database": "<选择数库>"})

PySpark: connect to Teradata and insert data
<表名>_df = spark.createDataFrame([(<值1>,<值2>,<值3>,…,<值n>),(<值n+1>,<值n+2>,<值n+3>,…,<值2n>),(<值2n+1>,<值2n+2>,<值2n+3>,…,<值3n>),…,(<值mn+1>,<值mn+2>,<值mn+3>,…,<值(m+1)n>)], ['<字段名1>','<字段名2>','<字段名3>',…,'<字段名n>'])
url = "jdbc:teradata://{ip}/" \
      "DATABASE={database}," \
      "DBS_PORT={dbs_port}," \
      "LOGMECH=LDAP," \
      "CHARSET={ASCII|UTF8}," \
      "COLUMN_NAME=ON," \
      "MAYBENULL=ON".format(ip=<ip>, database=<database>, dbs_port=<端口号>)
<表名>_df.write.jdbc(url=url,
    table="<表名>", mode='append',
    properties={
        "user": "<户名>",
        "password": "<密码>",
        "driver": "com.teradata.jdbc.TeraDriver"
    })

PySpark: iterating over DataFrame rows
There is a configuration table (data1) and a wide data table (data2). In the configuration table, the COLUMN_NAME column holds feature names from the wide table, and NULL_PROCESS_METHON holds the missing-value handling method for that feature column. Assume there are four handling methods: drop, zero, mean, other.

Requirement:
Iterate over COLUMN_NAME in the configuration table (data1) to get the corresponding missing-value handling method (NULL_PROCESS_METHON), then apply it to the matching feature column of the wide table (data2).
Implementation:
rows = data1.collect()
cols = data1.columns
cols_len = len(data1.columns) - 1

for row in rows:
    row_data_temp = []
    for idx, col in enumerate(cols):
        row_data_temp.append(row[col])
        if idx == cols_len:
            row_data = row_data_temp
            print(row_data[4])
· 1
· 2
· 3
· 4
· 5
· 6
· 7
· 8
· 9
· 10
· 11
I hammered this out at the end of a long, foggy workday.
The next morning, with a fresh head, I reviewed the code and realized what on earth I had written the night before (facepalm).
Without further ado, I fixed it:
rows = data1.collect()

for row in rows:
    print(row[4])
· 1
· 2
· 3
· 4
Understanding collect
Calling collect() on a DataFrame returns the data in the form [ [...], [...], [...], ... ]: the outer array holds the rows (Row), and each inner array holds that row's column values. In other words, the data produced by collect() is a two-dimensional array whose elements are themselves indexable, so it can be traversed the same way as any two-dimensional structure.
Note that collect() is very expensive: it loads all of the data into the driver's memory, so it is generally only suitable for small datasets. Calling dataframe.collect() on a large dataset can easily cause an out-of-memory error; use map (or other distributed operations) instead.
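The simplification above can be sketched without a Spark cluster, using plain Python lists to stand in for the Row objects returned by collect() (the table contents and column layout here are hypothetical):

```python
# Stand-in for data1.collect(): each inner list plays the role of a Row,
# which supports positional indexing just like a list.
rows = [
    ["f1", "drop", 0, 0, 1],
    ["f2", "zero", 0, 0, 2],
    ["f3", "mean", 0, 0, 3],
]

# The verbose version rebuilt each row element by element before indexing it.
# Since a Row already supports row[i], this is all that is needed:
fifth_column = [row[4] for row in rows]
print(fifth_column)  # [1, 2, 3]
```

The inner loop in the original code only re-created data that was already directly indexable, which is why removing it changes nothing about the output.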

PySpark: move a Parquet file directory
import subprocess

subprocess.check_call(
    'mv <path1>/<table>[/<partition column>=<partition value>] <path2>/<table>[/<partition column>=<partition value>]', shell=True)

PySpark: copy a Parquet file directory
import subprocess

subprocess.check_call('cp -r <path1>/<table>[/<partition column>=<partition value>] <path2>/<table>[/<partition column>=<partition value>]', shell=True)

PySpark: delete a Parquet file directory
import subprocess

subprocess.check_call('rm -r <path1>/<table>[/<partition column>=<partition value>]', shell=True)

PySpark: change the storage path a Hive table points to
from pyspark.sql import HiveContext

hive = HiveContext(spark.sparkContext)
hive.sql("alter table [`<db>`.]`<table>` set location '<path>/<table>[/<partition column>=<partition value>]'")

PySpark: list files under an HDFS path
import subprocess

subprocess.check_call('$HADOOP_HOME/bin/hdfs dfs -ls hdfs://<path>/<table>', shell=True)

PySpark: show the size of a plain Hive table (GB)
import subprocess

subprocess.check_call("$HADOOP_HOME/bin/hadoop fs -du hdfs://<path>/<table> | awk '{ SUM += $1 } END { print SUM/(1024*1024*1024) }'", shell=True)

PySpark: show the size of a partitioned Hive table (GB)
import subprocess

subprocess.check_call("$HADOOP_HOME/bin/hadoop fs -ls hdfs://<path>/<table>/<partition column>=<partition value> | awk -F ' ' '{print $5}' | awk '{ a += $1 } END { print a/(1024*1024*1024) }'", shell=True)

PySpark: show the size of each subdirectory under an HDFS directory
import subprocess

subprocess.check_call('hdfs dfs -du -h hdfs://<path>', shell=True)

PySpark: call Sqoop to import data from HDFS into a Hive table
import subprocess
from pyspark.sql import HiveContext

hive = HiveContext(spark.sparkContext)

hive.sql('drop table if exists [`<db>`.]`<table>` purge')

hive.sql("""
CREATE TABLE [`<db>`.]`<table>`(
  `<col1>` <type1>,
  `<col2>` <type2>,
  `<col3>` <type3>,
  ...
  `<coln>` <typen>)
[PARTITIONED BY (
  `<partition col1>` <partition type1>,
  `<partition col2>` <partition type2>,
  `<partition col3>` <partition type3>,
  ...
  `<partition coln>` <partition typen>
)]
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS PARQUET""")

subprocess.check_call('$HADOOP_HOME/bin/sqoop import --connect jdbc:mysql://<host>/<database> --username <username> --password-file /.password --table <table> --hive-import -m 1 --hive-table <hive table>', shell=True)

HiveQL: create a Hive table over Parquet files
DROP TABLE IF EXISTS `<db>`.`<table>`;
CREATE EXTERNAL TABLE [IF NOT EXISTS] `<db>`.`<table>` (
  `<col1>` <type1>,
  `<col2>` <type2>,
  `<col3>` <type3>,
  ...
  `<coln>` <typen>
)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  '<path>/<table>[/<partition column>=<partition value>]';

HiveQL: create a Hive view from a Hive table
CREATE OR REPLACE VIEW `<db>`.`<view>` AS
...

HiveQL: show Hive query results with formatted headers
set hive.cli.print.header=true;
set hive.resultset.use.unique.column.names=true;

Hive: export query results to a CSV file
hive -e "set hive.cli.print.header=true; <query>" | sed 's/[\t]/,/g' > <filename>.csv

set hive.cli.print.header=true outputs the header row;
sed 's/[\t]/,/g' replaces each \t with a comma;
the shell redirection writes what would be printed to the file instead.

HiveQL: list Hive tables
List tables in the current database:
show tables;
List tables in a given database:
show tables in <database>;
List tables in the current database matching a keyword:
show tables '*<keyword>*';
List tables in a given database matching a keyword:
show tables in <database> '*<keyword>*';

HiveQL: list Hive databases
show databases;

Shell: run an HQL script with a date parameter
vim <filename>.sh
#!/bin/bash

hql_path=$1
if [ ! $2 ]; then
    batch_date=$(TZ=Asia/Shanghai date -d @`date +%s` +%Y%m%d)  # today's date in China time
else
    batch_date=$2
fi

delete_partition_date=`date -d "$batch_date -7 day" +%Y%m%d`  # drop data older than 7 days
hql_file_name=`basename $hql_path`
log_file_name=${hql_file_name%.*}_`date +%s`.log
log_file_path=log/$log_file_name
partition_date=`date -d "$batch_date" +%Y%m%d`  # date in YYYYMMDD format

hive --hivevar current_date=$partition_date --hivevar filter_date=$batch_date --hivevar drop_date=$delete_partition_date -f $hql_path >>$log_file_path 2>&1

vim <filename>.hql
Full overwrite into the Hive table:
INSERT OVERWRITE TABLE <db>.<table> partition(<date partition column (YYYYMMDD)>='${current_date}')
SELECT
  <col1>,
  <col2>,
  <col3>,
  ...
  <coln>
FROM <db>.<table>;

alter table <db>.<table> drop partition (<date partition column (YYYYMMDD)> <= '${drop_date}');

Incremental write into the Hive table:
INSERT OVERWRITE TABLE <db>.<table> partition(<date partition column (YYYYMMDD)>='${current_date}')
SELECT
  <col1>,
  <col2>,
  <col3>,
  ...
  <coln>
FROM <db>.<table>
WHERE substr(<record creation timestamp column>, 1, 10) = '${filter_date}' OR substr(<record modification timestamp column>, 1, 10) = '${filter_date}';

How to run:
sh <filename>.sh <file>.hql [<date YYYYMMDD>]

HiveQL: point a view at a given day's table data
vim <filename>.hql

drop view <db>.<view>;
CREATE VIEW <db>.<view>
AS
SELECT
  <col1>,
  <col2>,
  <col3>,
  ...
  <coln>
FROM <db>.<table>
WHERE <date partition column (YYYYMMDD)> = '${current_date}';

HiveQL: change the storage files a Hive table points to
alter table <db>.<table>
set location
'hdfs://<path>/<table>';

Shell: remove data from HDFS
$HADOOP_HOME/bin/hadoop fs -rm -r /<table>

$HADOOP_HOME/bin/hadoop fs -rm /<table>/part-m-*

Shell: view data on HDFS
$HADOOP_HOME/bin/hadoop fs -cat /<table>/part-m-*

Sqoop: list the databases in MySQL
$HADOOP_HOME/bin/sqoop list-databases --connect jdbc:mysql://<host>/ --username <username> --password <password>

$HADOOP_HOME/bin/sqoop list-databases --connect jdbc:mysql://<host>/ --username <username> -P
Enter password:

echo -n <password> > <dir>/.password
chmod 400 .password
$HADOOP_HOME/bin/sqoop list-databases --connect jdbc:mysql://<host>/ --username <username> --password-file file://<dir>/.password

echo -n <password> > <dir>/.password
chmod 400 .password
$HADOOP_HOME/bin/hadoop dfs -put <dir>/.password /
$HADOOP_HOME/bin/hadoop dfs -chmod 400 /.password
rm <dir>/.password
rm: remove write-protected regular file `.password'? y
$HADOOP_HOME/bin/sqoop list-databases --connect jdbc:mysql://<host>/ --username <username> --password-file /.password

Sqoop: import a MySQL database into HDFS
Default target directory:
$HADOOP_HOME/bin/sqoop import --connect 'jdbc:mysql://<host>/<database>?useUnicode=true&characterEncoding=utf-8' --username <username> --password-file /.password --table <table> [-m <parallelism>] [--fields-terminated-by '<delimiter>']

Specified target directory:
sqoop import --connect 'jdbc:mysql://<host>/<database>?useUnicode=true&characterEncoding=utf-8' --username <username> --password-file /.password --table <table> --target-dir <parent dir>/<table> --delete-target-dir [-m <parallelism>] [--fields-terminated-by '<delimiter>']

Incremental import:
$HADOOP_HOME/bin/sqoop import --connect 'jdbc:mysql://<host>/<database>?useUnicode=true&characterEncoding=utf-8' --username <username> --password-file /.password --table <table> --target-dir <parent dir>/<table> --append --check-column '<key column>' --incremental append --last-value <last imported key value> [-m <parallelism>] [--fields-terminated-by '<delimiter>']

Incremental import as a saved job:
$HADOOP_HOME/bin/sqoop job --create job_import_<table> -- import --connect 'jdbc:mysql://<host>/<database>?useUnicode=true&characterEncoding=utf-8' --username <username> --password-file /.password --table <table> --target-dir <parent dir>/<table> --append --check-column '<key column>' --incremental append --last-value <last imported key value> [-m <parallelism>] [--fields-terminated-by '<delimiter>']
$HADOOP_HOME/bin/sqoop job --exec job_import_<table>

Delete a job:
$HADOOP_HOME/bin/sqoop job --delete job_import_<table>

List current jobs:
$HADOOP_HOME/bin/sqoop job --list

Show a specific job's details:
$HADOOP_HOME/bin/sqoop job --show job_import_<table>

Full import with gzip compression:
$HADOOP_HOME/bin/sqoop import --connect 'jdbc:mysql://<host>/<database>?useUnicode=true&characterEncoding=utf-8' --username <username> --password-file /.password --table <table> --target-dir <parent dir>/<table> --delete-target-dir -z [-m <parallelism>] [--fields-terminated-by '<delimiter>']

Full import replacing nulls with a given string:
$HADOOP_HOME/bin/sqoop import --connect 'jdbc:mysql://<host>/<database>?useUnicode=true&characterEncoding=utf-8' --username <username> --password-file /.password --table <table> --target-dir <parent dir>/<table> --delete-target-dir --null-non-string '<replacement>' --null-string '<replacement>' [-m <parallelism>] [--fields-terminated-by '<delimiter>']

Full import of selected columns:
$HADOOP_HOME/bin/sqoop import --connect 'jdbc:mysql://<host>/<database>?useUnicode=true&characterEncoding=utf-8' --username <username> --password-file /.password --table <table> --columns <col1>,<col2>,<col3>,...,<coln> --target-dir <parent dir>/<table> --delete-target-dir [-m <parallelism>] [--fields-terminated-by '<delimiter>']

Full import with a filter condition:
$HADOOP_HOME/bin/sqoop import --connect 'jdbc:mysql://<host>/<database>?useUnicode=true&characterEncoding=utf-8' --username <username> --password-file /.password --table <table> --where '<condition>' --target-dir <parent dir>/<table> --delete-target-dir [-m <parallelism>] [--fields-terminated-by '<delimiter>']

Full import from a query:
$HADOOP_HOME/bin/sqoop import --connect 'jdbc:mysql://<host>/<database>?useUnicode=true&characterEncoding=utf-8' --username <username> --password-file /.password --query '<query>' --split-by <table in query>.{<primary key>|<foreign key>} --target-dir <parent dir>/<table> --delete-target-dir [-m <parallelism>] [--fields-terminated-by '<delimiter>']

Full import with a specified output file format:
$HADOOP_HOME/bin/sqoop import --connect 'jdbc:mysql://<host>/<database>?useUnicode=true&characterEncoding=utf-8' --username <username> --password-file /.password --table <table> --target-dir <parent dir>/<table> --as-sequencefile --delete-target-dir [-m <parallelism>] [--fields-terminated-by '<delimiter>']

Full import with a specified number of map tasks:
$HADOOP_HOME/bin/sqoop import --connect 'jdbc:mysql://<host>/<database>?useUnicode=true&characterEncoding=utf-8' --username <username> --password-file /.password --table <table> --target-dir <parent dir>/<table> --num-mappers <n> --delete-target-dir [--fields-terminated-by '<delimiter>']

Incremental import with gzip compression:
$HADOOP_HOME/bin/sqoop import --connect 'jdbc:mysql://<host>/<database>?useUnicode=true&characterEncoding=utf-8' --username <username> --password-file /.password --table <table> --target-dir <parent dir>/<table> --append --check-column '<key column>' --incremental append --last-value <last imported key value> -z [-m <parallelism>] [--fields-terminated-by '<delimiter>']

Incremental import replacing nulls with a given string:
$HADOOP_HOME/bin/sqoop import --connect 'jdbc:mysql://<host>/<database>?useUnicode=true&characterEncoding=utf-8' --username <username> --password-file /.password --table <table> --target-dir <parent dir>/<table> --append --check-column '<key column>' --incremental append --last-value <last imported key value> --null-non-string '<replacement>' --null-string '<replacement>' [-m <parallelism>] [--fields-terminated-by '<delimiter>']

Incremental import of selected columns:
$HADOOP_HOME/bin/sqoop import --connect 'jdbc:mysql://<host>/<database>?useUnicode=true&characterEncoding=utf-8' --username <username> --password-file /.password --table <table> --columns <col1>,<col2>,<col3>,...,<coln> --target-dir <parent dir>/<table> --append --check-column '<key column>' --incremental append --last-value <last imported key value> [-m <parallelism>] [--fields-terminated-by '<delimiter>']

Incremental import with a filter condition:
$HADOOP_HOME/bin/sqoop import --connect 'jdbc:mysql://<host>/<database>?useUnicode=true&characterEncoding=utf-8' --username <username> --password-file /.password --table <table> --where '<condition>' --target-dir <parent dir>/<table> --append --check-column '<key column>' --incremental append --last-value <last imported key value> [-m <parallelism>] [--fields-terminated-by '<delimiter>']

Incremental import from a query:
$HADOOP_HOME/bin/sqoop import --connect 'jdbc:mysql://<host>/<database>?useUnicode=true&characterEncoding=utf-8' --username <username> --password-file /.password --query '<query>' --split-by <table in query>.{<primary key>|<foreign key>} --target-dir <parent dir>/<table> --append --check-column '<key column>' --incremental append --last-value <last imported key value> [-m <parallelism>] [--fields-terminated-by '<delimiter>']

Incremental import with a specified output file format:
$HADOOP_HOME/bin/sqoop import --connect 'jdbc:mysql://<host>/<database>?useUnicode=true&characterEncoding=utf-8' --username <username> --password-file /.password --table <table> --target-dir <parent dir>/<table> --as-sequencefile --append --check-column '<key column>' --incremental append --last-value <last imported key value> [-m <parallelism>] [--fields-terminated-by '<delimiter>']

Incremental import with a specified number of map tasks:
$HADOOP_HOME/bin/sqoop import --connect 'jdbc:mysql://<host>/<database>?useUnicode=true&characterEncoding=utf-8' --username <username> --password-file /.password --table <table> --target-dir <parent dir>/<table> --num-mappers <n> --append --check-column '<key column>' --incremental append --last-value <last imported key value> [--fields-terminated-by '<delimiter>']

Sqoop: export data from HDFS into MySQL
$HADOOP_HOME/bin/sqoop export --connect jdbc:mysql://<host>/<database> --username <username> --password-file /.password --table <table> --export-dir hdfs://<parent dir>/<table>


Data types supported by Teradata

Data types supported by MySQL
Name            Type                     Description
INT             integer                  4-byte integer; range roughly ±2.1 billion
BIGINT          long integer             8-byte integer; range roughly ±9.22*10^18
REAL            floating point           4-byte float; range roughly ±10^38
DOUBLE          floating point           8-byte float; range roughly ±10^308
DECIMAL(M,N)    high-precision decimal   user-specified precision, e.g. DECIMAL(20,10) is a 20-digit number with 10 digits after the decimal point; typically used for financial calculations
CHAR(N)         fixed-length string      stores a string of the specified length, e.g. CHAR(100) always stores 100 characters
VARCHAR(N)      variable-length string   stores a string of variable length, e.g. VARCHAR(100) can store 0-100 characters
BOOLEAN         boolean                  stores True or False
DATE            date                     stores a date, e.g. 2018-06-22
TIME            time                     stores a time, e.g. 12:20:59
DATETIME        date and time            stores date + time, e.g. 2018-06-22 12:20:59

Data types supported by Hive
1. Primitive types
Hive type   Java type   Length                                                           Example
TINYINT     byte        1-byte signed integer                                            20
SMALLINT    short       2-byte signed integer                                            20
INT         int         4-byte signed integer                                            20
BIGINT      long        8-byte signed integer                                            20
BOOLEAN     boolean     boolean, true or false                                           TRUE, FALSE
FLOAT       float       single-precision floating point                                  3.14159
DOUBLE      double      double-precision floating point                                  3.14159
STRING      string      character sequence; a charset may be specified; single or double quotes   'now is the time', "for all good men"
TIMESTAMP   -           timestamp type                                                   -
BINARY      -           byte array                                                       -

Hive's STRING type is equivalent to a database varchar: a variable-length string whose maximum length cannot be declared, but which can in theory store up to 2 GB of characters.
2. Collection types
Type     Description                                                                                                                                        Syntax example
STRUCT   like a C struct: elements are accessed with dot notation, e.g. if a column is declared as STRUCT{first STRING, last STRING}, the 1st element is referenced as field.first   struct()
MAP      a set of key-value pairs accessed with array notation, e.g. if a column is a MAP whose keys->values include 'first'->'John' and 'last'->'Doe', the last name is fetched as field['last']   map()
ARRAY    an ordered collection of variables of the same type, called the array's elements, numbered starting from zero, e.g. for the array value ['John', 'Doe'] the 2nd element is referenced as arrayname[1]   Array()
Hive has three complex data types: ARRAY, MAP and STRUCT. ARRAY and MAP resemble Java's Array and Map, while STRUCT resembles C's struct: it encapsulates a named collection of fields. Complex types allow arbitrarily deep nesting.

Parquet file storage format
Parquet is only a storage format; it is language- and platform-neutral and is not tied to any particular data-processing framework. The components that can work with Parquet include most of the common query engines and computation frameworks, and the usual serialization tools have been adapted so their data can be converted to Parquet:
· Query engines: Hive, Impala, Pig, Presto, Drill, Tajo, HAWQ, IBM Big SQL
· Computation frameworks: MapReduce, Spark, Cascading, Crunch, Scalding, Kite
· Data models: Avro, Thrift, Protocol Buffers, POJOs
Project layout
The Parquet project is made up of several sub-projects:
· parquet-format: defines the Parquet metadata objects; metadata is serialized with Apache Thrift and stored at the tail of the Parquet file.
· parquet-mr: the Java implementation, including the modules that read and write Parquet files as well as adapters for other components, e.g. Hadoop Input/Output Formats, a Hive SerDe (Hive now ships with Parquet support), Pig loaders, etc.
· parquet-compatibility: cross-language (Java, C/C++) read/write compatibility test code.
· parquet-cpp: a C++ library for reading and writing Parquet files.
The figure below shows the layered Parquet components and how they interact:

· Storage layer: defines the Parquet file format; the file metadata is defined in parquet-format, including the primitive type definitions, page types, encodings and compression codecs.
· Object conversion layer: maps and converts between an object model and Parquet's internal data model, using Parquet's encodings and the striping and assembly algorithm.
· Object model layer: defines how Parquet file content is read and converted into a given object model; includes adapters for serialization formats such as Avro, Thrift and Protocol Buffers and for the Hive SerDe. To help readers understand Parquet, the org.apache.parquet.example package implements conversion between Java objects and Parquet files.
Data model
Parquet supports a nested data model similar to that of Protocol Buffers. A data model schema contains a number of fields, each of which may in turn contain fields. Every field has three attributes: a repetition, a data type and a field name. The repetition is one of three kinds: required (occurs exactly once), repeated (occurs zero or more times), optional (occurs zero or one time). Field data types fall into two groups: group (complex types) and primitive (basic types). For example, the Document schema from the Dremel paper is defined as:
message Document {
  required int64 DocId;
  optional group Links {
    repeated int64 Backward;
    repeated int64 Forward;
  }
  repeated group Name {
    repeated group Language {
      required string Code;
      optional string Country;
    }
    optional string Url;
  }
}
The schema can be converted into a tree structure, with the root node understood as repeated, as shown below.

(figure: the Document schema as a tree)
Note that the primitive-typed fields of the schema are the leaf nodes; this schema has 6 leaves. If the schema were flattened into a relational model, it could be understood as a table with six columns. Parquet has no complex structures like Map or Array; they are expressed by combining repeated and group. How often each of the six columns can occur within one record:
DocId: int64, occurs exactly once
Links.Backward: int64, occurs any number of times; 0 occurrences must be marked with NULL
Links.Forward: int64, likewise
Name.Language.Code: string, likewise
Name.Language.Country: string, likewise
Name.Url: string, likewise
Since the table contains columns that may occur any number of times, each column must record its number of occurrences (or whether it is NULL); this is what the striping and assembly algorithm implements.
Striping and assembly algorithm
As the data model above shows, Document contains columns that are not required. In Parquet, the data of one record is scattered across its columns, and the column values must be recombined to reconstruct a record; the striping and assembly algorithm determines how. In this algorithm, each column value carries three parts: value, repetition level and definition level.
Repetition Levels
Repetition levels exist to support repeated fields: when writing, the value records at which level the current value shares a node with the preceding value; when reading, it tells from which level downward new nodes must be created. For example, given this schema and two records:
message nested {
  repeated group level1 {
    repeated string level2;
  }
}
r1: [[a, b, c], [d, e, f, g]]
r2: [[h], [i, j]]
Computing the repetition level of each value:
· value a starts a new record; it shares nothing with any preceding value (there is none) except the root node (level 0): repetition level = 0
· value b shares the level1 node with the preceding value but not the level2 node: repetition level = 2
· likewise, value c has repetition level = 2
· value d shares only the root (it belongs to the same record) with the preceding value, not the level1 node: repetition level = 1
· value h belongs to a new record and shares no node with the preceding value: repetition level = 0
Based on this analysis, each value's recorded repetition level is:
(figure: per-value repetition levels)
When reading, values are read in order and objects are created according to the repetition level: reading value a with level 0 means creating a new root node (a new record); value b with level 2 means creating a new level2 node; value d with level 1 means creating a new level1 node; once the whole column has been read, every level-0 value has started a new record. Reconstructing records from the example file yields:
(figure: the reconstructed records)
As can be seen, a repetition level of 0 marks the start of a new record. Repetition levels are computed only against the repeated nodes on a column's path; non-repeated nodes are ignored. On write, the level can be understood as the deepest repeated node shared with the preceding value; on read, as the level below which new repeated nodes must be created. Hence a column's maximum repetition level equals the number of repeated nodes on its path (excluding the root). Keeping repetition levels small allows more compact encodings and saves storage space.
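For the two-level schema above, the rules reduce to: the first value of a record gets repetition level 0, the first value of each subsequent inner group gets level 1, and further values inside a group get level 2. A minimal illustration in plain Python (a sketch of the idea, not Parquet's actual implementation):

```python
def repetition_levels(record):
    """Repetition levels for one record of the schema
    message nested { repeated group level1 { repeated string level2; } }."""
    levels = []
    for i, group in enumerate(record):       # each group is a level1 entry
        for j, _value in enumerate(group):   # each value is a level2 entry
            if i == 0 and j == 0:
                levels.append(0)  # first value of the record: a new record starts
            elif j == 0:
                levels.append(1)  # first value of a new level1 group
            else:
                levels.append(2)  # repeated at the level2 node
    return levels

r1 = [["a", "b", "c"], ["d", "e", "f", "g"]]
r2 = [["h"], ["i", "j"]]
print(repetition_levels(r1))  # [0, 2, 2, 1, 2, 2, 2]
print(repetition_levels(r2))  # [0, 1, 2]
```

The output matches the walk-through above: a gets 0, b and c get 2, d gets 1, and h starts the next record with 0.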
Definition Levels
If records can be reconstructed from repetition levels alone, why are definition levels needed? Because repeated and optional fields may be absent: some records may have no value at all for a given column. If such a record simply carried no marker, a later value could be attached to the wrong record, corrupting the data; a placeholder is needed to mark this situation.
The definition level is only really meaningful for NULL values: it records from which level of the path the value is undefined. For non-NULL values it carries no extra information, since a defined leaf implies all of its ancestors are defined, so it always equals the column's maximum definition level. Consider this schema:
message ExampleDefinitionLevel {
  optional group a {
    optional group b {
      optional string c;
    }
  }
}
It contains one column, a.b.c, and every node on the path is optional. When c is defined, a and b must be defined as well; when c is undefined, we must record from which level the path becomes undefined, giving these values:
(figure: definition levels for a.b.c)
Definition levels only need to account for nodes that can actually be undefined: when a required node's parent is defined, the node itself must be defined (for example, DocId in Document must have a value in every record, and whenever a Language node is defined, Code must have a value, since it is required). Required nodes on the path are therefore skipped when computing definition levels, which lowers the maximum definition level and optimizes storage.
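For the ExampleDefinitionLevel schema, a value's definition level is just the number of nodes on the path a.b.c that are defined; a sketch in plain Python, using None as a hypothetical marker for "undefined from this node down":

```python
def definition_level(path_values):
    """Definition level for column a.b.c, where a, b and c are all optional.
    path_values is a tuple (a, b, c); None means undefined from that node on."""
    level = 0
    for node in path_values:
        if node is None:
            break
        level += 1  # one more defined ancestor on the path
    return level

print(definition_level(("a", "b", "c")))     # 3: value defined, the column's max level
print(definition_level(("a", "b", None)))    # 2: c is NULL but b is defined
print(definition_level(("a", None, None)))   # 1: undefined from b down
print(definition_level((None, None, None)))  # 0: undefined from a down
```

If c were required instead of optional, it would be skipped in the count and the maximum definition level would drop from 3 to 2, which is exactly the storage optimization described above.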
Complete example
This section uses the Document schema above and the two records r1 and r2 defined in the Dremel paper to show how repetition and definition levels are computed. Undefined values are recorded as NULL; R denotes the repetition level and D the definition level.
(figure: r1 and r2 with per-column R and D values)
First the DocId column: in r1, DocId=10 starts the record and is defined, so R=0, D=0; likewise in r2, DocId=20 has R=0, D=0.
The Links.Backward column: in r1 it is undefined, but Links is defined and this is the first value in the record, so R=0, D=1. In r2 the column has two values: value1=10 has R=0 (the first value of this column in the record) and D=2 (the column's maximum definition level); value2=30 has R=1, D=2.
The Name.Url column has three values in r1: value1='http://A' is the first value in r1 and is defined, so R=0, D=2; value2='http://B' shares only the Name level with value1, so R=1, D=2; value3=NULL shares only the Name level with value2 and is undefined, but its Name node is defined, so R=1, D=1. In r2 the column has one value, value='http://C', with R=0, D=2.
The Name.Language.Code column has four values in r1: value1='en-us' is the first value in r1 and is defined, so R=0, D=2 (Code is a required field, so the column's maximum definition level is 2); value2='en' shares the Language node with value1, so R=2, D=2; value3=NULL is undefined and begins a new Name group whose Name node is defined, so R=1, D=1; value4='en-gb' likewise begins a new Name group, so R=1, D=2. In r2 the column's value is undefined, but Name is defined, so R=0, D=1.
Parquet file format
Parquet files are stored in binary form and cannot be read directly. A file contains both the data and its metadata, so Parquet files are self-describing and can be parsed on their own. In terms of the HDFS file system, a few concepts matter for Parquet files:
· HDFS block: the replication unit in HDFS. HDFS stores a block as a local file and maintains several replicas scattered across machines; a block is typically 256 MB, 512 MB, etc.
· HDFS file: an HDFS file comprises data and metadata, the data being scattered across multiple blocks.
· Row group: a horizontal physical partition of the rows. Each row group contains a certain number of rows, and at least one row group is stored per HDFS file. Parquet caches an entire row group in memory when reading or writing, so the row group size is bounded by available memory; when records take little space, a row group can hold more rows.
· Column chunk: within a row group, each column is saved in one column chunk. All column chunks of a row group are stored contiguously in the file. The values within a column chunk are of the same type, and different column chunks may be compressed with different algorithms.
· Page: each column chunk is divided into pages. A page is the smallest encoding unit, and different pages of the same column chunk may use different encodings.
File layout
Typically, when storing Parquet data, the row group size is set to the block size. Since a mapper task usually processes one block of data, making each row group fit one block lets each row group be handled by one mapper task, increasing job parallelism. The Parquet file layout is shown below:
(figure: Parquet file layout)
The figure shows the content of a Parquet file: a file stores the data of all of its row groups. The file begins and ends with its magic code, used to check whether it is a Parquet file; the footer length field gives the size of the file metadata, from which, together with the file length, the metadata offset can be computed. The file metadata includes the metadata of every row group and the schema of the stored data. Besides the per-row-group metadata in the file footer, each page begins with that page's own metadata. Parquet has three page types: data pages, dictionary pages and index pages. A data page stores the values of a column in the current row group; a dictionary page stores the encoding dictionary of that column's values (each column chunk contains at most one dictionary page); an index page stores an index for the column in the current row group (index pages are not yet supported in current Parquet versions and will be added in a later version).
When running MapReduce jobs with Parquet files as mapper input, a mapper's InputSplit marks the range of the file it should process; if an InputSplit crosses a row group boundary, Parquet guarantees that the row group is still processed by a single mapper task.
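The trailer described above (file metadata, then a 4-byte little-endian footer length, then the magic code "PAR1") can be located by reading the last 8 bytes of a file. A sketch that parses a fabricated byte string rather than a real Parquet file:

```python
import struct

def parse_parquet_tail(data: bytes):
    """Return (footer_length, footer_bytes) from a Parquet-style trailer:
    ... footer | 4-byte little-endian footer length | b'PAR1'."""
    if data[-4:] != b"PAR1":
        raise ValueError("not a Parquet file: bad magic code")
    (footer_len,) = struct.unpack("<I", data[-8:-4])
    # The footer sits immediately before the length field.
    return footer_len, data[-8 - footer_len:-8]

# Fabricated file: leading magic, 5 bytes standing in for the Thrift-encoded
# footer, the footer length, and the trailing magic code.
fake = b"PAR1" + b"hello" + struct.pack("<I", 5) + b"PAR1"
print(parse_parquet_tail(fake))  # (5, b'hello')
```

This is how readers find the metadata without scanning the file: seek to the end, read the length, then seek back by that length to decode the footer.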
Projection pushdown (project pushdown)
Of the advantages commonly claimed for columnar storage, projection pushdown is the most prominent: when fetching a table's data, only the columns the query needs are scanned. Because each column's values are stored contiguously in their own partition, a whole column can be read in one pass, letting the TableScan operator avoid scanning the entire file.
Parquet natively supports projection pushdown: at query time, the columns to read (which must be a subset of the schema) are passed in via Configuration. For each row group it scans, Parquet reads only the column chunks of the needed columns into memory in a single pass per row group, which reduces random reads. In addition, Parquet checks whether the requested columns are stored adjacently; if some needed columns have contiguous storage locations, one read operation fetches several columns' data at once.
Predicate pushdown
Predicate pushdown is one of the most common optimizations in database-style query systems: filter conditions are pushed down to the lowest execution layer, reducing the amount of data exchanged between layers and improving performance. For example, in
select count(1) from A Join B on A.id = B.id where A.a > 10 and B.b < 100
the naive plan first performs TableScan on A and B, then joins, then filters, then evaluates the aggregate and returns the result. If the filters A.a > 10 and B.b < 100 are instead pushed into the TableScans of A and B respectively, the input to the Join shrinks.
In both row-oriented and columnar storage, the filter can at least be evaluated against each record as it is read, deciding whether the record is returned to the caller. Parquet goes a step further: when writing each row group and column chunk it computes statistics, including the minimum value, maximum value and null count of the column chunk. With these statistics, whole row groups can be skipped when the filter rules them out. Future Parquet versions may add optimizations such as Bloom filters and indexes to make predicate pushdown even more effective.
Two usage strategies improve Parquet query performance: 1) as with a clustered key in a relational database, sort the data on frequently filtered columns before importing, to make the most of min/max-based row-group skipping; 2) reduce the row group and page sizes to increase the chance of skipping entire row groups, balancing this against the lower compression and encoding efficiency and the higher IO load it brings.
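Row-group skipping from statistics amounts to checking the predicate against each row group's (min, max) before deciding to scan it. A minimal sketch with hypothetical statistics, not Parquet's actual API:

```python
# Hypothetical per-row-group statistics for column "a": (min, max).
row_group_stats = [(1, 9), (5, 50), (60, 80)]

def may_contain(stats, lower_bound):
    """True if the row group may hold rows satisfying a > lower_bound."""
    _mn, mx = stats
    return mx > lower_bound

# Predicate a > 10: only row groups whose max exceeds 10 need scanning.
to_scan = [i for i, s in enumerate(row_group_stats) if may_contain(s, 10)]
print(to_scan)  # [1, 2]
```

This also shows why pre-sorting on the filtered column helps: sorted data gives row groups narrow, non-overlapping (min, max) ranges, so more of them can be ruled out without being read.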

Compared with traditional row-oriented storage, the columnar formats that have emerged in the Hadoop ecosystem in recent years (RC, ORC, Parquet) show their performance advantage in two ways: 1) a higher compression ratio, since data of the same type compresses more readily and an efficient encoding and compression scheme can be chosen per column type; 2) fewer IO operations, since projection pushdown and predicate pushdown cut out part of the unnecessary data scanning; this is especially pronounced for large, wide tables and brings clearly better query performance.

(figure: file sizes of two tables from the TPC-H and TPC-DS datasets in different storage formats)
As the chart shows, Parquet uses storage space more effectively than the other binary file formats, and the newer Parquet version (2.0), with its more efficient page layout, improves storage use further.

(figure: Twitter's TPC-DS benchmark of different file formats in Impala)
The test results show Parquet achieving a clear performance improvement over the row-oriented storage formats.

(figure: Criteo's TPC-DS benchmark of the ORC and Parquet columnar formats in Hive)
On storage, the results show the two formats occupying roughly the same space under snappy compression for the bulk of the data, and on queries the two perform comparably. Functionally each has pros and cons: Parquet natively supports nested data structures while ORC's support for them is weaker, so queries over such complex schemas are comparatively poor; Parquet does not support data modification or ACID while ORC provides both; ORC better suits OLAP environments where single records are rarely modified and data is mostly batch-imported.
Project development
Since Twitter and Cloudera began developing Parquet in 2012, the project has been evolving rapidly; it was contributed to the open-source community from the start. In 2013 Criteo joined the development and submitted the Hive integration patch (HIVE-5783) to the Hive community; Hive 0.13 officially added Parquet support, and ever more query engines have added support since, further driving Parquet's development.
Parquet is currently moving toward version 2.0. The new version implements a new page storage format and encoding algorithms optimized per data type; it also enriches the supported primitive types, adding Decimal, Timestamp and others, and adds richer statistics, for example Bloom filters, so that predicate pushdown can be completed at the metadata level.
Summary
This article introduced Parquet, a columnar storage system that supports nested data models. As an OLAP query optimization in big-data systems, it is already natively supported by many query engines and serves as the default file storage format of some high-performance engines. Through data encoding and compression together with projection and predicate pushdown, Parquet's performance improves markedly over other file formats; it is foreseeable that as data models grow richer and ad-hoc query demand rises, Parquet will see ever wider use.
Apache Airflow documentation

Disclaimer: Apache Airflow is an effort undergoing incubation at the Apache Software Foundation (ASF), sponsored by the Apache Incubator. Incubation is required of all newly accepted projects until a further review indicates that the infrastructure, communications, and decision-making process have stabilized in a manner consistent with other successful ASF projects. While incubation status is not necessarily a reflection of the completeness or stability of the code, it does indicate that the project has yet to be fully endorsed by the ASF.
Apache Airflow is a flexible, scalable workflow automation and scheduling system for authoring and managing data pipelines of hundreds of petabytes. The project easily orchestrates complex computational workflows; through smart scheduling, database and dependency management, error handling and logging, Airflow automates resource management from single servers to large-scale clusters. The project is written in Python and is highly extensible, able to run tasks written in other languages, and allowing integration with commonly used architectures and projects such as AWS S3, Docker, Kubernetes, MySQL, Postgres, etc.
In Airflow, a workflow is implemented as a directed acyclic graph (DAG) of tasks. The Airflow scheduler executes your tasks on a group of workers while following the specified dependencies. Rich command-line utilities make performing complex surgeries on DAGs easy, and the rich user interface makes it easy to visualize data pipelines running in production, monitor progress and troubleshoot issues when needed.
When workflows are defined as code, they become more maintainable, versionable, testable and collaborative.

As far as is known, Apache Airflow is currently used by more than 200 organizations, including Adobe, Airbnb, Astronomer, Etsy, Google, ING, Lyft, NYC City Planning, Paypal, Polidea, Qubole, Quizlet, Reddit, Reply, Solita, Square, Twitter and more.

Airflow allows workflow developers to easily create, maintain and periodically schedule workflows (as directed acyclic graphs, DAGs). At Airbnb these workflows include data warehousing, growth analytics, email sending, A/B testing and more, spanning multiple departments. The platform ships with hooks for interacting with Hive, Presto, MySQL, HDFS, Postgres and S3, offers strong extensibility and a command-line interface, and provides a web-based user interface that lets you visualize pipeline dependencies, monitor progress, trigger tasks and so on.
Airflow consists of the following components:
· a metadata database (MySQL or Postgres)
· a group of Airflow worker nodes
· a broker (Redis or RabbitMQ)
· the Airflow web server
Screenshots:


Pipeline definition example:
"""
Code that goes along with the Airflow tutorial located at:
https://github.com/airbnb/airflow/blob/master/airflow/example_dags/tutorial.py
"""
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from datetime import datetime, timedelta


default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2015, 6, 1),
    'email': ['airflow@airflow.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
    # 'queue': 'bash_queue',
    # 'pool': 'backfill',
    # 'priority_weight': 10,
    # 'end_date': datetime(2016, 1, 1),
}

dag = DAG('tutorial', default_args=default_args)

# t1, t2 and t3 are examples of tasks created by instantiating operators
t1 = BashOperator(
    task_id='print_date',
    bash_command='date',
    dag=dag)

t2 = BashOperator(
    task_id='sleep',
    bash_command='sleep 5',
    retries=3,
    dag=dag)

templated_command = """
{% for i in range(5) %}
    echo "{{ ds }}"
    echo "{{ macros.ds_add(ds, 7) }}"
    echo "{{ params.my_param }}"
{% endfor %}
"""

t3 = BashOperator(
    task_id='templated',
    bash_command=templated_command,
    params={'my_param': 'Parameter I passed in'},
    dag=dag)

t2.set_upstream(t1)
t3.set_upstream(t1)
Apache Airflow — More: https://airflow.apache.org

Principles
· Dynamic: Airflow pipelines are configured as code (Python), allowing dynamic pipeline generation; you can write code that instantiates pipelines dynamically.
· Extensible: easily define your own operators and executors, and extend the library to the level of abstraction that suits your environment.
· Elegant: Airflow pipelines are lean and explicit; parameterizing your scripts with the powerful Jinja templating engine is built into Airflow's core.
· Scalable: Airflow has a modular architecture and uses a message queue to orchestrate an arbitrary number of workers; it is ready to scale to infinity.

Airflow is not a data-streaming solution: tasks do not move data between one another (though they may exchange metadata). Airflow is not in the Spark Streaming or Storm space; it is more comparable to Oozie or Azkaban.
Workflows are expected to be mostly static or slowly changing. You can think of the task structure in your workflow as slightly more dynamic than a database schema. Airflow workflows are expected to look similar from one run to the next; this gives clarity around units of work and continuity.
Airflow installation and configuration
Airflow does not support running on Windows; on Windows, build a Docker-based environment instead.
Standard installation
Download and install Airflow
The easiest way to install the latest stable version of Airflow is with pip:
pip install apache-airflow
To install Airflow with extras that add support for features such as s3 and postgres:
pip install apache-airflow[postgres,s3]
Note
GPL dependency
One of the dependencies of Apache Airflow by default pulls in a GPL library ('unidecode').
If you are fine with the GPL dependency, export this environment variable before installing Airflow: export AIRFLOW_GPL_UNIDECODE=yes
If the GPL dependency is a concern, you can force use of the non-GPL library by setting export SLUGIFY_USES_TEXT_UNIDECODE=yes and then proceed with the normal installation. Note that this needs to be specified at every upgrade, and that if unidecode is already present on the system the dependency will still be used.
Extra packages
The apache-airflow PyPI basic package only installs what is needed to get started. Subpackages can be installed depending on what your environment needs; for example, if you don't need a Postgres connection, you can skip the hassle of installing the postgres-devel yum package (or the equivalent for the distribution you are using).
Behind the scenes, Airflow does conditional imports of operators that require these extra dependencies.
The list of subpackages and the feature each enables:
subpackage         install command                                 enables
all                pip install apache-airflow[all]                 All Airflow features known to man
all_dbs            pip install apache-airflow[all_dbs]             All databases integrations
async              pip install apache-airflow[async]               Async worker classes for Gunicorn
celery             pip install apache-airflow[celery]               CeleryExecutor
cloudant           pip install apache-airflow[cloudant]            Cloudant hook
crypto             pip install apache-airflow[crypto]              Encrypt connection passwords in metadata db
devel              pip install apache-airflow[devel]               Minimum dev tools requirements
devel_hadoop       pip install apache-airflow[devel_hadoop]        Airflow + dependencies on the Hadoop stack
druid              pip install apache-airflow[druid]               Druid related operators & hooks
gcp_api            pip install apache-airflow[gcp_api]             Google Cloud Platform hooks and operators (using google-api-python-client)
github_enterprise  pip install apache-airflow[github_enterprise]   Github Enterprise auth backend
google_auth        pip install apache-airflow[google_auth]         Google auth backend
hdfs               pip install apache-airflow[hdfs]                HDFS hooks and operators
hive               pip install apache-airflow[hive]                All Hive related operators
jdbc               pip install apache-airflow[jdbc]                JDBC hooks and operators
kerberos           pip install apache-airflow[kerberos]            Kerberos integration for Kerberized Hadoop
ldap               pip install apache-airflow[ldap]                LDAP authentication for users
mssql              pip install apache-airflow[mssql]               Microsoft SQL Server operators and hook, support as an Airflow backend
mysql              pip install apache-airflow[mysql]               MySQL operators and hook, support as an Airflow backend. The version of MySQL server has to be 5.6.4+. The exact version upper bound depends on the version of the mysqlclient package; for example, mysqlclient 1.3.12 can only be used with MySQL server 5.6.4 through 5.7.
password           pip install apache-airflow[password]            Password authentication for users
postgres           pip install apache-airflow[postgres]            PostgreSQL operators and hook, support as an Airflow backend
qds                pip install apache-airflow[qds]                 Enable QDS (Qubole Data Service) support
rabbitmq           pip install apache-airflow[rabbitmq]            RabbitMQ support as a Celery backend
redis              pip install apache-airflow[redis]               Redis hooks and sensors
s3                 pip install apache-airflow[s3]                  S3KeySensor, S3PrefixSensor
samba              pip install apache-airflow[samba]               Hive2SambaOperator
slack              pip install apache-airflow[slack]               SlackAPIPostOperator
ssh                pip install apache-airflow[ssh]                 SSH hooks and Operator
vertica            pip install apache-airflow[vertica]             Vertica hook support as an Airflow backend
Initializing the Airflow database
Airflow needs a database to be initiated before you can run tasks. If you are just experimenting with and learning Airflow, you can stick with the default SQLite option; if you don't want to use SQLite, see "Initializing a Database Backend" to set up a different database.
Once configuration is done, you need to initialize the database before you can run tasks:
airflow initdb
Airflow installation (Docker)
Step 1: install Docker
Download and install Docker from:
https://store.docker.com/editions/community/docker-ce-desktop-windows
Step 2: pull the image and start the Airflow service
The commands below pull the Airflow image and start the Airflow container.
Once the container is up, open http://localhost:8080 to access the Airflow admin console.
docker pull puckel/docker-airflow
docker run -d -p 8080:8080 -e LOAD_EX=y puckel/docker-airflow
Airflow configuration (Docker)
Inspect basic container information:
docker ps                  # list running containers
docker stop <container>    # stop a container
docker start <container>   # start a container
docker rm <container>      # remove a container
Log in to the container shell:
docker exec -it <container> /bin/bash
Download the Airflow config file from the container:
Copy the Airflow config file from the container to the current directory:
docker cp <container>:/usr/local/airflow/airflow.cfg .
Upload a DAG file to the container:
docker exec <container> mkdir /usr/local/airflow/dags   # need to create the dags folder at the first time

docker cp testing.py <container>:/usr/local/airflow/dags/testing.py
Edit airflow.cfg and start the Airflow scheduler
By default the LocalExecutor is used and the Airflow scheduler is not started; to run it, configure and start the scheduler as a background service:
Download the Airflow config file from the container (/usr/local/airflow/airflow.cfg),
change this parameter in the config file:
max_threads = 1
then start the scheduler with:
docker exec <container> airflow scheduler -D
Airflow installation (Windows 10)
Open a command prompt and run pip install apache-airflow. It errors out because the ssl module is missing; see this tutorial for the fix: "WIN10 Anaconda安装Python3 pip时报没有ssl模块错误".
After the error is resolved, run pip install apache-airflow again.

Error: Microsoft Visual C++ 14.0 is required

The cause is a version problem in the Windows environment: download and install the latest build tools from the official site. If the installation fails because the local Microsoft .NET Framework is too old, first download and install the latest version of that from the official site.
Visual C++ Build Tools 2015 (Microsoft)
From http://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted download the twisted .whl file matching your setup ("cp" plus a number marks the Python version, "amd64" means 64-bit), then run: pip install <downloaded whl filename>
If that doesn't work, install with the offline installer instead: visualcppbuildtools_full.zip
With that solved, continue installing Airflow.

Error: VS 14.0 link.exe failed with exit status 1158

The crux of the problem:
Visual Studio can't build due to rc.exe
It can be solved with these steps:
1. Add this file path to the PATH environment variable:
C:\Program Files (x86)\Windows Kits\10\bin\x64
2. Copy the files rc.exe and rcdll.dll from C:\Program Files (x86)\Windows Kits\8.1\bin\x86 to C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\bin
3. Re-run pip install apache-airflow in the command prompt.



Project
History
The Airflow project was started at Airbnb by Maxime Beauchemin in October 2014. It was open source from the first commit and was officially announced under the Airbnb Github in June 2015.
The project joined the Apache Software Foundation's incubation program in March 2016.
On January 8, 2019, the Apache Software Foundation announced that Apache Airflow had successfully graduated from incubation to become a new top-level project of the foundation.
Announcement:
https://blogs.apache.org/foundation/entry/the-apache-software-foundation-announces44
Committers
· @mistercrunch (Maxime Max Beauchemin)
· @r39132 (Siddharth Sid Anand)
· @criccomini (Chris Riccomini)
· @bolkedebruin (Bolke de Bruin)
· @artwr (Arthur Wiedmer)
· @jlowin (Jeremiah Lowin)
· @patrickleotardif (Patrick Leo Tardif)
· @aoen (Dan Davydov)
· @syvineckruyk (Steven YvinecKruyk)
· @msumit (Sumit Maheshwari)
· @alexvanboxel (Alex Van Boxel)
· @saguziel (Alex Guziel)
· @joygao (Joy Gao)
· @fokko (Fokko Driesprong)
· @ash (Ash BerlinTaylor)
· @kaxilnaik (Kaxil Naik)
· @fengtao (Tao Feng)
For the full list of contributors, see Airflow's Github contributor page:
Resources & links
· Airflow's official documentation
· Mailing list (send emails to dev-subscribe@airflow.incubator.apache.org and/or commits-subscribe@airflow.incubator.apache.org to subscribe to each)
· Issues on Apache's Jira
· Gitter (chat) channel
· More resources and links to Airflow related content on the Wiki
Roadmap
Please refer to the Roadmap on the wiki.
License

Apache License
Version 20 January 2004
httpwwwapacheorglicenses

TERMS AND CONDITIONS FOR USE REPRODUCTION AND DISTRIBUTION

1 Definitions

License shall mean the terms and conditions for use reproduction
and distribution as defined by Sections 1 through 9 of this document

Licensor shall mean the copyright owner or entity authorized by
the copyright owner that is granting the License

Legal Entity shall mean the union of the acting entity and all
other entities that control are controlled by or are under common
control with that entity For the purposes of this definition
control means (i) the power direct or indirect to cause the
direction or management of such entity whether by contract or
otherwise or (ii) ownership of fifty percent (50) or more of the
outstanding shares or (iii) beneficial ownership of such entity

You (or Your) shall mean an individual or Legal Entity
exercising permissions granted by this License

Source form shall mean the preferred form for making modifications
including but not limited to software source code documentation
source and configuration files

Object form shall mean any form resulting from mechanical
transformation or translation of a Source form including but
not limited to compiled object code generated documentation
and conversions to other media types

Work shall mean the work of authorship whether in Source or
Object form made available under the License as indicated by a
copyright notice that is included in or attached to the work
(an example is provided in the Appendix below)

Derivative Works shall mean any work whether in Source or Object
form that is based on (or derived from) the Work and for which the
editorial revisions annotations elaborations or other modifications
represent as a whole an original work of authorship For the purposes
of this License Derivative Works shall not include works that remain
separable from or merely link (or bind by name) to the interfaces of
the Work and Derivative Works thereof

Contribution shall mean any work of authorship including
the original version of the Work and any modifications or additions
to that Work or Derivative Works thereof that is intentionally
submitted to Licensor for inclusion in the Work by the copyright owner
or by an individual or Legal Entity authorized to submit on behalf of
the copyright owner For the purposes of this definition submitted
means any form of electronic verbal or written communication sent
to the Licensor or its representatives including but not limited to
communication on electronic mailing lists source code control systems
and issue tracking systems that are managed by or on behalf of the
Licensor for the purpose of discussing and improving the Work but
excluding communication that is conspicuously marked or otherwise
designated in writing by the copyright owner as Not a Contribution

Contributor shall mean Licensor and any individual or Legal Entity
on behalf of whom a Contribution has been received by Licensor and
subsequently incorporated within the Work

2 Grant of Copyright License Subject to the terms and conditions of
this License each Contributor hereby grants to You a perpetual
worldwide nonexclusive nocharge royaltyfree irrevocable
copyright license to reproduce prepare Derivative Works of
publicly display publicly perform sublicense and distribute the
Work and such Derivative Works in Source or Object form

3 Grant of Patent License Subject to the terms and conditions of
this License each Contributor hereby grants to You a perpetual
worldwide nonexclusive nocharge royaltyfree irrevocable
(except as stated in this section) patent license to make have made
use offer to sell sell import and otherwise transfer the Work
where such license applies only to those patent claims licensable
by such Contributor that are necessarily infringed by their
Contribution(s) alone or by combination of their Contribution(s)
with the Work to which such Contribution(s) was submitted If You
institute patent litigation against any entity (including a
crossclaim or counterclaim in a lawsuit) alleging that the Work
or a Contribution incorporated within the Work constitutes direct
or contributory patent infringement, then any patent licenses
granted to You under this License for that Work shall terminate
as of the date such litigation is filed.

4. Redistribution. You may reproduce and distribute copies of the
Work or Derivative Works thereof in any medium, with or without
modifications, and in Source or Object form, provided that You
meet the following conditions:

(a) You must give any other recipients of the Work or
Derivative Works a copy of this License; and

(b) You must cause any modified files to carry prominent notices
stating that You changed the files; and

(c) You must retain, in the Source form of any Derivative Works
that You distribute, all copyright, patent, trademark, and
attribution notices from the Source form of the Work,
excluding those notices that do not pertain to any part of
the Derivative Works; and

(d) If the Work includes a "NOTICE" text file as part of its
distribution, then any Derivative Works that You distribute must
include a readable copy of the attribution notices contained
within such NOTICE file, excluding those notices that do not
pertain to any part of the Derivative Works, in at least one
of the following places: within a NOTICE text file distributed
as part of the Derivative Works; within the Source form or
documentation, if provided along with the Derivative Works; or,
within a display generated by the Derivative Works, if and
wherever such third-party notices normally appear. The contents
of the NOTICE file are for informational purposes only and
do not modify the License. You may add Your own attribution
notices within Derivative Works that You distribute, alongside
or as an addendum to the NOTICE text from the Work, provided
that such additional attribution notices cannot be construed
as modifying the License.

You may add Your own copyright statement to Your modifications and
may provide additional or different license terms and conditions
for use, reproduction, or distribution of Your modifications, or
for any such Derivative Works as a whole, provided Your use,
reproduction, and distribution of the Work otherwise complies with
the conditions stated in this License.

5. Submission of Contributions. Unless You explicitly state otherwise,
any Contribution intentionally submitted for inclusion in the Work
by You to the Licensor shall be under the terms and conditions of
this License, without any additional terms or conditions.
Notwithstanding the above, nothing herein shall supersede or modify
the terms of any separate license agreement you may have executed
with Licensor regarding such Contributions.

6. Trademarks. This License does not grant permission to use the trade
names, trademarks, service marks, or product names of the Licensor,
except as required for reasonable and customary use in describing the
origin of the Work and reproducing the content of the NOTICE file.

7. Disclaimer of Warranty. Unless required by applicable law or
agreed to in writing, Licensor provides the Work (and each
Contributor provides its Contributions) on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
implied, including, without limitation, any warranties or conditions
of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
PARTICULAR PURPOSE. You are solely responsible for determining the
appropriateness of using or redistributing the Work and assume any
risks associated with Your exercise of permissions under this License.

8. Limitation of Liability. In no event and under no legal theory,
whether in tort (including negligence), contract, or otherwise,
unless required by applicable law (such as deliberate and grossly
negligent acts) or agreed to in writing, shall any Contributor be
liable to You for damages, including any direct, indirect, special,
incidental, or consequential damages of any character arising as a
result of this License or out of the use or inability to use the
Work (including but not limited to damages for loss of goodwill,
work stoppage, computer failure or malfunction, or any and all
other commercial damages or losses), even if such Contributor
has been advised of the possibility of such damages.

9. Accepting Warranty or Additional Liability. While redistributing
the Work or Derivative Works thereof, You may choose to offer,
and charge a fee for, acceptance of support, warranty, indemnity,
or other liability obligations and/or rights consistent with this
License. However, in accepting such obligations, You may act only
on Your own behalf and on Your sole responsibility, not on behalf
of any other Contributor, and only if You agree to indemnify,
defend, and hold each Contributor harmless for any liability
incurred by, or claims asserted against, such Contributor by reason
of your accepting any such warranty or additional liability.
Quick Start
The installation is quick and straightforward.
# airflow needs a home, ~/airflow is the default,
# but you can lay foundation somewhere else if you prefer
# (optional)
export AIRFLOW_HOME=~/airflow

# install from pypi using pip
pip install apache-airflow

# initialize the database
airflow initdb

# start the web server, default port is 8080
airflow webserver -p 8080

# start the scheduler
airflow scheduler

# visit localhost:8080 in the browser and enable the example dag in the home page
On running these commands, Airflow will create the AIRFLOW_HOME folder and lay an airflow.cfg file with defaults that get you going fast. You can inspect the file either in $AIRFLOW_HOME/airflow.cfg, or through the UI in the Admin->Configuration menu. The PID file for the webserver will be stored in $AIRFLOW_HOME/airflow-webserver.pid (or in /run/airflow/webserver.pid if started by systemd).
Out of the box, Airflow uses a sqlite database, which you should outgrow fairly quickly since no parallelization is possible using this database backend. It works in conjunction with the SequentialExecutor, which will only run task instances sequentially. While this is very limiting, it allows you to get up and running quickly and take a tour of the UI and the command line utilities.
Here are a few commands that will trigger a few task instances. You should be able to see the status of the jobs change in the example1 DAG as you run the commands below.
# run your first task instance
airflow run example_bash_operator runme_0 2015-01-01
# run a backfill over 2 days
airflow backfill example_bash_operator -s 2015-01-01 -e 2015-01-02
What's Next?
From this point, you can head to the Tutorial section for further examples, or to the How-to Guides section if you're ready to get your hands dirty.
Tutorial
This tutorial walks you through some of the fundamental Airflow concepts, objects, and their usage while writing your first pipeline.
Example Pipeline Definition
Here is an example of a basic pipeline definition. Do not worry if this looks complicated; a line-by-line explanation follows below.

"""
Code that goes along with the Airflow tutorial located at:
https://github.com/apache/incubator-airflow/blob/master/airflow/example_dags/tutorial.py
"""
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from datetime import datetime, timedelta


default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2015, 6, 1),
    'email': ['airflow@example.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
    # 'queue': 'bash_queue',
    # 'pool': 'backfill',
    # 'priority_weight': 10,
    # 'end_date': datetime(2016, 1, 1),
}

dag = DAG('tutorial', default_args=default_args, schedule_interval=timedelta(days=1))

# t1, t2 and t3 are examples of tasks created by instantiating operators
t1 = BashOperator(
    task_id='print_date',
    bash_command='date',
    dag=dag)

t2 = BashOperator(
    task_id='sleep',
    bash_command='sleep 5',
    retries=3,
    dag=dag)

templated_command = """
{% for i in range(5) %}
    echo "{{ ds }}"
    echo "{{ macros.ds_add(ds, 7) }}"
    echo "{{ params.my_param }}"
{% endfor %}
"""

t3 = BashOperator(
    task_id='templated',
    bash_command=templated_command,
    params={'my_param': 'Parameter I passed in'},
    dag=dag)

t2.set_upstream(t1)
t3.set_upstream(t1)
It's a DAG Definition File
One thing to wrap your head around (it may not be very intuitive for everyone at first) is that this Airflow Python script is really just a configuration file specifying the DAG's structure as code. The actual tasks defined here will run in a different context from the context of this script: different tasks run on different workers at different points in time, which means that this script cannot be used to cross-communicate between tasks. Note that for this purpose we have a more advanced feature called XCom.
People sometimes think of the DAG definition file as a place where they can do some actual data processing. That is not the case at all! The script's purpose is to define a DAG object, and it needs to evaluate quickly (seconds, not minutes), since the scheduler will execute it periodically to reflect any changes.
Importing Modules
An Airflow pipeline is just a Python script that happens to define an Airflow DAG object. Let's start by importing the libraries we will need.
# The DAG object; we'll need this to instantiate a DAG
from airflow import DAG

# Operators; we need this to operate!
from airflow.operators.bash_operator import BashOperator
Default Arguments
We're about to create a DAG and some tasks, and we have the choice to explicitly pass a set of arguments to each task's constructor (which would become redundant), or (better!) we can define a dictionary of default parameters that we can use when creating tasks.
from datetime import datetime, timedelta

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2015, 6, 1),
    'email': ['airflow@example.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
    # 'queue': 'bash_queue',
    # 'pool': 'backfill',
    # 'priority_weight': 10,
    # 'end_date': datetime(2016, 1, 1),
}
For more information about the BaseOperator's parameters and what they do, refer to the airflow.models.BaseOperator documentation.
Also, note that you could easily define different sets of arguments that would serve different purposes. An example of that would be having different settings between a production and a development environment.
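For instance, a team might key its default-argument dictionaries off an environment variable. A minimal sketch of this idea (the variable name DEPLOY_ENV and the per-environment values below are assumptions for illustration, not an Airflow convention):

```python
import os
from datetime import datetime, timedelta

# Arguments shared by every environment.
COMMON = {
    'owner': 'airflow',
    'start_date': datetime(2015, 6, 1),
    'retry_delay': timedelta(minutes=5),
}

# Development fails fast and quietly; production retries and sends email.
ENV_ARGS = {
    'dev':  dict(COMMON, retries=0, email_on_failure=False),
    'prod': dict(COMMON, retries=2, email_on_failure=True),
}

# Pick the dict for the current deployment (defaulting to 'dev').
default_args = ENV_ARGS[os.environ.get('DEPLOY_ENV', 'dev')]
print(default_args['retries'])
```

The resulting dictionary is then passed to the DAG constructor exactly as in the tutorial code.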
Instantiating a DAG
We'll need a DAG object to nest our tasks into. Here we pass a string that defines the dag_id, which serves as a unique identifier for your DAG. We also pass the default argument dictionary that we just defined, and define a schedule_interval of 1 day for the DAG.
dag = DAG(
    'tutorial', default_args=default_args, schedule_interval=timedelta(days=1))

Tasks
Tasks are generated when instantiating operator objects. An object instantiated from an operator is called a task. The first argument task_id acts as a unique identifier for the task.
t1 = BashOperator(
    task_id='print_date',
    bash_command='date',
    dag=dag)

t2 = BashOperator(
    task_id='sleep',
    bash_command='sleep 5',
    retries=3,
    dag=dag)
Notice how we pass a mix of operator-specific arguments (bash_command) and an argument common to all operators (retries) inherited from BaseOperator to the operator's constructor. This is simpler than passing every argument for every constructor call. Also, notice that in the second task we override the retries parameter with 3.
The precedence rules for a task are as follows:
1. Explicitly passed arguments
2. Values that exist in the default_args dictionary
3. The operator's default value, if one exists
A task must include or inherit the arguments task_id and owner; otherwise Airflow will raise an exception.
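The three precedence rules can be sketched as a simple lookup chain. This is a hypothetical re-implementation for illustration only; the real logic lives inside BaseOperator:

```python
def resolve_arg(name, explicit, default_args, operator_defaults):
    """Return the value that Airflow-style precedence would pick for `name`."""
    if name in explicit:            # 1. explicitly passed arguments win
        return explicit[name]
    if name in default_args:        # 2. then the default_args dictionary
        return default_args[name]
    return operator_defaults[name]  # 3. finally the operator's own default

default_args = {'retries': 1}
operator_defaults = {'retries': 0, 'queue': 'default'}

print(resolve_arg('retries', {'retries': 3}, default_args, operator_defaults))  # 3
print(resolve_arg('retries', {}, default_args, operator_defaults))              # 1
print(resolve_arg('queue', {}, default_args, operator_defaults))                # default
```

This mirrors why the tutorial's second task ends up with retries=3 even though default_args says 1.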
Templating with Jinja
Airflow leverages the power of Jinja Templating and provides the pipeline author with a set of built-in parameters and macros. Airflow also provides hooks for the pipeline author to define their own parameters, macros and templates.
This tutorial barely scratches the surface of what you can do with templating in Airflow, but the goal of this section is to let you know this feature exists, get you familiar with double curly brackets, and point to the most common template variable: {{ ds }} (today's date stamp).
templated_command = """
{% for i in range(5) %}
    echo "{{ ds }}"
    echo "{{ macros.ds_add(ds, 7) }}"
    echo "{{ params.my_param }}"
{% endfor %}
"""

t3 = BashOperator(
    task_id='templated',
    bash_command=templated_command,
    params={'my_param': 'Parameter I passed in'},
    dag=dag)
Notice that the templated_command contains code logic in {% %} blocks, references parameters like {{ ds }}, calls a function as in {{ macros.ds_add(ds, 7) }}, and references a user-defined parameter in {{ params.my_param }}.
The params hook in BaseOperator allows you to pass a dictionary of parameters and/or objects to your templates. Please take the time to understand how the parameter my_param makes it through to the template.
Files can also be passed to the bash_command argument, like bash_command='templated_command.sh', where the file location is relative to the directory containing the pipeline file (tutorial.py in this case). This may be desirable for many reasons, like separating your script's logic and pipeline code, allowing for proper code highlighting in files composed in different languages, and general flexibility in structuring pipelines. It is also possible to define your template_searchpath as pointing to any folder locations in the DAG constructor call.
Using that same DAG constructor call, it is possible to define user_defined_macros, which allow you to specify your own variables. For example, passing dict(foo='bar') to this argument allows you to use {{ foo }} in your templates. Moreover, specifying user_defined_filters allows you to register your own filters. For example, passing dict(hello=lambda name: 'Hello %s' % name) to this argument allows you to use {{ 'world' | hello }} in your templates. For more information regarding custom filters, have a look at the Jinja documentation.
For more information on the variables and macros that can be referenced in templates, make sure to read through the Macros section.
Setting up Dependencies
We have two simple tasks that do not depend on each other. Here's a few ways you can define dependencies between them:
t2.set_upstream(t1)

# This means that t2 will depend on t1
# running successfully to run.
# It is equivalent to:
# t1.set_downstream(t2)

t3.set_upstream(t1)

# all of this is equivalent to:
# dag.set_dependency('print_date', 'sleep')
# dag.set_dependency('print_date', 'templated')
Note that when executing your script, Airflow will raise exceptions when it finds cycles in your DAG or when a dependency is referenced more than once.
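The cycle check can be pictured as a topological sort over the task graph: if no valid ordering exists, the graph is not a DAG. A sketch using the standard library (graphlib, Python 3.9+; an illustration of the idea, not Airflow's actual implementation):

```python
from graphlib import TopologicalSorter, CycleError

# task -> set of upstream tasks, mirroring t2.set_upstream(t1) etc.
deps = {
    'print_date': set(),
    'sleep': {'print_date'},
    'templated': {'print_date'},
}
order = list(TopologicalSorter(deps).static_order())
print(order[0])  # print_date (it has no upstream tasks, so it sorts first)

# A dependency cycle makes the sort, and hence the DAG, invalid:
try:
    list(TopologicalSorter({'a': {'b'}, 'b': {'a'}}).static_order())
except CycleError:
    print('cycle detected')
```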
Recap
Alright, so we have a pretty basic DAG. At this point your code should look something like this:

"""
Code that goes along with the Airflow tutorial located at:
https://github.com/apache/incubator-airflow/blob/master/airflow/example_dags/tutorial.py
"""
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from datetime import datetime, timedelta


default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2015, 6, 1),
    'email': ['airflow@example.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
    # 'queue': 'bash_queue',
    # 'pool': 'backfill',
    # 'priority_weight': 10,
    # 'end_date': datetime(2016, 1, 1),
}

dag = DAG(
    'tutorial', default_args=default_args, schedule_interval=timedelta(days=1))

# t1, t2 and t3 are examples of tasks created by instantiating operators
t1 = BashOperator(
    task_id='print_date',
    bash_command='date',
    dag=dag)

t2 = BashOperator(
    task_id='sleep',
    bash_command='sleep 5',
    retries=3,
    dag=dag)

templated_command = """
{% for i in range(5) %}
    echo "{{ ds }}"
    echo "{{ macros.ds_add(ds, 7) }}"
    echo "{{ params.my_param }}"
{% endfor %}
"""

t3 = BashOperator(
    task_id='templated',
    bash_command=templated_command,
    params={'my_param': 'Parameter I passed in'},
    dag=dag)

t2.set_upstream(t1)
t3.set_upstream(t1)
AWS Application Example – EMR Cluster
This DAG demonstrates DAG parameters, shows the use of the Airflow Operators, and builds dependencies. The DAG accomplishes the following:
· Spin up an EMR cluster
from airflow import DAG
from airflow.operators import EmrOperator
from airflow.operators import GenieHiveOperator, GenieS3DistCpOperator, \
    GeniePigOperator, GenieSparkOperator
from datetime import datetime, timedelta
from batch_common import DMError, BICommon

# import profile properties
profile = BICommon().get_profile()
job_user = profile['JOB_USER']


# Static Values
CLUSTER_NAME = 'gck_af_beta_test_demo_129'
ON_FAILURE_CB = EmrOperator(owner='noowner', task_id='notask', cluster_action='terminate', cluster_name=CLUSTER_NAME).execute

default_args = {
    'owner': job_user,
    'wait_for_downstream': True,
    'start_date': datetime(2017, 1, 29),
    'email': ['LstDigitalTechNGAPAlerts@nike.com'],
    'email_on_failure': True,
    'email_on_retry': False,
    'retries': 1,
    'queue': 'airflow',
    'retry_delay': timedelta(seconds=10),
    'on_failure_callback': ON_FAILURE_CB
}

dag = DAG('gck_af_beta_test_demo_129', default_args=default_args, schedule_interval=timedelta(days=1))

distcp_cmd = '--src s3a://inbound/datascience/adhoc/ngap2_beta_tests/data --dest hdfs:///user/hadoop/data/landing'
hive_cmd = '-f s3://inbound/datascience/adhoc/ngap2_beta_tests/af_test_scripts/samplehivejob.hql'
pig_cmd = '-f s3://inbound/datascience/adhoc/ngap2_beta_tests/af_test_scripts/samplepigjob.pig'
spark_cmd = 's3://inbound/datascience/adhoc/ngap2_beta_tests/af_test_scripts/samplesparkjob.py'

task1 = EmrOperator(
    task_id='EmrSpinup',
    cluster_action='spinup',
    cluster_name=CLUSTER_NAME,
    num_task_nodes=1,
    num_core_nodes=1,
    classification='bronze',
    queue='airflow',
    cost_center='COST_CENTER',
    emr_version='5.0.0',
    applications=['hive', 'pig', 'spark'],
    dag=dag
)

task2 = GenieS3DistCpOperator(
    task_id='TestDistcpStep',
    command=distcp_cmd,
    job_name='distcpgeniejob',
    queue='airflow',
    sched_type=CLUSTER_NAME,
    dag=dag
)

task3 = GenieHiveOperator(
    task_id='TestHiveStep',
    command=hive_cmd,
    job_name='hivegeniejob',
    queue='airflow',
    sched_type=CLUSTER_NAME,
    dag=dag
)

task4 = GeniePigOperator(
    task_id='TestPigStep',
    command=pig_cmd,
    job_name='piggeniejob',
    queue='airflow',
    sched_type=CLUSTER_NAME,
    dag=dag
)

task5 = GenieSparkOperator(
    task_id='TestSparkStep',
    command=spark_cmd,
    job_name='sparkgeniejob',
    queue='airflow',
    sched_type=CLUSTER_NAME,
    dag=dag
)

task6 = EmrOperator(
    task_id='EmrTerminate',
    cluster_action='terminate',
    cluster_name=CLUSTER_NAME,
    queue='airflow',
    dag=dag
)

# Set Dependencies
task2.set_upstream(task1)
task3.set_upstream(task2)
task4.set_upstream(task3)
task5.set_upstream(task4)
task6.set_upstream(task5)
· Copy some data files from S3 to HDFS
· Run Hive, Pig and Spark tasks on that sample data
· Terminate the EMR cluster

Testing
Running the Script
Time to run some tests. First, let's make sure that the pipeline parses. Let's assume we're saving the code from the previous step in tutorial.py in the DAGs folder referenced in your airflow.cfg. The default location for your DAGs is ~/airflow/dags.
python ~/airflow/dags/tutorial.py
If the script does not raise an exception, it means that you haven't done anything horribly wrong and that your Airflow environment is somewhat sound.
Command Line Metadata Validation
Let's run a few commands to validate this script further:
# print the list of active DAGs
airflow list_dags

# prints the list of tasks in the "tutorial" dag_id
airflow list_tasks tutorial

# prints the hierarchy of tasks in the "tutorial" DAG
airflow list_tasks tutorial --tree
Testing
Let's test by running the actual task instances on a specific date. The date specified in this context is an execution_date, which simulates the scheduler running your task or dag at a specific date + time:
# command layout: command subcommand dag_id task_id date

# testing print_date
airflow test tutorial print_date 2015-06-01

# testing sleep
airflow test tutorial sleep 2015-06-01
Now remember what we did with templating earlier? See how this template gets rendered and executed by running this command:
# testing templated
airflow test tutorial templated 2015-06-01
This should result in displaying a verbose log of events and ultimately running your bash command and printing the result.
Note that the airflow test command runs task instances locally, outputs their log to stdout (on screen), doesn't bother with dependencies, and doesn't communicate state (running, success, failed, ...) to the database. It simply allows testing a single task instance.
Backfill
Everything looks like it's running fine, so let's run a backfill. backfill will respect your dependencies, emit logs into files, and talk to the database to record status. If you do have a webserver up, you'll be able to track the progress. airflow webserver will start a web server if you are interested in tracking the progress visually as your backfill progresses.
Note that if you use depends_on_past=True, individual task instances will depend on the success of the preceding task instance, except for the start_date specified itself, for which this dependency is disregarded.
The date range in this context is a start_date and optionally an end_date, which are used to populate the run schedule with task instances from this dag.
# optional, start a web server in debug mode in the background
# airflow webserver --debug &

# start your backfill on a date range
airflow backfill tutorial -s 2015-06-01 -e 2015-06-07
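Conceptually, the -s/-e range above expands into one task-instance run per schedule_interval. A rough stdlib sketch of that expansion (illustration only, ignoring time zones and Airflow's actual scheduling logic):

```python
from datetime import date, timedelta

def execution_dates(start, end, interval=timedelta(days=1)):
    """Yield every execution date from start to end, inclusive."""
    current = start
    while current <= end:
        yield current
        current += interval

# The range backfilled by: airflow backfill tutorial -s 2015-06-01 -e 2015-06-07
runs = list(execution_dates(date(2015, 6, 1), date(2015, 6, 7)))
print(len(runs))  # 7 daily runs
```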
What's Next?
That's it; you've written, tested and backfilled your very first Airflow pipeline. Merging your code into a code repository that has a master scheduler running against it should get it triggered and run every day.
Here's a few things you might want to do next:
· Take an in-depth tour of the UI - click all the things!
· Keep reading the docs! Especially the sections on:
o Command line interface
o Operators
o Macros
· Write your first pipeline
Deploying and Running on AWS
Step 1: Copy the DAG to the Airflow Scheduler
· Click on http://tty.gck.ngap2.nike.com:8080
· You will be in the airflow home directory: /var/lib/airflow
· Change directory to /var/lib/airflow/dags
· cd dags
· Make sure the start date of the dag is updated
· If you don't have a subdirectory underneath the dags folder, you will want to create one, as the push_dags command expects a subdirectory
· mkdir
· cd
· Copy your DAG into a python file, or optionally use the AWS S3 CLI
· aws s3 cp s3://nikeemrbin/gck/airflow/dags/gck_af_beta_test_demo_129.py gck_af_beta_test_demo_228.py
· Sync your DAG from the Airflow Scheduler to the Workers
· push_dag
· follow the prompts
· For a development environment, this might be the way you copy your DAGs to the Airflow Scheduler and Workers
· For a test or production environment, this would typically be handled through a Jenkins build
You're not limited by the EMR versions in the PaaSUI. If you are spinning up a cluster in your dag, you can choose any EMR version.
Airflow Scheduler TTY Shell

Additionally, you can capture the IP address from this TTY shell or from Flower and log in through your local terminal with Jumpbox.
Step 2: Refresh Airflow DAG and Turn it On from the Scheduler GUI
· Click on http://airflow.gck.ngap2.nike.com:8080
· Locate your DAG
· Refresh DAG
· Turn DAG on
 Airflow Scheduler GUI DAG Refresh

 
Step 3: Open Airflow DAG and Run
· Click on the first task and Run, if not already on a schedule
Airflow Scheduler GUI Run DAG

Step 4 View DAG Tasks in Celery Flower
· Click on http://flower.gck.ngap2.nike.com:8080
· Click on Tasks
Celery Flower GUI Tasks

Viewing Logs on AWS
View Airflow Logs
· Click on a Genie Operator Task and select the View Logs button
· Notice the link to Kibana to visualize your logs
· Click the Kibana link and notice the EMR Cluster Name is now pinned; you should only see your cluster logs
Airflow Scheduler GUI Task Log

Kibana Dashboard for EMR Cluster Metrics

About Airflow
1. Airflow User Guides & Documentation
2. Airflow Clusters
3. Team Access to Clusters
4. Accessing Airflow GPU Nodes

Apache Airflow is our job-scheduling utility. It is used to create pipelines, usually for ETL operations. Jobs are run on our EMR clusters and managed via a variety of tools.

Access
The following services are provided:
Airflow: DAG interface
Celery: cluster health and monitoring

These services and clusters can be found in the PaaSUI. The master node can also be accessed via the SSHRemotely tool.
Security
Access is determined by your NGAP2 AD group. Only group members can see (and access) your clusters and jobs.
To access Okera clusters, utilize these parameters in the EMR spinup step of your DAG:
cerebro_cdas=True,
tags=[
    {'Key': 'cerebro_cluster', 'Value': profile['ENV']}
]

Airflow User Guides Documentation
Topic
Link
Notes
Airflow Tutorial
https://airflow.incubator.apache.org/tutorial.html
REQUIRED reference for all Airflow users. Start here. Return when you are stuck.
DAG Development
Airflow Best Practices DAG Development
Validating and Testing DAGs
Airflow Operators (NGAP 2)
Airflow REST API
Airflow Basics Overview Presentation
Internal documents to guide Airflow developers
Release 32 adds: SQS Operator, Snowflake python connector, Cerberus libraries, R libraries
Airflow Operations
FAQs DAG problems Airflow tasks etc
Airflow Operations Best Practices
Content on common Airflow problems
Key points on Airflow operations 
Airflow Production DAG Status Dashboard
For DTC Users NGAP2 Airflow Status Dashboard
For GCK Users NGAP2 Airflow Status Dashboard
For Default Site NGAP 20 Airflow DAG Status Dashboard
These dashboards are meant to check the real-time status of DAG runs in the production environments for the past 3 days.
Currently published in the Default, DTCAnalytics and GCK sites.
Airflow Clusters
Each NGAP 2 team that needs Airflow will get a dedicated Airflow cluster with 3 user components:
Airflow URL: http://airflow.<clustername>.ngap2.nike.com:8080
Celery Flower URL: http://flower.<clustername>.ngap2.nike.com:8080
If found necessary, teams can request multiple AF clusters for use in their development workflow (dev/qa/prod).
Team Access to Clusters
For every cluster, the three service URLs follow the same pattern:
http://airflow.<clustername>.ngap2.nike.com:8080 | http://flower.<clustername>.ngap2.nike.com:8080 | http://tty.<clustername>.ngap2.nike.com:8080
(A few older clusters omit the :8080 port: userservicessocial, searchengineering, globalsupplychain. The Platform Support Airflow URL ends in /admin.)

Team | Cluster name (lower case) | Notes/detail | Plugin version
GCK | gck | GCK PROD with Tensorflow instances (airflow-troposphere-tensorflow-r) | 0.1.26
GCKWG | gckwg | GCKWG Prod | 0.1.27
GCK Engineering | gckengineering | GCK Engineering PROD (anaconda-r) | 0.1.26
GCK Engineering | gckengineeringdev | GCK Engineering Dev (anaconda-r) | 0.1.27
GCK Engineering | gckengineeringqa | GCK Engineering QA (anaconda-r) | 0.1.27
GCK Engineering | gckengineeringuat | GCK Engineering UAT | 0.1.27
GCK Engineering WG | gckengineeringwg | GCK Eng WG PROD | 0.1.26
ChinaBIEngineering | chinabiengineering | ChinaBI Engineering Prod | 0.1.28
DTC | dtc | DTC PROD (anaconda-r) | 0.1.26
DTC | dtcdev | DTC DEV (anaconda-r) | 0.1.27
DTC | dtcqa | DTC QA (anaconda-r) | 0.1.26
DTC Engineering | dtcengineering | DTCengineering PROD | 0.1.25
DTC Engineering | dtcengineeringdev | DTCengineering DEV | 0.1.27
DTC Engineering | dtcengineeringqa | DTCengineering QA | 0.1.26
DTC Engineering WG | dtcengineeringwg | DTCengineeringWG PROD | 0.1.25
RDF Engineering | rdfengineering | RDF Engineering Prod | 0.1.24
RDF Engineering | rdfengineeringdev | RDF Engineering DEV | 0.1.27
RDF Engineering | rdfengineeringqa | RDF Engineering QA | 0.1.27
RDF Engineering WG | rdfengineeringwg | RDF Engineering WG Prod | 0.1.26
RDF Engineering WG | rdfengineeringwgqa | RDF Engineering WG QA | 0.1.27
EU BI | eubidev | EU BI Dev | 0.1.27
EU BI | eubiqa | EU BI QA | 0.1.27
EU BI | eubi | EU BI Prod | 0.1.24
NikePlus | nikeplus | NIKEPLUS PROD | 0.1.26
NikePlus | nikeplusqa | NIKEPLUS NONPROD QA with Tensorflow instances | 0.1.27
RTLengineering | rtlengineering | RTLEngineering PROD | 0.1.26
RTLengineering | rtlengineeringdev | RTLEngineering Dev | 0.1.27
RTLengineering | rtlengineeringqa | RTLEngineering QA | 0.1.27
UserServicesSocial | userservicessocial | User Services Social Prod | 0.1.24
UserServicesSocial | userservicessocialdev | User Services Social Dev | 0.1.27
NDeInnovation | ndeinnovationdev | NDeInnovation Dev | 0.1.27
NDeInnovation | ndeinnovation | NDeInnovation Prod | 0.1.24
RetailExperience | retailexperiencedev | RetailExperience Dev | 0.1.27
RetailExperience | retailexperience | RetailExperience Prod | 0.1.24
RFSengineeringWG | rfsengineeringwgqa | RFSengineeringWG QA | 0.1.27
RFSengineeringWG | rfsengineeringwg | RFSengineeringWG Prod | 0.1.22
MPAA | mpaa | MPAA Prod | 0.1.24
MPAA | mpaaqa | MPAA QA | 0.1.27
Integrated Knowledge | integratedknowledge | Integrated Knowledge Prod | 0.1.25
Search Engineering | searchengineering | Search Engineering Prod | 0.1.25
GlobalSupplyChain | globalsupplychain | GlobalSupplyChain Prod | 0.1.25
Platform Support | platformsupport | Platform Support | 0.1.25
PME | pme | PME | 0.1.26
Adhoc Query Cluster (spins up Presto-enabled EMR clusters every day) | aqcluster | Alpha | 0.1.26
DSMPlanningAnalytics | dsmplanninganalytics | DSMPlanningAnalytics | 0.1.27
DSMPlanningAnalytics | dsmplanninganalyticsqa | DSMPlanningAnalytics QA | 0.1.27
GlobalROSE | globalrose | GlobalRose | 0.1.27
EDAAML | edaaml | EDAAML | 0.1.27
Accessing Airflow GPU Nodes
Airflow GPU worker nodes will have tty access while the Instance is spun up, and the access will be revoked when the Instance is terminated.
This applies to Tensorflow worker Instances (p2.xlarge, p2.8xlarge and p2.16xlarge) and
AnacondaR worker Instances.
The Airflow Operators TensorflowSpinUp and TensorflowTerminate are used in Airflow DAGs to spin up and terminate the Tensorflow Instances. Similarly, the Airflow Operators AnacondaRSpinUpOperator and AnacondaRTerminateOperator are used in Airflow DAGs to spin up and terminate the AnacondaR Instances.
AnacondaR
An example of a tty endpoint of an AnacondaR worker node is given below:
tty-tf-<clustername>-<queuename>.ngap2.nike.com:8080
where
clustername will be the name of the Airflow Cluster (e.g. dtc, dtcdev, dtcqa, gckengineering, etc.)
queuename will be the name of the AnacondaR Queue (e.g. anacondar-r3.8xlarge-1)
If an instance is spun up in the dtcdev environment in the queue anacondar-r3.8xlarge-1, the endpoint will be like:
tty-tf-dtcdev-anacondar-r3.8xlarge-1.ngap2.nike.com:8080
Tensorflow
An example of a tty endpoint of a Tensorflow worker node is given below:
tty-tf-<clustername>-<queuename>.ngap2.nike.com:8080
where
clustername will be the name of the Airflow Cluster (e.g. nikeplus, nikeplusqa)
queuename will be the name of the Tensorflow Queue (e.g. tensorflow_tf-p2.xlarge-1)
If an instance is spun up in the nikeplusqa environment in the queue tensorflow_tf-p2.xlarge-2, the endpoint will be like:
tty-tf-nikeplusqa-tensorflow_tf-p2.xlarge-2.ngap2.nike.com:8080


Airflow Plugin Version
0.1.26 - EMR Plugin changes for Cerebro
0.1.25 - GenieSnowflakeOperator
0.1.24 - Critical security updates for Intel CPU issues; OS moved from Ubuntu to Amazon Linux; operator changes (Athena to use boto, EMR clusters with dstools, instance fleet timeout_duration fixes, Cerebro integration, Genie pig operator)
0.1.23 - GeniePigBatchOperator; plugin changes for airflow not to error out when EMR already exists; extend timeout for airflow EMR spinup in case of Instance Fleet; extend timeout for dstools cluster spinup; airflow operator change for Cerebro
0.1.22 - Support for Instance Fleet with Anaconda R and TF workers; additional enhancements for Anaconda R worker
0.1.21 - Plugin changes for Instance Fleet
0.1.20 - Fix for SLAWatcher; TTY access for dynamic AnacondaR instances
0.1.19 - Tableau Extract Refresh Operator; Anaconda R worker Spin & Terminate; TTY access for TF instances
Airflow Best Practices: DAG Development
1. It's recommended to go through the Airflow tutorial to understand the basics of Airflow: http://pythonhosted.org/airflow/tutorial.html
2. Default Args: fill in your team information (owner, email) to identify your dag on the UI and get notified if the dag fails.

Here is the DEFAULT_ARGS we have been using in the development environment:
DEFAULT_ARGS = {
    'owner': 'etldev',
    'depends_on_past': True,
    'wait_for_downstream': True,
    'start_date': datetime(2017, 1, 18),
    'email': ['xxx@nike.com'],
    'email_on_failure': True,
    'email_on_retry': False,
    'retries': 2,
    'provide_context': True,
    'retry_delay': timedelta(minutes=2),
}
3. When setting up the default args, we recommend setting depends_on_past and wait_for_downstream to True and setting the last task as the downstream of the first task. This way the dag will wait for the previous day's dag to complete before running the current day. However, this is also a common cause of a task not being kicked off: the previous day's task wasn't complete.
4. Airflow kicks off a scheduled run after the last minute of the scheduled day is complete. That means if you want a dag to start running on Feb 7 2017, the dag will run at 4:00pm (or 5:00pm during daylight saving time) on Feb 7, which is actually Feb 8 in UTC.
5. Make sure to update the start date if it is set to a date in the past; Airflow will try to run all the previous days before catching up.
6. For EMR dags we recommend retrying two times before failing the task.
7. Make sure to name your dag so you can easily identify which directory the dag resides under in the /var/lib/airflow/dags directory.
8. Make sure to validate the dag before deploying it to the cluster by typing 'airflow list_dags -sd <dag file>', e.g. airflow list_dags -sd test.py
9. Use persistent clusters.
10. Use a sensor prior to the BatchStartOperator.
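The UTC arithmetic in point 4 can be checked with a quick stdlib sketch. The 4:00pm figure assumes the local times in the doc are US Pacific (UTC-8 in winter, UTC-7 in summer); that timezone is an assumption.

```python
from datetime import datetime, timedelta

# Airflow schedules in UTC. The daily run covering Feb 7 kicks off at
# midnight UTC on Feb 8, which is 4:00pm Pacific Standard Time on Feb 7
# (5:00pm during daylight saving time, when the offset is UTC-7).
utc_trigger = datetime(2017, 2, 8, 0, 0)   # midnight UTC, start of Feb 8
pst_offset = timedelta(hours=-8)           # Pacific Standard Time offset
local_trigger = utc_trigger + pst_offset
print(local_trigger)                       # 2017-02-07 16:00:00
```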




 
Parameters
· task_id (string) – a unique, meaningful id for the task
· owner (string) – the owner of the task; using the unix username is recommended
· retries (int) – the number of retries that should be performed before failing the task
· retry_delay (timedelta) – delay between retries
· start_date (datetime) – start date for the task; the scheduler will start from this point in time. Note that if you run a DAG on a schedule_interval of one day, the run stamped 2016-01-01 will be triggered soon after 2016-01-01T23:59. In other words, the job instance is started once the period it covers has ended.
· end_date (datetime) – if specified, the scheduler won't go beyond this date
· depends_on_past (bool) – when set to true, task instances will run sequentially while relying on the previous task's schedule to succeed. The task instance for the start_date is allowed to run.
· wait_for_downstream (bool) – when set to true, an instance of task X will wait for tasks immediately downstream of the previous instance of task X to finish successfully before it runs. This is useful if the different instances of a task X alter the same asset, and this asset is used by tasks downstream of task X. Note that depends_on_past is forced to True wherever wait_for_downstream is used.
· queue (str) – which queue to target when running this job. Not all executors implement queue management; the CeleryExecutor does support targeting specific queues.
· dag (DAG) – a reference to the dag the task is attached to (if any)
· priority_weight (int) – priority weight of this task against other tasks. This allows the executor to trigger higher-priority tasks before others when things get backed up.
· pool (str) – the slot pool this task should run in; slot pools are a way to limit concurrency for certain tasks
· sla (datetime.timedelta) – time by which the job is expected to succeed. Note that this represents the timedelta after the period is closed. For example, if you set an SLA of 1 hour, the scheduler would send an email soon after 1:00AM on 2016-01-02 if the 2016-01-01 instance has not succeeded yet. The scheduler pays special attention to jobs with an SLA and sends alert emails for SLA misses. SLA misses are also recorded in the database for future reference. All tasks that share the same SLA time get bundled in a single email, sent soon after that time. SLA notifications are sent once and only once for each task instance.
· execution_timeout (datetime.timedelta) – max time allowed for the execution of this task instance; if it goes beyond, it will raise and fail
· on_failure_callback (callable) – a function to be called when a task instance of this task fails; a context dictionary is passed as a single parameter to this function. Context contains references to related objects of the task instance and is documented under the macros section of the API.
· on_retry_callback – much like the on_failure_callback, except that it is executed when retries occur
· on_success_callback (callable) – much like the on_failure_callback, except that it is executed when the task succeeds
· trigger_rule (str) – defines the rule by which dependencies are applied for the task to get triggered. Options are: { all_success | all_failed | all_done | one_success | one_failed | dummy }; default is all_success. Options can be set as a string or using the constants defined in the static class airflow.utils.TriggerRule.
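The sla semantics above ("the timedelta after the period is closed") work out as in this small sketch of the doc's 1-hour example:

```python
from datetime import datetime, timedelta

# For a daily run stamped 2016-01-01 with sla=timedelta(hours=1):
# the period it covers closes at 2016-01-02 00:00, and the SLA alert
# fires soon after one more hour, i.e. 2016-01-02 01:00.
execution_date = datetime(2016, 1, 1)
schedule_interval = timedelta(days=1)
sla = timedelta(hours=1)

period_close = execution_date + schedule_interval
alert_time = period_close + sla
print(alert_time)   # 2016-01-02 01:00:00
```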
Validating and Testing the DAG
Reference
http://pythonhosted.org/airflow/tutorial.html
 
Validating the Script
Time to run some tests. First, let's make sure that the pipeline parses. Let's assume we're saving the code from the previous step in tutorial.py in the DAGs folder referenced in your airflow.cfg. The default location for your DAGs is ~/airflow/dags.
 
python ~/airflow/dags/tutorial.py
 
If the script does not raise an exception it means that you haven’t done anything horribly wrong and that your Airflow environment is somewhat sound
 
Command Line Metadata Validation
Let’s run a few commands to validate this script further
 
# validate dag for one dag file
airflow list_dags -sd <dag file>
e.g. airflow list_dags -sd /var/lib/airflow/dags/platform/sample.py

# prints the list of tasks for a dag that is named "sample"
airflow list_tasks -sd <dag file> <dag_id>
e.g. airflow list_tasks -sd /var/lib/airflow/dags/platform/sample.py sample

# prints the hierarchy of tasks in the tutorial DAG
airflow list_tasks tutorial -sd <dag file> --tree
 
 
Testing
Let's test by running the actual task instances on a specific date. The date specified in this context is an execution_date, which simulates the scheduler running your task or dag at a specific date + time.
# command layout: command subcommand dag_id task_id date
# testing print_date
airflow test -sd <dag file> <dag_id> <task_id> <date>
e.g. airflow test -sd /var/lib/airflow/dags/sample.py sample sample_task 2015-06-01

 
This should result in displaying a verbose log of events, and ultimately running your bash command and printing the result.
Note that the airflow test command runs task instances locally, outputs their log to stdout (on screen), doesn't bother with dependencies, and doesn't communicate state (running, success, failed, ...) to the database. It simply allows testing a single task instance.
Airflow operators
Import airflow operators in the DAG:
from airflow.operators import *
NGAP2 Airflow Operators 
All Airflow out of the box operators are available
· AnacondaR Operator
· AthenaOperator
· BashOperator (Ngap2)
· BatchStartOperator (NGAP2)
· BoxFileTransferOperator
· CanaryOperator (Ngap2)
· CatchUpOperator
· DMDoneOperator(Ngap2)
· DMErrorOperator (Ngap2)
· DMStartOperator (Ngap2)
· EMR Operator (Ngap2)
· GenieDistCpOperator (Ngap2)
· GenieHadoopOperator (Ngap2)
· GenieHiveOperator (Ngap2)
· GeniePigOperator (Ngap2)
· GenieSnowflakeOperator (Ngap2)
· GenieSparkOperator (Ngap2)
· GenieSqoopBatchOperator(Ngap2)
· GenieSqoopOperator(Ngap2)
· InsightsOperator(Ngap2)
· Slack Operator
· SnowFlakeOperator (Ngap 2)
· SQS Operators
· TabExtractRefreshOperator (NGAP2)
· TaskDependencySensor (NGAP2)
· TensorFlow Operators
Standard Airflow Operators
See the Airflow documentation for details: http://pythonhosted.org/airflow/code.html#operators
Airflow FAQ (Ngap2)
Refer to the Airflow FAQ: http://pythonhosted.org/airflow/faq.html
· How do I deploy the DAG scripts to my airflow environment
· How do I know if my task was forced run or marked success
· How to create a common module to be used for airflow dags
· How to deploy profile to airflow cluster
· How to schedule a dag to run weekday only
· How to schedule multiple runs in a day
· My dag is not refreshed in the web UI
· Use Hive shared metastore for Pig jobs
· Why is my task not running
Airflow Best Practices Operations
1. Make sure to run 'airflow list_dags -sd <dag file>' to validate the metadata before deploying the dag.
2. Clean up dags that are not needed in the cluster. The number of dags in the cluster has an impact on the performance of the cluster; you will see more delay with more dags in the cluster. See Purging Unused Airflow DAGS.
3. When you put a dag on hold, make sure to clear out any running tasks. Any sensor task that is running will keep occupying a slot on the worker node.
4. When you want to mark a running task success, you will have to hold the dag, clear the task, and mark the task success. This will clear out the slot on the worker. If you just mark a running task success, the task will continue to run in the backend, keep occupying a slot, and you won't be able to clear the slot.
5. When promoting a dag from one environment to another environment, make sure to update the start date of the dag.
6. In case of cleaning up the xcom table, use the attached delete_xcom.py script from any Airflow TTY instance. This will clean up any data older than 30 days from the relevant metadata table.
> python delete_xcom.py
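The attached delete_xcom.py is not reproduced here; the sketch below only illustrates the kind of cleanup it performs (deleting xcom rows older than 30 days), using an in-memory sqlite table as a stand-in. The table and column names are illustrative, not the real Airflow metadata schema.

```python
import sqlite3
from datetime import datetime, timedelta

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE xcom (key TEXT, execution_date TIMESTAMP)')

now = datetime(2017, 4, 18)
conn.execute('INSERT INTO xcom VALUES (?, ?)', ('old', now - timedelta(days=45)))
conn.execute('INSERT INTO xcom VALUES (?, ?)', ('recent', now - timedelta(days=5)))

# delete anything older than 30 days, which is what the cleanup script does
cutoff = now - timedelta(days=30)
conn.execute('DELETE FROM xcom WHERE execution_date < ?', (cutoff,))

remaining = [row[0] for row in conn.execute('SELECT key FROM xcom')]
print(remaining)   # ['recent']
```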
Rest API for Airflow
The source of this Rest API plugin is https://github.com/teamclairvoyant/airflow-rest-api-plugin
The Rest API can be called using the curl command to run Airflow operations like run task, pause dag.
To use the Rest API you will have to get the authentication token, just like when you need to access any URL from NGAP2.
The detailed documentation of the API is in your airflow UI:
http://<airflow host>:8080/admin/rest_api/
Or
From the Airflow UI: Admin → Rest API Plugin
 

 
Examples
AUTH_TOKEN=<your auth token>
 
#UNPAUSE_DAG
curl -v --cookie "cdtplatformauth=${AUTH_TOKEN}" -X GET "http://airflow-gckengineeringdev.ngap2.nike.com:8080/admin/rest_api/api/v1.0/unpause?dag_id=test_R"

#PAUSE_DAG
curl -v --cookie "cdtplatformauth=${AUTH_TOKEN}" -X GET "http://airflow-gckengineeringdev.ngap2.nike.com:8080/admin/rest_api/api/v1.0/pause?dag_id=test_R"

#TASK STATE
#run_r
curl -v --cookie "cdtplatformauth=${AUTH_TOKEN}" -X GET "http://airflow-gckengineeringdev.ngap2.nike.com:8080/admin/rest_api/api/v1.0/task_state?dag_id=test_R&task_id=run_r&execution_date=2017-04-18&subdir="

#run_r_beeline
curl -v --cookie "cdtplatformauth=${AUTH_TOKEN}" -X GET "http://airflow-gckengineeringdev.ngap2.nike.com:8080/admin/rest_api/api/v1.0/task_state?dag_id=test_R&task_id=run_r_beeline&execution_date=2017-04-18&subdir="

#TEST A TASK run_r_beeline
curl -v --cookie "cdtplatformauth=${AUTH_TOKEN}" -X GET "http://airflow-gckengineeringdev.ngap2.nike.com:8080/admin/rest_api/api/v1.0/test?dag_id=test_R&task_id=run_r_beeline&execution_date=2017-04-18&subdir=&task_params="
 
# RUN A TASK
curl -v --cookie "cdtplatformauth=${AUTH_TOKEN}" -X GET "http://airflow-gckengineeringdev.ngap2.nike.com:8080/admin/rest_api/api/v1.0/run?dag_id=test_R&task_id=run_r&execution_date=2017-04-18&subdir=&force=on&ignore_dependencies=on&ignore_depends_on_past=on"

#UNPAUSE_DAG
curl -v --cookie "cdtplatformauth=${AUTH_TOKEN}" -X GET "http://airflow-gckengineeringdev.ngap2.nike.com:8080/admin/rest_api/api/v1.0/unpause?dag_id=test_R"
Batch code deployment in Airflow
To deploy batch code from the BIG project (https://bitbucket.nike.com/projects/BIG) to an NGAP2 Airflow cluster:
Create an 'airflow_deploy' folder in the awsbatch project, and put the scripts or prop folder inside this folder.
Any folder/scripts under the airflow_deploy directory will be synced down to all the airflow nodes: the scheduler and all worker nodes.
See example:

 
In airflow the code will be in /app/bin/dtccommerce/airflow_deploy/prop and /app/bin/dtccommerce/airflow_deploy/scripts
 
 
Go to Jenkins (http://jenkins.ngap.nike.com:8080/view/deployment/job/airflow-batch-deploy) and enter the repository, branch, and the target airflow cluster name.

After the Jenkins deploy, the code will be synced down to the airflow cluster every 5 minutes.
 
When you need to access the code from a dag, use the exact path /app/bin/<project>/airflow_deploy
Running R script from Airflow AnacondaR worker node
To run an AnacondaR script in Airflow, the cluster has to be set up with the AnacondaR worker node. If you have this need, please submit a JIRA ticket to the platform support team to update your Airflow cluster.
If your cluster has the AnacondaR worker node set up, use the 'anacondar' queue to submit the job to that worker.
Use BashOperator to submit a bash command to run your R script. You will need to activate a virtual environment to run the R script:
source /opt/dstools/Anaconda3/envs/renv/bin/activate renv

If you need R to connect via beeline to a running EMR cluster, make sure to export the required variables:
export JAVA_HOME=/usr/lib/jvm/default-java; export HIVE_HOME=/usr/local/apache-hive-2.1.0-bin; export PATH=$HIVE_HOME/bin:$PATH; export CLASSPATH=$CLASSPATH:/usr/local/apache-hive-2.1.0-bin/lib/*; export HADOOP_HOME=/usr/local/hadoop-2.7.3; export CLASSPATH=$CLASSPATH:/usr/local/hadoop-2.7.3/lib/*

Note: Refer to Data Science Tools (Old) for more information about virtualenv and R script execution on the data science anaconda node.
Example (Running just an R script)

from airflow.operators import BashOperator
task2 = BashOperator(
    task_id='run_r',
    bash_command='source /opt/dstools/Anaconda3/envs/renv/bin/activate renv; Rscript /app/bin/r/a.R',
    queue='anacondar',
    dag=dag)
 
a.R code in the example:
a <- 42
A <- a * 2
print(a)
print(A)
Example (Running an R script that connects via beeline to a running EMR cluster)

task3 = BashOperator(
    task_id='run_r_beeline',
    bash_command='source /opt/dstools/Anaconda3/envs/renv/bin/activate renv; '
                 'export JAVA_HOME=/usr/lib/jvm/default-java; '
                 'export HIVE_HOME=/usr/local/apache-hive-2.1.0-bin; '
                 'export PATH=$HIVE_HOME/bin:$PATH; '
                 'export CLASSPATH=$CLASSPATH:/usr/local/apache-hive-2.1.0-bin/lib/*; '
                 'export HADOOP_HOME=/usr/local/hadoop-2.7.3; '
                 'export CLASSPATH=$CLASSPATH:/usr/local/hadoop-2.7.3/lib/*; '
                 'Rscript /app/bin/r/rbeeline.R',
    queue='anacondar',
    dag=dag)
 
rbeeline.R code in the example (host name placeholder; the original host prefix is not recoverable):
rBeeQry <- function(qry, host = '<emr master>.ngap2.nike.com:10000', user = 'hadoop') {
  beecmd <- paste('beeline -u jdbc:hive2://', host, ' -n ', user, " -e '", qry, "'", sep = '')
  res <- system(beecmd, intern = TRUE)
  res
}
hqltxt <- 'select preferred_gender, count(1) as n from member.member group by preferred_gender'
result <- rBeeQry(hqltxt)
New Relic for Airflow
New Relic is a performance monitoring and management tool, primarily used to get more insight at the instance level. New Relic will be a part of all the long-running Airflow nodes. The installation link for New Relic is given below:
https://rpm.newrelic.com/get_started_with_server_monitoring
A sample dashboard is given below

 
 
New Relic also integrates with Cloud Health to provide more insight about cost management and resource utilization.
Deleting a DAG
This feature helps to delete a DAG from airflow.
How it works
It accepts the DAG name as the input parameter and deletes the DAG-related entries from S3, the airflow scheduler, and the airflow database.
Usage
delete_dag <dag_name>

Note: a sync script running every 15 mins on the airflow workers will delete the dag from the workers.
Using BashOperator to persist a variable
You can use BashOperator to persist a variable as long as the value is in the last line of the standard output.
Here is an example of how you do it in a task:
get_latest_s3_file_membership_results = BashOperator(
    task_id='get_latest_s3_file_membership_results',
    bash_command='echo $(aws s3 ls s3://' + profile['S3_NIKEBI_MANAGED'] + '/ck/memberID/userengagement/membership_results --recursive | sort | tail -n 1 | cut -c84-91)',
    queue='airflow',
    xcom_push=True,
    dag=dag
)
The above task persists the latest s3 directory, pushing the value into xcom.
You can pull the value in the successor task by referring to the task_id and the key 'return_value',
e.g.:
latest_ms_lkp_visitor_upm_member_file = "{{ task_instance.xcom_pull(key='return_value', task_ids='get_latest_s3_file_ms_lkp_visitor_upm_member') }}"
hive_cmd_membership_results_alter_table = alter_table_membership_results_query.format(profile['MEMBER_DB'].lower(), 'membership_results', s3_emr_membership_results, latest_membership_results_file)
To pull the xcom value into the successor task you have to set the flag xcom_pull=True:
alter_membership_results_table = AthenaOperator(
    task_id='alter_membership_results_table',
    queue='airflow',
    xcom_pull=True,
    command=hive_cmd_membership_results_alter_table,
    dag=dag)
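The push/pull mechanics above can be pictured with a tiny in-memory stand-in. This is not the Airflow API, just an illustration of how values are looked up by task_id and the implicit 'return_value' key:

```python
class FakeXCom:
    """Illustrative stand-in for Airflow's XCom store: values are keyed
    by (task_id, key); a BashOperator with xcom_push=True stores the last
    line of stdout under the implicit key 'return_value'."""
    def __init__(self):
        self.store = {}

    def push(self, task_id, value, key='return_value'):
        self.store[(task_id, key)] = value

    def pull(self, task_ids, key='return_value'):
        return self.store.get((task_ids, key))

xcom = FakeXCom()
# the producer task persists the last line of its stdout
xcom.push('get_latest_s3_file_membership_results', '20170418')
# the successor task pulls it back by task_id + 'return_value'
print(xcom.pull(task_ids='get_latest_s3_file_membership_results'))  # 20170418
```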
Airflow Operation Support Policy
We encourage the user to enter a support ticket to report any Airflow issues using the BIPS Airflow link (Business Intelligence and Analytics Production Support).
1. A support ticket is required for any non-production airflow issues.
2. A support ticket is required for any production airflow issues that need investigation and troubleshooting.
3. The user is encouraged to report Airflow production issues via the Airflow support slack channel #ngapairflow.
   a. Please make sure that you address your message to the platform team if a prompt response is expected.

Operator Tutorial
Setting up the sandbox in the Quick Start section was easy; building a production-grade environment requires a bit more work.
These how-to guides will step you through common tasks in using and configuring an Airflow environment.
Setting Configuration Options
The first time you run Airflow, it will create a file called airflow.cfg in your AIRFLOW_HOME directory (~/airflow by default). This file contains Airflow's configuration, and you can edit it to change any of the settings. You can also set options with environment variables by using this format: AIRFLOW__{SECTION}__{KEY} (note the double underscores).
For example, the metadata database connection string can either be set in airflow.cfg like this:
[core]
sql_alchemy_conn = my_conn_string
or by creating a corresponding environment variable:
AIRFLOW__CORE__SQL_ALCHEMY_CONN=my_conn_string
You can also derive the connection string at run time by appending _cmd to the key, like this:
[core]
sql_alchemy_conn_cmd = bash_command_to_run
The following config options support this _cmd version:
· sql_alchemy_conn in [core] section
· fernet_key in [core] section
· broker_url in [celery] section
· result_backend in [celery] section
· password in [atlas] section
· smtp_password in [smtp] section
· bind_password in [ldap] section
· git_password in [kubernetes] section
The idea behind this is to not store passwords on boxes in plain text files.
The order of precedence for all config options is as follows:
1. environment variable
2. configuration in airflow.cfg
3. command in airflow.cfg
4. Airflow's built-in defaults
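The AIRFLOW__{SECTION}__{KEY} naming and the precedence order can be sketched as follows. Config-file parsing is simplified to a dict; only the lookup order is the point.

```python
import os

def airflow_env_var(section, key):
    # environment variable name for a config option: AIRFLOW__{SECTION}__{KEY}
    return 'AIRFLOW__{}__{}'.format(section.upper(), key.upper())

def get_option(section, key, file_config):
    # 1. an environment variable wins; 2. fall back to the airflow.cfg contents
    env_name = airflow_env_var(section, key)
    if env_name in os.environ:
        return os.environ[env_name]
    return file_config.get((section, key))

os.environ['AIRFLOW__CORE__SQL_ALCHEMY_CONN'] = 'env_conn_string'
cfg = {('core', 'sql_alchemy_conn'): 'file_conn_string'}
print(get_option('core', 'sql_alchemy_conn', cfg))   # env_conn_string
```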
Initializing a Database Backend
If you want to take a real test drive of Airflow, you should consider setting up a real database backend and switching to the LocalExecutor.
As Airflow was built to interact with its metadata using the great SqlAlchemy library, you should be able to use any database backend supported as a SqlAlchemy backend. We recommend using MySQL or Postgres.
Note
We rely on more strict ANSI SQL settings for MySQL in order to have sane defaults. Make sure to have specified explicit_defaults_for_timestamp=1 in your my.cnf under [mysqld].
Note
If you decide to use Postgres, we recommend using the psycopg2 driver and specifying it in your SqlAlchemy connection string. Also note that since SqlAlchemy does not expose a way to target a specific schema in the Postgres connection URI, you may want to set a default schema for your role with a command similar to ALTER ROLE username SET search_path = airflow, foobar.
Once you've set up your database to host Airflow, you'll need to alter the SqlAlchemy connection string located in your configuration file $AIRFLOW_HOME/airflow.cfg. You should then also change the executor setting to use LocalExecutor, an executor that can parallelize task instances locally.
# initialize the database
airflow initdb
Operators
An operator represents a single, ideally idempotent, task. Operators determine what actually executes when your DAG runs.
See the Operators Concepts documentation and the Operators API Reference for more information.
· BashOperator
o Templating
o Troubleshooting
§ Jinja template not found
· PythonOperator
o Passing in arguments
o Templating
· Google Cloud Platform Operators
o GoogleCloudStorageToBigQueryOperator
o GceInstanceStartOperator
o GceInstanceStopOperator
o GceSetMachineTypeOperator
o GcfFunctionDeleteOperator
§ Troubleshooting
o GcfFunctionDeployOperator
§ Troubleshooting
Airflow Project Examples
Sharing some Airflow how-to examples. All examples have been tested on Airflow 1.7.1.3; they should work on later Airflow versions, but have not been tested there. (NGAP2 uses Airflow 1.7.1.3, hence testing on that version.)
Basic examples
· airflow_example_ui_setting: setting how the DAG looks in the Airflow admin web console
· airflow_example_task_branch: 3 examples of choosing a task branch in one DAG
· airflow_example_python_function: invoking a Python function
· airflow_example_pass_params: passing parameters from the CLI when manually triggering a DAG
· airflow_example_skip_task: skipping tasks in a DAG
· airflow_example_depends_dag_first / airflow_example_depends_dag_second: task dependencies across DAGs
· airflow_example_trigger_controller_dag / airflow_example_trigger_target_dag: triggering a DAG
Setting how the DAG looks in the Airflow admin web console
airflow_example_ui_setting.py
# -*- coding: utf-8 -*-

# Setup how the DAG looks in the Airflow admin web console

from airflow import DAG
from airflow.operators import DummyOperator
from datetime import datetime, timedelta

#
# args
#
days_ago = datetime.combine(datetime.today() - timedelta(1), datetime.min.time())
args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': days_ago,
    'email': [],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
    'max_active_runs': 1,
}

#
# dag
#
dag = DAG('airflow_example_ui_setting', default_args=args)

# Set description for DAG in Markdown format
dag.doc_md = '''
### DAG Documentation
Put summary information here to describe your DAG.
[airflow](https://airflow.apache.org)
'''

#
# task
#
task1 = DummyOperator(task_id='task1', dag=dag)
task2 = DummyOperator(task_id='task2', dag=dag)

# Set description for Task in Markdown format
task1.doc_md = '''
### Task Documentation
Put summary information here to describe your task, which gets
rendered in the UI's Task Instance Details page.
'''

# Set the background color for the task node in Graph View / Tree View
task1.ui_color = '#FF2D00'
# Set the font color for the task node in Graph View
task1.ui_fgcolor = '#003AFF'

#
# Dependency
#
task2.set_upstream(task1)

if __name__ == '__main__':
    dag.cli()
Task branch with a dynamic condition, a parameter passed by CLI, and a parameter from a previous task
airflow_example_task_branch.py
# -*- coding: utf-8 -*-

# Example for task branch with a dynamic condition, a parameter passed by cli, and a parameter from a previous task

from airflow import DAG
from airflow.operators import PythonOperator, DummyOperator, BranchPythonOperator
from airflow.utils.trigger_rule import TriggerRule
from datetime import datetime, timedelta
import logging

#
# args
#
days_ago = datetime.combine(datetime.today() - timedelta(1), datetime.min.time())
args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': days_ago,
    'email': [],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
    'max_active_runs': 1,
}

#
# dag
#
dag = DAG('airflow_example_task_branch', default_args=args)

#
# python functions
#

def branch_example1_fun():
    return 'example1_fork{}'.format(int(datetime.now().strftime('%S')) % 2 + 1)

def branch_example2_fun(**kwargs):
    fork_num = 1
    if (kwargs['dag_run'].conf is not None and kwargs['dag_run'].conf.get('fork_num') is not None):
        fork_num = kwargs['dag_run'].conf.get('fork_num')

    return 'example2_fork{}'.format(fork_num)

# there are 2 ways to transfer a variable via XCom:
# 1. push an XCom with a specific key
# 2. push an XCom just by returning it
def branch_example3_xcom1_fun(**kwargs):
    kwargs['ti'].xcom_push(key='branch_task1', value='example3_fork')

def branch_example3_xcom2_fun(**kwargs):
    return int(datetime.now().strftime('%S')) % 2 + 1

# get variables from both example3_xcom1 and example3_xcom2
# and concatenate the variables as the task name for choosing the branch
def branch_example3_fun(**kwargs):
    ti = kwargs['ti']
    logging.info('')
    logging.info('1. Xcom variable: {} 2. Xcom variable: {}'.format(
        ti.xcom_pull(key=None, task_ids='example3_xcom1'),
        ti.xcom_pull(task_ids='example3_xcom2')
    ))
    logging.info('')

    return '{}{}'.format(ti.xcom_pull(key=None, task_ids='example3_xcom1'), ti.xcom_pull(task_ids='example3_xcom2'))

#
# Dynamically choose the branch according to the seconds
#
example1_start = DummyOperator(task_id='example1_start', dag=dag)
example1_branch = BranchPythonOperator(
    task_id='example1_branch',
    python_callable=branch_example1_fun,
    dag=dag)
example1_fork1 = DummyOperator(task_id='example1_fork1', dag=dag)
example1_fork2 = DummyOperator(task_id='example1_fork2', dag=dag)
# the default is ALL_SUCCESS, which would skip all downstream tasks since one task is skipped,
# so example1_done needs to set trigger_rule to ONE_SUCCESS
example1_done = DummyOperator(task_id='example1_done', trigger_rule=TriggerRule.ONE_SUCCESS, dag=dag)
example1_end = DummyOperator(task_id='example1_end', dag=dag)

example1_start.set_downstream(example1_branch)
example1_branch.set_downstream(example1_fork1)
example1_branch.set_downstream(example1_fork2)
example1_fork1.set_downstream(example1_done)
example1_fork2.set_downstream(example1_done)
example1_done.set_downstream(example1_end)

#
# Default branch can be changed by a parameter passed by cli when manually triggering the DAG
#
# default branch is the fork1 branch; use the following cli to trigger the dag to execute the fork2 branch:
# airflow trigger_dag airflow_example_task_branch -c '{"fork_num": 2}'
example2_start = DummyOperator(task_id='example2_start', dag=dag)
example2_branch = BranchPythonOperator(
    task_id='example2_branch',
    provide_context=True,
    python_callable=branch_example2_fun,
    dag=dag)
example2_fork1 = DummyOperator(task_id='example2_fork1', dag=dag)
example2_fork2 = DummyOperator(task_id='example2_fork2', dag=dag)
example2_done = DummyOperator(task_id='example2_done', trigger_rule=TriggerRule.ONE_SUCCESS, dag=dag)
example2_end = DummyOperator(task_id='example2_end', dag=dag)

example2_start.set_downstream(example2_branch)
example2_branch.set_downstream(example2_fork1)
example2_branch.set_downstream(example2_fork2)
example2_fork1.set_downstream(example2_done)
example2_fork2.set_downstream(example2_done)
example2_done.set_downstream(example2_end)

#
# Dynamically choose the branch based on the variables transferred via XComs
#
example3_start = DummyOperator(task_id='example3_start', dag=dag)
example3_xcom1 = PythonOperator(
    task_id='example3_xcom1',
    provide_context=True,
    python_callable=branch_example3_xcom1_fun,
    dag=dag)
example3_xcom2 = PythonOperator(
    task_id='example3_xcom2',
    provide_context=True,
    python_callable=branch_example3_xcom2_fun,
    dag=dag)
example3_branch = BranchPythonOperator(
    task_id='example3_branch',
    provide_context=True,
    python_callable=branch_example3_fun,
    dag=dag)
example3_fork1 = DummyOperator(task_id='example3_fork1', dag=dag)
example3_fork2 = DummyOperator(task_id='example3_fork2', dag=dag)
example3_done = DummyOperator(task_id='example3_done', trigger_rule=TriggerRule.ONE_SUCCESS, dag=dag)
example3_end = DummyOperator(task_id='example3_end', dag=dag)

example3_start.set_downstream(example3_xcom1)
example3_start.set_downstream(example3_xcom2)
example3_xcom1.set_downstream(example3_branch)
example3_xcom2.set_downstream(example3_branch)
example3_branch.set_downstream(example3_fork1)
example3_branch.set_downstream(example3_fork2)
example3_fork1.set_downstream(example3_done)
example3_fork2.set_downstream(example3_done)
example3_done.set_downstream(example3_end)

if __name__ == '__main__':
    dag.cli()
Invoking a Python function
airflow_example_python_function.py
# -*- coding: utf-8 -*-

# Example for how to invoke a python function

from airflow import DAG
from airflow.operators import PythonOperator
from datetime import datetime, timedelta
import logging

#
# args
#
days_ago = datetime.combine(datetime.today() - timedelta(1), datetime.min.time())
args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': days_ago,
    'email': [],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
    'max_active_runs': 1,
}

#
# dag
#
dag = DAG('airflow_example_python_function', default_args=args)

#
# python function
#
def print_context(*args, **kwargs):

    # check if args has 2 parameters
    args = args if len(args) >= 2 else [None] * 2

    # write log information to the Airflow logs
    logging.info('')
    logging.info('Airflow Task Context: {}'.format(kwargs))
    logging.info('Airflow PythonOperator op_args Parameters: args[0]={} args[1]={}'.format(args[0], args[1]))
    logging.info('Airflow PythonOperator op_kwargs Parameters: param1={} param2={}'.format(kwargs.get('param1'), kwargs.get('param2')))
    logging.info('')

    return 'This will be written into the airflow log'

#
# task
#
# python callable without context
task1 = PythonOperator(
    task_id='python_without_context',
    provide_context=False,
    python_callable=print_context,
    dag=dag)

# python callable with context
task2 = PythonOperator(
    task_id='python_with_context',
    provide_context=True,
    python_callable=print_context,
    dag=dag)

# python callable passing parameters
task3 = PythonOperator(
    task_id='python_pass_params',
    provide_context=True,
    python_callable=print_context,
    op_args=['args value1', datetime.now()],
    op_kwargs={'param1': 'kwargs value1', 'param2': datetime.now()},
    dag=dag)

#
# Dependency
#
task2.set_upstream(task1)

if __name__ == '__main__':
    dag.cli()
Passing parameters from the CLI when manually triggering a DAG
airflow_example_pass_params.py
# -*- coding: utf-8 -*-

# This is the example of passing parameters from the CLI when manually triggering the DAG
# The following is the Airflow CLI:

# airflow trigger_dag airflow_example_pass_params -c '{"message": "manual value"}'

from airflow import DAG
from airflow.operators import BashOperator, PythonOperator
from datetime import datetime, timedelta
import logging

#
# args
#
days_ago = datetime.combine(datetime.today() - timedelta(1), datetime.min.time())
args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': days_ago,
    'email': [],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
    'max_active_runs': 1,
}

#
# dag
#
dag = DAG('airflow_example_pass_params', default_args=args)

#
# python function
#
def print_context(**kwargs):
    msg = None
    if (kwargs['dag_run'].conf is not None):
        msg = kwargs['dag_run'].conf.get('message')

    logging.info('Messages from CLI: {}'.format(msg if msg is not None else 'default value'))

#
# task
#
task1 = PythonOperator(
    task_id='python_task',
    provide_context=True,
    python_callable=print_context,
    dag=dag)

task2 = BashOperator(
    task_id='bash_task',
    bash_command='echo "Messages from CLI: {{ dag_run.conf["message"] if dag_run and dag_run.conf["message"] else "default value" }}"',
    dag=dag)

#
# Dependency
#
task2.set_upstream(task1)

if __name__ == '__main__':
    dag.cli()
Skipping tasks in a DAG
airflow_example_skip_task.py
# -*- coding: utf-8 -*-

# Example for skipping tasks

from airflow import DAG
from airflow.operators import PythonOperator, DummyOperator, BranchPythonOperator, ShortCircuitOperator
from airflow.utils.trigger_rule import TriggerRule
from airflow.exceptions import AirflowSkipException
from datetime import datetime, timedelta

#
# args
#
days_ago = datetime.combine(datetime.today() - timedelta(1), datetime.min.time())
args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': days_ago,
    'email': [],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
    'max_active_runs': 1,
}

#
# dag
#
dag = DAG('airflow_example_skip_task', default_args=args)

#
# python function
#
def shortcircuit_example1_fun():
    '''
    return 0 (False) or 1 (True) according to the seconds of the execution time
    '''
    return bool(int(datetime.now().strftime('%S')) % 2)

def branch_example2_fun(**kwargs):
    '''
    if the dag is triggered by the scheduler, or manually triggered without the
    parameter fork_num, the default task is example2_fork1
    '''
    fork_num = 1
    if kwargs['dag_run'].conf is not None and kwargs['dag_run'].conf.get('fork_num') is not None:
        fork_num = kwargs['dag_run'].conf.get('fork_num')

    return 'example2_fork{}'.format(fork_num)

def exception_example3_fun():
    '''
    manually raise a skip exception
    '''
    raise AirflowSkipException('Airflow skip exception is raised')

#
# Use ShortCircuitOperator to skip a task and all its downstream tasks
#
example1_start = DummyOperator(task_id='example1_start', dag=dag)
example1_branch = DummyOperator(task_id='example1_branch', dag=dag)
example1_fork1 = DummyOperator(task_id='example1_fork1', dag=dag)
example1_fork2_shortcircuit = ShortCircuitOperator(task_id='example1_fork2_shortcircuit',
                                                   python_callable=shortcircuit_example1_fun,
                                                   dag=dag)
example1_fork2 = DummyOperator(task_id='example1_fork2', dag=dag)
example1_done = DummyOperator(task_id='example1_done', trigger_rule=TriggerRule.ALL_DONE, dag=dag)
example1_end = DummyOperator(task_id='example1_end', dag=dag)

example1_start.set_downstream(example1_branch)
example1_branch.set_downstream(example1_fork1)
example1_branch.set_downstream(example1_fork2_shortcircuit)
example1_fork2_shortcircuit.set_downstream(example1_fork2)
example1_fork1.set_downstream(example1_done)
example1_fork2.set_downstream(example1_done)
example1_done.set_downstream(example1_end)

#
# The default branch can be changed via a parameter passed from the CLI when
# manually triggering the DAG
#
# The default branch is fork1; use the following CLI command to trigger the dag
# to execute the fork2 branch, so you can manually choose which task to skip:
# airflow trigger_dag airflow_example_task_branch -c '{"fork_num": 2}'
example2_start = DummyOperator(task_id='example2_start', dag=dag)
example2_branch = BranchPythonOperator(task_id='example2_branch',
                                       provide_context=True,
                                       python_callable=branch_example2_fun,
                                       dag=dag)
example2_fork1 = DummyOperator(task_id='example2_fork1', dag=dag)
example2_fork2 = DummyOperator(task_id='example2_fork2', dag=dag)
example2_done = DummyOperator(task_id='example2_done', trigger_rule=TriggerRule.ALL_DONE, dag=dag)
example2_end = DummyOperator(task_id='example2_end', dag=dag)

example2_start.set_downstream(example2_branch)
example2_branch.set_downstream(example2_fork1)
example2_branch.set_downstream(example2_fork2)
example2_fork1.set_downstream(example2_done)
example2_fork2.set_downstream(example2_done)
example2_done.set_downstream(example2_end)

#
# Raise an exception to skip a task
#
example3_start = DummyOperator(task_id='example3_start', dag=dag)
example3_branch = DummyOperator(task_id='example3_branch', dag=dag)
example3_fork1 = DummyOperator(task_id='example3_fork1', dag=dag)
example3_fork2_exception = ShortCircuitOperator(task_id='example3_fork2_exception',
                                                python_callable=exception_example3_fun,
                                                dag=dag)
example3_fork2 = DummyOperator(task_id='example3_fork2', dag=dag)
example3_done = DummyOperator(task_id='example3_done', trigger_rule=TriggerRule.ALL_DONE, dag=dag)
example3_end = DummyOperator(task_id='example3_end', dag=dag)

example3_start.set_downstream(example3_branch)
example3_branch.set_downstream(example3_fork1)
example3_branch.set_downstream(example3_fork2_exception)
example3_fork2_exception.set_downstream(example3_fork2)
example3_fork1.set_downstream(example3_done)
example3_fork2.set_downstream(example3_done)
example3_done.set_downstream(example3_end)

if __name__ == '__main__':
    dag.cli()
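The short-circuit callable above continues or skips based on the parity of the current second; the same check can be exercised in plain Python (a stand-alone rewrite of shortcircuit_example1_fun, without the Airflow runtime):

```python
from datetime import datetime

def should_continue(now=None):
    # True on odd seconds, False on even seconds; ShortCircuitOperator
    # skips all downstream tasks when its callable returns False
    now = now or datetime.now()
    return bool(int(now.strftime('%S')) % 2)

print(should_continue(datetime(2021, 7, 8, 12, 0, 3)))  # odd second -> True
```

Passing a fixed datetime makes the branch decision reproducible for testing, which the wall-clock version in the DAG is not.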
Dependencies between tasks in different DAGs
airflow_example_depends_dag_first.py
# -*- coding: utf-8 -*-

# Example for a task dependency that is not in the same DAG
# airflow_example_depends_dag_first -> airflow_example_depends_dag_second
# airflow_example_depends_dag_second will depend on bash_task in airflow_example_depends_dag_first

from airflow import DAG
from airflow.operators import DummyOperator, BashOperator
from datetime import datetime, timedelta

#
# args
#
days_ago = datetime.combine(datetime.today() - timedelta(1), datetime.min.time())
args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': days_ago,
    'email': [],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
    'max_active_runs': 1,
}

#
# dag
#
dag = DAG('airflow_example_depends_dag_first', default_args=args)

#
# task
#
start = DummyOperator(task_id='start', dag=dag)
bash_task = BashOperator(task_id='bash_task',
                         bash_command='sleep 300',
                         dag=dag)
end = DummyOperator(task_id='end', dag=dag)

#
# Dependency
#
start.set_downstream(bash_task)
bash_task.set_downstream(end)

if __name__ == '__main__':
    dag.cli()
airflow_example_depends_dag_second.py
# -*- coding: utf-8 -*-

# Example for a task dependency that is not in the same DAG
# airflow_example_depends_dag_first -> airflow_example_depends_dag_second
# airflow_example_depends_dag_second will depend on bash_task in airflow_example_depends_dag_first

# Note: ExternalTaskSensor assumes that you are dependent on a task in a dag run
# with the same execution date, or you can set up an execution offset via either
# execution_delta or execution_date_fn

from airflow import DAG
from airflow.operators import DummyOperator, BashOperator
from airflow.operators import ExternalTaskSensor
from datetime import datetime, timedelta

#
# args
#
days_ago = datetime.combine(datetime.today() - timedelta(1), datetime.min.time())
args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': days_ago,
    'email': [],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
    'max_active_runs': 1,
}

#
# dag
#
dag = DAG('airflow_example_depends_dag_second', default_args=args)

#
# python function
#
def exec_time():
    '''
    return a datetime.timedelta object which determines the execution time
    '''
    return ''

#
# task
#
start = DummyOperator(task_id='start', dag=dag)
sensor = ExternalTaskSensor(task_id='sensor',
                            external_dag_id='airflow_example_depends_dag_first',
                            external_task_id='bash_task',
                            dag=dag)
task = DummyOperator(task_id='task', dag=dag)
end = DummyOperator(task_id='end', dag=dag)

#
# Dependency
#
start.set_downstream(sensor)
sensor.set_downstream(task)
task.set_downstream(end)

if __name__ == '__main__':
    dag.cli()
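Per the note at the top of this script, ExternalTaskSensor waits for the upstream task instance at the same execution date unless an offset is given. The date lookup can be sketched as follows (a simplified stand-in for the sensor's execution_delta semantics, not Airflow's actual implementation):

```python
from datetime import datetime, timedelta

def upstream_execution_date(execution_date, execution_delta=timedelta(0)):
    # The sensor polls for the upstream task instance whose execution
    # date is the current one shifted back by execution_delta
    return execution_date - execution_delta

# e.g. the second DAG runs one hour after the first each day
print(upstream_execution_date(datetime(2021, 7, 8, 1), timedelta(hours=1)))
```

If the two DAGs' schedules are not aligned this way, the sensor never finds a matching run and times out, which is the most common failure mode with this operator.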
Triggering a DAG
airflow_example_trigger_controller_dag.py
# -*- coding: utf-8 -*-

# Example for triggering a DAG
# DAG info is read from Airflow Variables

'''
This example illustrates the use of the TriggerDagRunOperator. There are 2
entities at work in this scenario:
1. The Controller DAG - the DAG that conditionally executes the trigger
2. The Target DAG - the DAG being triggered (in example_trigger_target_dag.py)

A TriggerDagRunOperator will be generated dynamically according to the
Airflow Variable TRIGGER_DAGS. TRIGGER_DAGS stores the dag_ids that need to be triggered.
'''

from airflow import DAG
from airflow.operators import DummyOperator, PythonOperator, TriggerDagRunOperator
from airflow.models import Variable
from datetime import datetime, timedelta
import logging

# init the Airflow Variables
# for a prod env you should set up the Variable outside the DAG script in advance
Variable.set('TRIGGER_DAGS', ['airflow_example_trigger_target_dag'], serialize_json=True)

#
# args
#
days_ago = datetime.combine(datetime.today() - timedelta(1), datetime.min.time())
args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': days_ago,
    'email': [],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
    'max_active_runs': 1,
}

#
# dag
#
dag = DAG('airflow_example_trigger_controller_dag', default_args=args)

#
# python function
#
def gen_trigger_dag_operator(upstream_task):
    '''generate TriggerDagRunOperators according to Variable TRIGGER_DAGS'''
    # get Variable TRIGGER_DAGS
    trigger_dags = Variable.get('TRIGGER_DAGS', deserialize_json=True) or []
    for dag_id in trigger_dags:
        task = TriggerDagRunOperator(task_id='trigger_{}'.format(dag_id),
                                     trigger_dag_id=dag_id,
                                     python_callable=lambda ctx, dag_run_obj: dag_run_obj,
                                     dag=dag)
        task.set_upstream(upstream_task)

#
# Task
#
start = DummyOperator(task_id='start', dag=dag)
gen_trigger_dag_operator(start)

if __name__ == '__main__':
    dag.cli()
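Variable.set(..., serialize_json=True) above stores the Python list as a JSON string in the metadata database, and deserialize_json=True reverses it when reading. The round trip is essentially (a plain-Python sketch of that behaviour, without the database):

```python
import json

# what serialize_json=True writes to the metadata store
stored = json.dumps(['airflow_example_trigger_target_dag'])

# what deserialize_json=True reads back as a Python object
restored = json.loads(stored)

print(restored)
```

Without these flags the Variable is stored and returned as a raw string, which is why the `or []` guard in gen_trigger_dag_operator is a sensible default.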
airflow_example_trigger_target_dag.py
# -*- coding: utf-8 -*-

# Example for triggering a DAG
# DAG info is read from Airflow Variables

'''
This example illustrates the use of the TriggerDagRunOperator. There are 2
entities at work in this scenario:
1. The Controller DAG - the DAG that conditionally executes the trigger
2. The Target DAG - the DAG being triggered (in example_trigger_target_dag.py)

A TriggerDagRunOperator will be generated dynamically according to the
Airflow Variable TRIGGER_DAGS. TRIGGER_DAGS stores the dag_ids that need to be triggered.
'''

from airflow.models import DAG
from airflow.operators import DummyOperator
from datetime import datetime, timedelta
import logging

#
# args
#
days_ago = datetime.combine(datetime.today() - timedelta(1), datetime.min.time())
args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': days_ago,
    'email': [],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
    'max_active_runs': 1,
}

#
# dag
#
dag = DAG('airflow_example_trigger_target_dag',
          default_args=args,
          schedule_interval=None)

#
# Task
#
run_this = DummyOperator(task_id='run_this', dag=dag)

if __name__ == '__main__':
    dag.cli()
Advanced examples
· airflow_example_dag.py - read the metadata of a specified DAG
· airflow_example_dagbag.py - obtain DAG objects through a DagBag
· airflow_example_user_defined_macros.py - define macros when creating a DAG
· airflow_example_inherit_operator.py - create a new operator by inheriting from an existing operator
Reading the metadata of a specified DAG
airflow_example_dag.py
# -*- coding: utf-8 -*-

# Example of DAG metadata

from airflow import DAG
from airflow.operators import PythonOperator
from airflow.models import DagBag
from datetime import datetime, timedelta

#
# args
#
days_ago = datetime.combine(datetime.today() - timedelta(1), datetime.min.time())
args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': days_ago,
    'email': [],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
    'max_active_runs': 1,
}

#
# dag
#
dag = DAG('airflow_example_dag', default_args=args)

#
# python function
#
def dag_fun(**kwargs):
    dagbag = DagBag()

    # get a specific DAG object
    dag_id = 'airflow_example_dagbag'
    dag = dagbag.dags.get(dag_id)

    dagbag.logger.info('')
    dagbag.logger.info(kwargs)
    # Get all Tasks from the DAG object
    # task objects
    dagbag.logger.info('task objects: {}'.format(dag.tasks))
    # task_ids
    dagbag.logger.info('task ids: {}'.format(dag.task_ids))

    # Get execution date information
    # Returns the latest date for which at least one task instance exists
    dagbag.logger.info('latest execution date: {}'.format(dag.latest_execution_date))

    # Get all DagRun objects
    dagbag.logger.info('')

#
# task
#
python_task = PythonOperator(task_id='python_task',
                             provide_context=True,
                             python_callable=dag_fun,
                             dag=dag)

if __name__ == '__main__':
    dag.cli()

Obtaining DAG objects through a DagBag
airflow_example_dagbag.py
# -*- coding: utf-8 -*-

# Example of DagBag

from airflow import DAG
from airflow.operators import PythonOperator
from airflow.models import DagBag
from datetime import datetime, timedelta

#
# args
#
days_ago = datetime.combine(datetime.today() - timedelta(1), datetime.min.time())
args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': days_ago,
    'email': [],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
    'max_active_runs': 1,
}

#
# dag
#
dag = DAG('airflow_example_dagbag', default_args=args)

#
# python function
#
def dagbag_fun(**kwargs):
    dagbag = DagBag()

    dagbag.logger.info('')
    dagbag.logger.info(kwargs)
    # Get all DAGs from the DagBag object
    for key, val in dagbag.dags.items():
        dagbag.logger.info('{}: {}'.format(key, val.task_ids))
    dagbag.logger.info('')

#
# task
#
python_task = PythonOperator(task_id='python_task',
                             provide_context=True,
                             python_callable=dagbag_fun,
                             dag=dag)

if __name__ == '__main__':
    dag.cli()
Defining macros when creating a DAG
airflow_example_user_defined_macros.py
# -*- coding: utf-8 -*-

# Example of user defined macros

from airflow import DAG
from airflow.operators import BashOperator
from datetime import datetime, timedelta

#
# args
#
days_ago = datetime.combine(datetime.today() - timedelta(1), datetime.min.time())
args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': days_ago,
    'email': [],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
    'max_active_runs': 1,
}

#
# python function
#
def user_defined_fun(ds):
    '''
    get the date that is 30 days before the current execution_date
    '''
    ds = datetime.strptime(ds, '%Y-%m-%d')
    ds = ds - timedelta(30)
    return ds.isoformat()[:10]

#
# dag
#
dag = DAG('airflow_example_user_defined_macros',
          default_args=args,
          user_defined_macros={
              'user_defined_fun': user_defined_fun,
          })

#
# task
#
bash_task = BashOperator(task_id='bash_task',
                         bash_command='echo {{ ds }} {{ user_defined_fun(ds) }}',
                         dag=dag)

if __name__ == '__main__':
    dag.cli()
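The user_defined_fun macro above is plain date arithmetic and can be exercised outside Airflow (assuming ds arrives in Airflow's default '%Y-%m-%d' form):

```python
from datetime import datetime, timedelta

def thirty_days_before(ds):
    # mirror of user_defined_fun: parse the execution date string,
    # step back 30 days, and return the date part in ISO form
    d = datetime.strptime(ds, '%Y-%m-%d') - timedelta(30)
    return d.isoformat()[:10]

print(thirty_days_before('2021-07-08'))  # -> 2021-06-08
```

Registering such a function in user_defined_macros simply makes it callable inside any Jinja template of that DAG.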
Creating a new operator by inheriting from an existing operator
airflow_example_inherit_operator.py
# -*- coding: utf-8 -*-

# Example of an inherited operator

from airflow import DAG
from airflow.operators import BashOperator
from airflow.models import BaseOperator
from datetime import datetime, timedelta
import logging

#
# args
#
days_ago = datetime.combine(datetime.today() - timedelta(1), datetime.min.time())
args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': days_ago,
    'email': [],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
    'max_active_runs': 1,
}

class InheritBaseOperator(BaseOperator):
    '''
    Inherit from BaseOperator to create a user's own new Operator;
    this is only a demonstration - the new Operator just records
    context info into the log
    '''

    def execute(self, context):
        logging.info('')
        logging.info('This is a user defined Operator; the following is info from the context')
        logging.info(context)
        logging.info('')


class InheritBashOperator(BashOperator):
    '''
    Inherit from BashOperator, overwriting the method render_template to
    set a new variable user_defined_ds, which is the date 30 days before
    the current execution_date
    '''
    def render_template(self, attr, content, context):
        execution_date = context['execution_date']
        context['user_defined_ds'] = execution_date - timedelta(30)

        return super(InheritBashOperator, self).render_template(attr, content, context)

#
# dag
#
dag = DAG('airflow_example_inherit_operator', default_args=args)

#
# task
#
new_op = InheritBaseOperator(task_id='new_op_task', dag=dag)
bash_task = InheritBashOperator(task_id='bash_task',
                                bash_command='echo {{ user_defined_ds }}',
                                dag=dag)

if __name__ == '__main__':
    dag.cli()
Complete example
baozun_network_inventory
httpairflowchinabiengineeringe1ngap2nikecom8080adminairflowgraphroot&dag_idbaozun_network_inventory




PySpark code
# -*- encoding: utf-8 -*-


from airflow import DAG
from airflow.models import Variable
from airflow.operators import (
    BatchStartOperator,
    BatchEndOperator,
    EmrOperator,
    GenieSparkOperator,
    PythonOperator,
    BashOperator,
    GenieSqoopOperator,
    SlackOperator,
    SnowFlakeOperator,
)
from batch_common import DMError, BICommon
from datetime import datetime, timedelta
import ast


#
# VARIABLES
#
# system variables
profile = BICommon().get_profile()
job_user = profile['JOB_USER']
env = profile['ENV']
DEFAULT_QUEUE = 'airflow'

# project variables
env_config = ast.literal_eval(Variable.get('se_project_config'))
CLUSTER_NAME = env_config.get('CLUSTER_NAME')
S3_BUCKET = env_config.get('S3_BUCKET')
SCRIPT_BUCKET = env_config.get('SCRIPT_BUCKET')
DML_PATH = env_config.get('DML_PATH')
schedule_interval = env_config.get('SCHEDULE_INTERVAL_BAOZUN_NETWORK_INVENTORY')

# code path
BAOZUN_NETWORK_INVENTORY_SCRIPT = env_config.get('BAOZUN_NETWORK_INVENTORY_SCRIPT')
ONETIME_TABLE_SCRIPT = env_config.get('ONETIME_TABLE_SCRIPT')

#
# dag
#
dag_id = 'baozun_network_inventory'

default_args = {
    'owner': 'Vivian Zhao',
    'start_date': datetime(2019, 10, 10),
    'email': ['VivianZhao2@nike.com'],
    'email_on_failure': True,
    'email_on_retry': False,
    'retries': 3,
    'retry_delay': timedelta(seconds=60),
    'queue': DEFAULT_QUEUE,
    'wait_for_downstream': False,
}


dag = DAG(dag_id,
          default_args=default_args,
          schedule_interval=schedule_interval)


#
# tasks
#
start = BatchStartOperator(dag=dag)

spark_s3_cleansed = GenieSparkOperator(task_id='onetime_tables_cleansed',
                                       command='{onetime_table_script}'.format(onetime_table_script=ONETIME_TABLE_SCRIPT),
                                       queue=DEFAULT_QUEUE,
                                       sched_type=CLUSTER_NAME,
                                       dag=dag)


# semantic
spark_s3_semantic = GenieSparkOperator(task_id='baozun_network_inventory_semantic',
                                       command='{sementic_script}'.format(sementic_script=BAOZUN_NETWORK_INVENTORY_SCRIPT),
                                       queue=DEFAULT_QUEUE,
                                       sched_type=CLUSTER_NAME,
                                       dag=dag)


end = BatchEndOperator(dag=dag)


#
# snowflake operator
#
load_snowflake = SnowFlakeOperator(task_id='load_snowflake',
                                   sql_file='{dml_path}BAOZUN_NETWORK_INVENTORY.sql'.format(dml_path=DML_PATH),
                                   conn_id='snowflake',
                                   dag=dag)

#
# dependency
#
start.set_downstream(spark_s3_cleansed)
spark_s3_cleansed.set_downstream(spark_s3_semantic)
spark_s3_semantic.set_downstream(load_snowflake)
load_snowflake.set_downstream(end)

BashOperator
BashOperator executes a command in a Bash shell.
run_this = BashOperator(
    task_id='run_after_loop', bash_command='echo 1', dag=dag)
Templating
You can use Jinja templates to parameterize the bash_command argument:
task = BashOperator(
    task_id='also_run_this',
    bash_command='echo "run_id={{ run_id }} | dag_run={{ dag_run }}"',
    dag=dag)
Troubleshooting
Jinja template not found
Add a space after the script name when directly calling a Bash script with the bash_command argument. This is because Airflow tries to apply a Jinja template to it, which will fail.
t2 = BashOperator(
    task_id='bash_example',

    # This fails with `Jinja template not found` error
    # bash_command="/home/batcher/test.sh",

    # This works (has a space after)
    bash_command="/home/batcher/test.sh ",
    dag=dag)
EMR Operator
Operator to spin up or terminate an EMR cluster.

Parameters
· cluster_action (Required) – Must be either 'spinup' or 'terminate'
· cluster_name (Required) – Specify any name you like; just be advised that if you have more than one DAG using the same cluster_name, they can submit their jobs to each other's clusters. The cluster name is limited to fewer than 55 characters; please make sure not to exceed the limit (Genie routing is by cluster_name)
· custom_dns (Optional) – Field used for creating Route53 DNS entries
· num_core_nodes (Required) – Specify the number of EMR core VM nodes to spin up. Please use appropriate discretion and do not allocate more cores than necessary
· classification (Required) – Data security level; choose from bronze, silver, gold, platinum
· project_id (Required) – Enter your Relay code. Currently the selection is FYSHAREDBI, FY150038CK, FY150062DEF, FY12126320, Acxiom, GCD, FY12126395RDF, FY12126348CBIT3170113
· cost_center (Required) – Enter the cost center number for your project
· num_task_nodes (Optional) – Specify a number of EMR task VM nodes to request using EC2 Spot Instances. Spot Instances may or may not be available to meet the request
· master_inst_type (Optional) – EC2 type for master instances. Allowed values are m3.2xlarge, r3.2xlarge, r3.4xlarge, r3.8xlarge. If you omit this parameter, the first value (m3.2xlarge) is used
· core_inst_type (Optional) – EC2 type for core instances. Allowed values are m3.2xlarge, r3.2xlarge, r3.4xlarge. If you omit this parameter, the first value (m3.2xlarge) is used
· task_inst_type (Optional) – EC2 type for task instances. Allowed values are m3.2xlarge, r3.2xlarge, r3.4xlarge. If you omit this parameter, core_inst_type or the first value (m3.2xlarge) is used
· core_bid_type (Not used) – Bid type for core instances; can be SPOT or ON_DEMAND. For NGAP2 it currently defaults to ON_DEMAND
· task_bid_type (Not used) – Bid type for task instances; can be SPOT or ON_DEMAND. For NGAP2 it currently defaults to SPOT
· emr_version (Required) – Version of EMR used to spin up the cluster; values are 5.7.0, 5.9.0, 5.11.0, 5.13.0
· applications (Optional) – List of applications to be installed on the EMR cluster. By default hive, pig and spark will be installed. Based on the EMR version, the user can give a list of applications; if a given application is not in the EMR version specified, the EMR spin-up will fail. Also, if sqoop is selected as an application, only hive will be installed with it; no other application is allowed with sqoop
· properties (Optional) – List of configuration properties to pass to configure each instance group; see Note
· bootstrap_actions (Optional) – Bootstrap scripts to run at EMR node spin-up time; see Note
· long_running_cluster (Optional) – True or False; default is False
· auto_scaling – To use the Auto Scaling feature in EMR, set auto_scaling to True; by default it is False
    a. core_max – Maximum limit for core nodes to scale out
    b. core_min – Minimum limit for core nodes to scale in
    c. task_max – Maximum limit for task nodes to scale out
    d. task_min – Minimum limit for task nodes to scale in
    e. core_scale_up – Number of core nodes to scale out each time; default value is 1
    f. core_scale_down – Number of core nodes to scale in each time; default value is 1
    g. task_scale_up – Number of task nodes to scale out each time; default value is 2
    h. task_scale_down – Number of task nodes to scale in each time; default value is 2
· email (Required) – Specify one or more email addresses to receive alerts in case of error. The first email address will be added as an EC2 tag on all EMR nodes for auditing purposes
· on_failure_callback (Optional) – Specify a function that terminates the EMR cluster in case of error (otherwise the cluster will keep running and costing you money)
· queue (Required) – Always use 'airflow'
· is_instance_fleet (Optional, boolean) – A boolean used to specify whether you are using instance fleets
· instance_fleets (Required when using instance fleets; list) – A list of InstanceFleet objects; refer to the Instance Fleet documentation. Example below
    1. instance_fleet_type (str) – The node type of the instance fleet [optional]
    2. target_on_demand_capacity (int) – The target capacity of On-Demand units for the instance fleet [optional]
    3. target_spot_capacity (int) – The target capacity of Spot units for the instance fleet [optional]
    4. instance_type_configs (InstanceTypeConfig) – The InstanceTypeConfigs for EMR instance fleets [optional]
    5. launch_specifications (LaunchSpecification) – An empty wrapper object around the spot spec, assuming AWS will add props later [optional]
· cerebro_cdas (Optional) – Cerebro field to use both the metastore and the Cerebro data access service
· cerebro_hms (Optional) – Cerebro field to use only the Cerebro metastore
· tags (Optional) – Use this option to add custom tags to the EMR cluster. For enabling Cerebro using tags, please see the description below



NOTE

1. If sqoop is selected as an application, only hive will be installed on the EMR cluster along with it. Sqoop and other applications like spark or zeppelin cannot be in the same EMR cluster.
2. Properties for setting up the EMRFS consistent view:
· properties=[{"Classification": "emrfs-site", "Properties": {"fs.s3.consistent": "true"}}]
3. By default speculative execution is false; to set it to true:
· properties=[{"Classification": "mapred-site", "Properties": {"mapreduce.map.speculative": "true", "mapreduce.reduce.speculative": "true"}}, {"Classification": "hive-site", "Properties": {"hive.mapred.reduce.tasks.speculative.execution": "true"}}]
4. To add a bootstrap action to the EMR spin-up, here is the format:
bootstrap_actions=[{"Name": <name>,
                    "ScriptBootstrapAction": {
                        "Path": <path>,
                        "Args": [<arguments>]
                    }}]
e.g.:
bootstrap_actions=[{"Name": "EMR Bootstrap smoke test",
                    "ScriptBootstrapAction": {
                        "Path": "s3://nike-emr-bin/emr_bootstrap_hello2.sh",
                        "Args": []
                    }}]


Example:
To spin up a cluster:
task1 = EmrOperator(
    task_id='EmrSpinup',
    cluster_action='spinup',
    cluster_name='CLUSTER_NAME',
    custom_dns='True',
    num_task_nodes=1,
    num_core_nodes=1,
    classification='bronze',
    project_id='PROJECT_ID',
    queue='airflow',
    cost_center='COST_CENTER',
    emr_version='5.11.0',
    master_inst_type='m3.xlarge',  # optional
    applications=['hive', 'pig', 'spark'],
    auto_scaling=True,  # optional; use it only if autoscaling is required
    core_max=2,  # optional; required only if autoscaling is true
    core_min=1,  # optional; required only if autoscaling is true
    task_max=2,  # optional; required only if autoscaling is true
    task_min=1,  # optional; required only if autoscaling is true
    core_scale_up=1,  # optional
    core_scale_down=1,  # optional
    task_scale_up=1,  # optional
    task_scale_down=1,  # optional
    dag=dag
)
To terminate a cluster:
task6 = EmrOperator(
    task_id='EmrTerminate',
    cluster_action='terminate',
    cluster_name='CLUSTER_NAME',
    queue='airflow',
    dag=dag
)
Instance fleet example


from airflow import DAG
from plugins.emr_plugin import EmrOperator
from datetime import datetime
from datetime import timedelta
from emr_client.models.instance_fleet import InstanceFleet
from emr_client.models.instance_type_config import InstanceTypeConfig
from emr_client.models.launch_specification import LaunchSpecification
from emr_client.models.spot_specification import SpotSpecification


CLUSTER_NAME = 'instance_fleet_spot_master_mixed_core_mixed_task'

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(20xx, x, xx),
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
    'queue': 'airflow',
}

instance_fleets = [
    InstanceFleet(
        instance_fleet_type='MASTER',
        target_spot_capacity=1,
        instance_type_configs=[
            InstanceTypeConfig(
                bid_price_as_percentage_of_on_demand_price=100,
                instance_type='m3.xlarge',
                weighted_capacity=1
            )
        ],
        launch_specifications=LaunchSpecification(
            spot_specification=SpotSpecification(
                block_duration_minutes=60,
                timeout_action='TERMINATE_CLUSTER',
                timeout_duration_minutes=5
            )
        )
    ),
    InstanceFleet(
        instance_fleet_type='CORE',
        target_on_demand_capacity=1,
        target_spot_capacity=3,
        instance_type_configs=[
            InstanceTypeConfig(
                bid_price_as_percentage_of_on_demand_price=100,
                instance_type='m3.xlarge',
                weighted_capacity=1
            ),
            InstanceTypeConfig(
                bid_price_as_percentage_of_on_demand_price=100,
                instance_type='m3.2xlarge',
                weighted_capacity=2
            )
        ],
        launch_specifications=LaunchSpecification(
            spot_specification=SpotSpecification(
                block_duration_minutes=60,
                timeout_action='TERMINATE_CLUSTER',
                timeout_duration_minutes=5
            )
        )
    ),
    InstanceFleet(
        instance_fleet_type='TASK',
        target_on_demand_capacity=1,
        target_spot_capacity=3,
        instance_type_configs=[
            InstanceTypeConfig(
                bid_price_as_percentage_of_on_demand_price=100,
                instance_type='m3.xlarge',
                weighted_capacity=1
            ),
            InstanceTypeConfig(
                bid_price_as_percentage_of_on_demand_price=100,
                instance_type='m3.2xlarge',
                weighted_capacity=2
            )
        ],
        launch_specifications=LaunchSpecification(
            spot_specification=SpotSpecification(
                block_duration_minutes=60,
                timeout_action='TERMINATE_CLUSTER',
                timeout_duration_minutes=5
            )
        )
    )
]

dag = DAG(
    'test_instance_fleet',
    default_args=default_args,
    schedule_interval='0 0 * * *'
)

task1 = EmrOperator(
    task_id='EmrSpinup',
    cluster_action='spinup',
    cluster_name=CLUSTER_NAME,
    custom_dns='True',
    num_task_nodes=1,
    num_core_nodes=1,
    classification='bronze',
    project_id='PROJECT_ID',
    queue='airflow',
    cost_center='COST_CENTER',
    emr_version='5.3.0',
    master_inst_type='m3.xlarge',  # optional
    applications=['hive', 'pig', 'spark'],
    is_instance_fleet=True,
    instance_fleets=instance_fleets,
    dag=dag
)

task2 = EmrOperator(
    task_id='EmrTerminate',
    cluster_action='terminate',
    cluster_name=CLUSTER_NAME,
    queue='airflow',
    dag=dag
)

task2.set_upstream(task1)

Cerebro Changes for EMR Operator
We can enable role-based read access through Cerebro/Okera; to do so we need to pass the cerebro_cdas flag to EmrOperator.
By default, when the flag is set to True, the EMR spin-up will look up the environment-specific CDAS cluster,
i.e. an EMR cluster spun up from Airflow DEV will use the CDAS cluster mapped as DEV for the respective NGAP2 AD group.
In addition, a variable can be set up in Airflow Variables which will override the default environment-based CDAS cluster selection for all the DAGs hosted in an Airflow cluster.
Example:
task1 = EmrOperator(
    task_id='EmrSpinup',
    cluster_action='spinup',
    cluster_name=CLUSTER_NAME,
    num_task_nodes=1,
    num_core_nodes=1,
    classification='bronze',
    queue='airflow',
    cost_center='COST_CENTER',
    emr_version='5.2.0',
    applications=['hive', 'pig', 'spark'],
    cerebro_cdas=True,
    dag=dag
)
Airflow Variable to override environment-based CDAS selection

In case we need to override the default setting, we can provide a custom tag pointing to the specific CDAS cluster:
task1 = EmrOperator(
    task_id='EmrSpinup',
    cluster_action='spinup',
    cluster_name=CLUSTER_NAME,
    num_task_nodes=1,
    num_core_nodes=1,
    classification='bronze',
    queue='airflow',
    cost_center='COST_CENTER',
    emr_version='5.2.0',
    applications=['hive', 'pig', 'spark'],
    cerebro_cdas=True,
    tags=[{"Key": "cerebro_cluster", "Value": "UNIFYTEST"}],
    dag=dag
)

Custom_dns for EMR Operator
If a team wants to create custom DNS entries, follow the syntax below:
task1 = EmrOperator(
    task_id='EmrSpinup',
    cluster_action='spinup',
    cluster_name=CLUSTER_NAME,
    num_task_nodes=1,
    num_core_nodes=1,
    classification='bronze',
    queue='airflow',
    cost_center='COST_CENTER',
    emr_version='5.21.0',
    applications=['hive', 'spark'],
    custom_dns=True,
    dag=dag
)

S3BoxFileTransferOperator
Uploads files from Box to S3. Available only on NGAP 2.0 Airflow.
Prerequisite: in order to use this operator you will have to create an Airflow support request to create a folder in the NGAP2 Box application (https://nike.ent.box.com/profile/4003105259).
Parameters
· conn_id (string) – The Box connection name, as configured in Admin → Connections
· box_dir (string) – Location of the files which need to be uploaded
· box_file_name (string) – File name for renaming
· s3path (string) – Full S3 path
· s3_Region (string) – Provide the respective region where your bucket resides
· file_type (string) – Provide the file types
· box_folder_id (string) – For every folder there is an ID (which is provided by the Platform team)
NOTE
1. This is an NGAP 2.0-only feature. All the connection details pertaining to your batch user need to be configured in Connections.
2. This operator moves all files from the given Box folder to the specified S3 prefix.
Example:
s3_box_transfer_task2 = S3BoxFileTransferOperator(
    task_id='box_s3_step',
    queue='airflow',
    s3path='s3://nike-emr-bin/shabarish/test/airflow/dags/test',
    box_folder_id='53639622254',
    s3_Region='us-east-1',
    dag=dag
)

PythonOperator
PythonOperator executes Python callables.
def print_context(ds, **kwargs):
    pprint(kwargs)
    print(ds)
    return 'Whatever you return gets printed in the logs'


run_this = PythonOperator(
    task_id='print_the_context',
    provide_context=True,
    python_callable=print_context,
    dag=dag)
Passing arguments
Use the op_args and op_kwargs arguments to pass additional arguments to the Python callable.
def my_sleeping_function(random_base):
    """This is a function that will run within the DAG execution"""
    time.sleep(random_base)


# Generate 5 sleeping tasks, sleeping from 0 to 0.4 seconds respectively
for i in range(5):
    task = PythonOperator(
        task_id='sleep_for_' + str(i),
        python_callable=my_sleeping_function,
        op_kwargs={'random_base': float(i) / 10},
        dag=dag)

    task.set_upstream(run_this)
Templating
When you set the provide_context argument to True, Airflow passes in an additional set of keyword arguments: one for each of the Jinja template variables, plus a templates_dict argument.
The values in the templates_dict argument are rendered as Jinja templates.
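The templates_dict behaviour can be illustrated without Airflow; below, str.format stands in for the Jinja rendering Airflow applies to each value (a simplification - the real rendering uses the full Jinja environment and task context):

```python
def render_templates(templates_dict, context):
    # Render each string value against the task context, roughly as
    # Airflow does with Jinja before invoking the Python callable
    return {key: value.format(**context) for key, value in templates_dict.items()}

print(render_templates({'msg': 'run for {ds}'}, {'ds': '2021-07-08'}))
```

Inside the callable, the rendered dictionary then arrives as kwargs['templates_dict'].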
GenieHiveOperator
Submits a Hive command to Genie.
Parameters
· command (string) – The command to submit to Genie
· job_name (string) – Job name to be displayed on the Genie console. The execution date will be appended by the Genie operator automatically
· sched_type (string, Optional) – The type of cluster the job is submitted to; values include: adhoc, ETL
· file_dependencies (string, Optional) – A comma-delimited list of files to be uploaded to Genie for use in submitting the job
e.g. file:///home/ubuntu/test/SimpleApp.py,file:///home/ubuntu/test/lorem.txt
· queue (Required) – Always use 'airflow'
Example:
hive_cmd = '-f s3://inbound/datascience/adhoc/ngap2_beta_tests/af_test_scripts/samplehivejob.hql'
task3 = GenieHiveOperator(
    task_id='TestHiveStep',
    command=hive_cmd,
    job_name='hivegeniejob',
    queue='airflow',
    sched_type='CLUSTER_NAME',
    dag=dag
)
Note:
For Okera-enabled EMR clusters (for PII data access), the Hive query files (.hql) need to have the commands below added at the start:
add jar hdfs:///user/hadoop/lib/okera-hive-metastore.jar;
add jar hdfs:///user/hadoop/lib/recordservice-hive.jar;
Google Cloud Platform Operators
GoogleCloudStorageToBigQueryOperator
Use the GoogleCloudStorageToBigQueryOperator to execute a BigQuery load job
load_csv gcs_to_bqGoogleCloudStorageToBigQueryOperator(
task_id'gcs_to_bq_example'
bucket'cloudsamplesdata'
source_objects['bigqueryusstatesusstatescsv']
destination_project_dataset_table'airflow_testgcs_to_bq_table'
schema_fields[
{'name' 'name' 'type' 'STRING' 'mode' 'NULLABLE'}
{'name' 'post_abbr' 'type' 'STRING' 'mode' 'NULLABLE'}
]
write_disposition'WRITE_TRUNCATE'
dagdag)
GceInstanceStartOperator
Allows to start an existing Google Compute Engine instance
In this example parameter values are extracted from Airflow variables Moreover the default_args dict is used to pass common arguments to all operators in a single DAG
PROJECT_ID = models.Variable.get('PROJECT_ID', '')
LOCATION = models.Variable.get('LOCATION', '')
INSTANCE = models.Variable.get('INSTANCE', '')
SHORT_MACHINE_TYPE_NAME = models.Variable.get('SHORT_MACHINE_TYPE_NAME', '')
SET_MACHINE_TYPE_BODY = {
    'machineType': 'zones/{}/machineTypes/{}'.format(LOCATION, SHORT_MACHINE_TYPE_NAME)
}

default_args = {
    'start_date': airflow.utils.dates.days_ago(1)
}
Define the GceInstanceStartOperator by passing the required arguments to the constructor:
gce_instance_start = GceInstanceStartOperator(
    project_id=PROJECT_ID,
    zone=LOCATION,
    resource_id=INSTANCE,
    task_id='gcp_compute_start_task'
)
GceInstanceStopOperator
Allows you to stop an existing Google Compute Engine instance.
For parameter definition, take a look at GceInstanceStartOperator above.
Define the GceInstanceStopOperator by passing the required arguments to the constructor:
gce_instance_stop = GceInstanceStopOperator(
    project_id=PROJECT_ID,
    zone=LOCATION,
    resource_id=INSTANCE,
    task_id='gcp_compute_stop_task'
)
GceSetMachineTypeOperator
Allows you to change the machine type of a stopped instance to the specified machine type.
For parameter definition, take a look at GceInstanceStartOperator above.
Define the GceSetMachineTypeOperator by passing the required arguments to the constructor:
gce_set_machine_type = GceSetMachineTypeOperator(
    project_id=PROJECT_ID,
    zone=LOCATION,
    resource_id=INSTANCE,
    body=SET_MACHINE_TYPE_BODY,
    task_id='gcp_compute_set_machine_type'
)
GcfFunctionDeleteOperator
Use the default_args dict to pass arguments to the operator.
PROJECT_ID = models.Variable.get('PROJECT_ID', '')
LOCATION = models.Variable.get('LOCATION', '')
ENTRYPOINT = models.Variable.get('ENTRYPOINT', '')
# A fully-qualified name of the function to delete

FUNCTION_NAME = 'projects/{}/locations/{}/functions/{}'.format(PROJECT_ID, LOCATION,
                                                               ENTRYPOINT)
default_args = {
    'start_date': airflow.utils.dates.days_ago(1)
}
Use the GcfFunctionDeleteOperator to delete a function from Google Cloud Functions:
t1 = GcfFunctionDeleteOperator(
    task_id="gcf_delete_task",
    name=FUNCTION_NAME
)
Troubleshooting
If you want to run or deploy an operator using a service account and get "forbidden 403" errors, it means that your service account does not have the correct Cloud IAM permissions.
1. Assign your Service Account the Cloud Functions Developer role.
2. Grant the user the Cloud IAM Service Account User role on the Cloud Functions runtime service account.
The typical way of assigning Cloud IAM permissions with gcloud is shown below. Just replace PROJECT_ID with the ID of your Google Cloud Platform project and SERVICE_ACCOUNT_EMAIL with the email of your service account.
gcloud iam service-accounts add-iam-policy-binding \
  PROJECT_ID@appspot.gserviceaccount.com \
  --member="serviceAccount:[SERVICE_ACCOUNT_EMAIL]" \
  --role="roles/iam.serviceAccountUser"
See Adding the IAM service agent user role to the runtime service for details.
GcfFunctionDeployOperator
Use the GcfFunctionDeployOperator to deploy a function from Google Cloud Functions
The following examples of Airflow variables show various variants and combinations of default_args that you can use. The variables are defined as follows:
PROJECT_ID = models.Variable.get('PROJECT_ID', '')
LOCATION = models.Variable.get('LOCATION', '')
SOURCE_ARCHIVE_URL = models.Variable.get('SOURCE_ARCHIVE_URL', '')
SOURCE_UPLOAD_URL = models.Variable.get('SOURCE_UPLOAD_URL', '')
SOURCE_REPOSITORY = models.Variable.get('SOURCE_REPOSITORY', '')
ZIP_PATH = models.Variable.get('ZIP_PATH', '')
ENTRYPOINT = models.Variable.get('ENTRYPOINT', '')
FUNCTION_NAME = 'projects/{}/locations/{}/functions/{}'.format(PROJECT_ID, LOCATION,
                                                               ENTRYPOINT)
RUNTIME = 'nodejs6'
VALIDATE_BODY = models.Variable.get('VALIDATE_BODY', True)

With those variables you can define the body of the request:
body = {
    "name": FUNCTION_NAME,
    "entryPoint": ENTRYPOINT,
    "runtime": RUNTIME,
    "httpsTrigger": {}
}
When you create a DAG, the default_args dictionary can be used to pass the body and other arguments:
default_args = {
    'start_date': dates.days_ago(1),
    'project_id': PROJECT_ID,
    'location': LOCATION,
    'body': body,
    'validate_body': VALIDATE_BODY
}
Note that neither the body nor the default_args are complete in the above examples. Depending on the variables set, there are different variants of how to pass the source-code-related fields. Currently you can pass either sourceArchiveUrl, sourceRepository, or sourceUploadUrl, as described in the Cloud Functions API specification. Additionally, default_args might contain a zip_path parameter to run the extra step of uploading the source code before deploying it. In the last case, you also need to provide an empty sourceUploadUrl parameter in the body.
Based on the variables defined above, example logic for setting the source-code-related fields is shown here:
if SOURCE_ARCHIVE_URL:
    body['sourceArchiveUrl'] = SOURCE_ARCHIVE_URL
elif SOURCE_REPOSITORY:
    body['sourceRepository'] = {
        'url': SOURCE_REPOSITORY
    }
elif ZIP_PATH:
    body['sourceUploadUrl'] = ''
    default_args['zip_path'] = ZIP_PATH
elif SOURCE_UPLOAD_URL:
    body['sourceUploadUrl'] = SOURCE_UPLOAD_URL
else:
    raise Exception("Please provide one of the source_code parameters")
The code to create the operator:
deploy_task = GcfFunctionDeployOperator(
    task_id="gcf_deploy_task",
    name=FUNCTION_NAME
)
Troubleshooting
If you want to run or deploy an operator using a service account and get "forbidden 403" errors, it means that your service account does not have the correct Cloud IAM permissions.
1. Assign your Service Account the Cloud Functions Developer role.
2. Grant the user the Cloud IAM Service Account User role on the Cloud Functions runtime service account.
The typical way of assigning Cloud IAM permissions with gcloud is shown below. Just replace PROJECT_ID with the ID of your Google Cloud Platform project and SERVICE_ACCOUNT_EMAIL with the email of your service account.
gcloud iam service-accounts add-iam-policy-binding \
  PROJECT_ID@appspot.gserviceaccount.com \
  --member="serviceAccount:[SERVICE_ACCOUNT_EMAIL]" \
  --role="roles/iam.serviceAccountUser"
See Adding the IAM service agent user role to the runtime service for details.
If the source code for your function is in a Google Source Repository, make sure that your service account has the Source Repository Viewer role so that the source code can be downloaded if necessary.
Slack Operator
This operator is used to send a message to a Slack channel.

Parameters
· task_id
· channel (required) – the Slack channel
· message (required) – the message to send
Note: You may have to add the user 'aslack_admin' to a private channel in order to be able to send messages to it.
Example:
from airflow import SlackOperator
task3 = SlackOperator(
    task_id='send_slack',
    channel='test_cwong',
    message='hi catherine',
    dag=dag)
 
If you want to code the DAG so that it sends a failure message to the channel, you can define the on_failure_cb as below:
on_failure_cb = SlackOperator(owner='owner', task_id='task_id', channel='test_channel', message='sorry fail').execute
and add the following code to the task:
on_failure_callback=on_failure_cb

SnowFlakeOperator
Executes a SQL query in a Snowflake database. NGAP 2.0 Airflow only.
Parameters
· conn_id (string) – the Snowflake connection name, as configured in Admin → Connections
· sql_file (string) – location of the query file to be executed
· parameters (dictionary) – parameters that can be passed as key-value pairs
NOTE
1. This is an NGAP 2.0-only feature. All the connection details pertaining to your batch user need to be configured in Connections.
2. Accepts multiple queries in a single file.
3. If an AWS key and secret key are required in a query, substitute them with aws_s3_key and aws_s3_secret_key. The operator is designed to pick the s3_default keys for your environment.
4. The variable names (keys) in the SQL file should be appended with (only in the SQL file, not in the input parameters dictionary).
Example:
sample_snowflake_tasks1 = SnowFlakeOperator(
    task_id='sample_snowflake_task1',
    sql_file='app/bin/common/scripts/test_snowflake_s3.sql',
    conn_id='snowflake',
    dag=dag,
    parameters={'location': 's3://nikebimanagedbidev/dtc_commerce/testtable'}
)
SnowflakeSensor
Runs a SQL statement until a criterion is met. It will keep trying while the SQL returns no row, or while the first cell is in (0, '0', '').
Parameters
· conn_id (string) – the Snowflake connection name, as configured in Admin → Connections
· sql (string) – the query to be executed
· poke_interval (integer) – how often, in seconds, to run the SQL statement
· timeout (integer) – how long to wait before failing
· soft_fail (bool) – set to true to mark the task as SKIPPED on failure

Example:
sample_snowflake_tasks1 = SnowflakeSensor(
    task_id='sample_snowflake_task1',
    sql='select * from some_table',
    poke_interval=60,  # 1 minute
    timeout=3600,      # 1 hour
    conn_id='snowflake',
    dag=dag
)
TaskDependencySensor
Waits for a task to complete in a different DAG, similar to ExternalTaskSensor but with more options.
Parameters
· external_dag_id (string) – the dag_id that contains the task you want to wait for
· external_task_id (string) – the task_id of the task you want to wait for
· allowed_states (list) – list of allowed states; the default is ['success']. Another option is ['failed', 'upstream_failed'].
· execution_delta (datetime.timedelta) – time difference with the previous execution to look at; the default is the same execution_date as the current task. For yesterday, use [positive] datetime.timedelta(days=1).
· execution_delta_json (json) – like execution_delta, but a mapping from run hour to delta, e.g. execution_delta_json={00 09 08 550 151110} for two DAGs that run at hours 0, 8, and 15, where each run of dag1 waits for the completion of a previous run of dag2 (dag1's 0th-hour run will poke for dag2's 15th-hour run on the previous day, dag1's 8th-hour run will poke for dag2's 1350th-hour run, and dag1's 15th-hour run will poke for dag2's 0350th-hour run).
· cluster_id (string), optional – cluster id of another Airflow cluster. Note: a connection to that cluster's backend MySQL database will have to be set up under Airflow → Admin → Connections.
· queue (string) – queue for the task to be sent to
NOTE: only one of execution_delta and execution_delta_json can be set.
Example:
task1 = TaskDependencySensor(
    task_id='task_dep',
    external_dag_id='CheckEMRHealthV2',
    external_task_id='success',
    allowed_states=['success'],
    cluster_id='dev1713',
    queue='airflow',
    dag=dag
)
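The execution_delta arithmetic can be sketched in plain Python (the dates below are made-up examples): the sensor subtracts the delta from its own execution_date to decide which external run to poke.

```python
from datetime import datetime, timedelta

# Hypothetical execution_date of the sensor's own DAG run:
execution_date = datetime(2021, 7, 8, 0, 0)

# execution_delta=timedelta(days=1) makes the sensor poke yesterday's
# run of the external DAG:
execution_delta = timedelta(days=1)
target_execution_date = execution_date - execution_delta

print(target_execution_date)  # 2021-07-07 00:00:00
```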
Managing Connections
Airflow needs to know how to connect to your environment. Information such as hostnames, ports, logins and passwords to other systems and services is handled in the Admin > Connections section of the UI. The pipeline code you author will reference the 'conn_id' of the Connection objects.

Connections can be created and managed using either the UI or environment variables.
See the Connections Concepts documentation for more information.
Creating a Connection with the UI
Open the Admin > Connections section of the UI. Click the Create link to create a new connection.

1. Fill in the Conn Id field with the desired connection ID. It is recommended that you use lowercase characters and separate words with underscores.
2. Choose the connection type with the Conn Type field.
3. Fill in the remaining fields. See Connection Types for a description of the fields belonging to the different connection types.
4. Click the Save button to create the connection.
Editing a Connection with the UI
Open the Admin > Connections section of the UI. Click the pencil icon next to the connection you wish to edit in the connection list.

Modify the connection properties and click the Save button to save your changes.
Creating a Connection with Environment Variables
Connections in Airflow pipelines can be created using environment variables. The environment variable needs to have a prefix of AIRFLOW_CONN_, with the value in a URI format, for Airflow to use the connection properly.
When referencing the connection in the Airflow pipeline, the conn_id should be the name of the variable without the prefix. For example, if the conn_id is named postgres_master, the environment variable should be named AIRFLOW_CONN_POSTGRES_MASTER (note that the environment variable must be all uppercase). Airflow assumes the value returned from the environment variable to be in a URI format (e.g. postgres://user:password@localhost:5432/master or s3://accesskey:secretkey@S3).
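As a rough sketch of how such a URI decomposes into connection fields, using only the standard library (the exact parsing Airflow performs may differ slightly):

```python
from urllib.parse import urlparse

# Hypothetical value of AIRFLOW_CONN_POSTGRES_MASTER:
uri = "postgres://user:password@localhost:5432/master"
parsed = urlparse(uri)

conn = {
    "conn_type": parsed.scheme,          # 'postgres'
    "login": parsed.username,            # 'user'
    "password": parsed.password,         # 'password'
    "host": parsed.hostname,             # 'localhost'
    "port": parsed.port,                 # 5432
    "schema": parsed.path.lstrip("/"),   # 'master'
}
print(conn)
```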
Connection Types
Google Cloud Platform
The Google Cloud Platform connection type enables the GCP integrations.
Authenticating to GCP
There are two ways to connect to GCP using Airflow:
1. Use Application Default Credentials, such as via the metadata server when running on Google Compute Engine.
2. Use a service account key file (JSON format) on disk.
Default Connection IDs
The following connection IDs are used by default.
bigquery_default
Used by the BigQueryHook hook.
google_cloud_datastore_default
Used by the DatastoreHook hook.
google_cloud_default
Used by the GoogleCloudBaseHook, DataFlowHook, DataProcHook, MLEngineHook, and GoogleCloudStorageHook hooks.
Configuring the Connection
Project Id (required)
The Google Cloud project ID to connect to.
Keyfile Path
Path to a service account key file (JSON format) on disk.
Not required if using application default credentials.
Keyfile JSON
Contents of a service account key file (JSON format) on disk. It is recommended to secure your connections if using this method to authenticate.
Not required if using application default credentials.
Scopes (comma separated)
A list of comma-separated Google Cloud scopes to authenticate with.
Note
Scopes are ignored when using application default credentials. See issue AIRFLOW-2522.
MySQL
The MySQL connection type allows you to connect to a MySQL database.
Configuring the Connection
Host (required)
The host to connect to.
Schema (optional)
Specify the schema name to be used in the database.
Login (required)
Specify the user name to connect.
Password (required)
Specify the password to connect.
Extra (optional)
Specify the charset. Example: {"charset": "utf8"}
Note
If you encounter a UnicodeDecodeError while working with a MySQL connection, check that the charset defined matches the database charset.
Securing Connections
By default, Airflow will save the passwords for connections in plain text within the metadata database. The crypto package is highly recommended during installation. The crypto package does require that your operating system have libffi-dev installed.
If the crypto package was not installed initially, you can still enable encryption for connections by following the steps below:
1. Install the crypto package: pip install apache-airflow[crypto]
2. Generate a fernet_key using the code snippet below; fernet_key must be a base64-encoded 32-byte key:
from cryptography.fernet import Fernet
fernet_key = Fernet.generate_key()
print(fernet_key.decode())  # your fernet_key, keep it in a secured place
3. Replace the airflow.cfg fernet_key value with the one from step 2. Alternatively, you can store your fernet_key in an OS environment variable. You do not need to change airflow.cfg in this case, as Airflow will use the environment variable over the value in airflow.cfg:
# Note the double underscores
export AIRFLOW__CORE__FERNET_KEY=your_fernet_key
4. Restart the Airflow webserver.
5. For existing connections (the ones that you had defined before installing airflow[crypto] and creating a Fernet key), you need to open each connection in the connection admin UI, re-type the password, and save it.
Writing Logs
Writing Logs Locally
Users can specify a logs folder in airflow.cfg using the base_log_folder setting. By default, it is in the AIRFLOW_HOME directory.
In addition, users can supply a remote location for storing logs and log backups in cloud storage.
In the Airflow Web UI, local logs take precedence over remote logs. If local logs can not be found or accessed, the remote logs will be displayed. Note that logs are only sent to remote storage once a task completes (including failure); in other words, remote logs for running tasks are unavailable. Logs are stored in the log folder as {dag_id}/{task_id}/{execution_date}/{try_number}.log.
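The local log layout named above can be sketched as a plain path template (the IDs and base folder below are made-up examples):

```python
import os

log_template = "{dag_id}/{task_id}/{execution_date}/{try_number}.log"

# Hypothetical task instance:
relative_path = log_template.format(
    dag_id="example_bash_operator",
    task_id="run_this_last",
    execution_date="2017-10-03T00:00:00",
    try_number=1,
)
# base_log_folder value here is an assumption for illustration
full_path = os.path.join("/var/log/airflow", relative_path)
print(relative_path)
```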
Writing Logs to Amazon S3
Before you begin
Remote logging uses an existing Airflow connection to read/write logs. If you don't have a connection properly set up, this will fail.
Enabling remote logging
To enable this feature, airflow.cfg must be configured as in this example:
[core]
# Airflow can store logs remotely in AWS S3. Users must supply a remote
# location URL (starting with 's3://') and an Airflow connection
# id that provides access to the storage location.
remote_base_log_folder = s3://my-bucket/path/to/logs
remote_log_conn_id = MyS3Conn
# Use server-side encryption for logs stored in S3
encrypt_s3_logs = False
In the above example, Airflow will try to use S3Hook('MyS3Conn').
Writing Logs to Azure Blob Storage
Airflow can be configured to read and write task logs in Azure Blob Storage. Follow the steps below to enable Azure Blob Storage logging.
1. Airflow's logging system requires a custom .py file to be located in the PYTHONPATH, so that it's importable from Airflow. Start by creating a directory to store the config file; $AIRFLOW_HOME/config is recommended.
2. Create empty files called $AIRFLOW_HOME/config/log_config.py and $AIRFLOW_HOME/config/__init__.py.
3. Copy the contents of airflow/config_templates/airflow_local_settings.py into the log_config.py file that was just created in the step above.
4. Customize the following portions of the template:
# wasb buckets should start with "wasb" just to help Airflow select the correct handler
REMOTE_BASE_LOG_FOLDER = 'wasb-<whatever you want here>'

# Rename DEFAULT_LOGGING_CONFIG to LOGGING_CONFIG
LOGGING_CONFIG = ...
5. Make sure an Azure Blob Storage (Wasb) connection hook has been defined in Airflow. The hook should have read and write access to the Azure Blob Storage bucket defined above in REMOTE_BASE_LOG_FOLDER.
6. Update $AIRFLOW_HOME/airflow.cfg to contain:
remote_logging = True
logging_config_class = log_config.LOGGING_CONFIG
remote_log_conn_id = <name of the Azure Blob Storage connection>
7. Restart the Airflow webserver and scheduler, and trigger (or wait for) a new task execution.
8. Verify that logs are showing up for newly executed tasks in the bucket you've defined.
Writing Logs to Google Cloud Storage
Follow the steps below to enable Google Cloud Storage logging.
To enable this feature, airflow.cfg must be configured as in this example:
[core]
# Airflow can store logs remotely in AWS S3, Google Cloud Storage or Elastic Search.
# Users must supply an Airflow connection id that provides access to the storage
# location. If remote_logging is set to true, see UPDATING.md for additional
# configuration requirements.
remote_logging = True
remote_base_log_folder = gs://my-bucket/path/to/logs
remote_log_conn_id = MyGCSConn
1. Install the gcp_api package first, like so: pip install apache-airflow[gcp_api].
2. Make sure a Google Cloud Platform connection hook has been defined in Airflow. The hook should have read and write access to the Google Cloud Storage bucket defined above in remote_base_log_folder.
3. Restart the Airflow webserver and scheduler, and trigger (or wait for) a new task execution.
4. Verify that logs are showing up for newly executed tasks in the bucket you've defined.
5. Verify that the Google Cloud Storage viewer is working in the UI. Pull up a newly executed task and verify that you see something like:
*** Reading remote log from gs://<bucket>/example_bash_operator/run_this_last/2017-10-03T00:00:00/16.log
[2017-10-03 21:57:50,056] {cli.py:377} INFO - Running on host chrisr-00532
[2017-10-03 21:57:50,093] {base_task_runner.py:115} INFO - Running: ['bash', '-c', u'airflow run example_bash_operator run_this_last 2017-10-03T00:00:00 --job_id 47 --raw -sd DAGS_FOLDER/example_dags/example_bash_operator.py']
[2017-10-03 21:57:51,264] {base_task_runner.py:98} INFO - Subtask: [2017-10-03 21:57:51,263] {__init__.py:45} INFO - Using executor SequentialExecutor
[2017-10-03 21:57:51,306] {base_task_runner.py:98} INFO - Subtask: [2017-10-03 21:57:51,306] {models.py:186} INFO - Filling up the DagBag from /airflow/dags/example_dags/example_bash_operator.py
Note the top line, which says it's reading from the remote log file.
Scaling Out with Celery
CeleryExecutor is one of the ways you can scale out the number of workers. For this to work, you need to set up a Celery backend (RabbitMQ, Redis, …), change your airflow.cfg to point the executor parameter to CeleryExecutor, and provide the related Celery settings.
For more information about setting up a Celery broker, refer to the exhaustive Celery documentation on the topic.
Here are a few imperative requirements for your workers:
· airflow needs to be installed, and the CLI needs to be on the path
· Airflow configuration settings should be homogeneous across the cluster
· Operators that are executed on the worker need to have their dependencies met in that context. For example, if you use the HiveOperator, the hive CLI needs to be installed on that box; or if you use the MySqlOperator, the required Python library needs to be available in the PYTHONPATH somehow
· The worker needs to have access to its DAGS_FOLDER, and you need to synchronize the filesystems by your own means. A common setup would be to store your DAGS_FOLDER in a Git repository and sync it across machines using Chef, Puppet, Ansible, or whatever you use to configure machines in your environment. If all your boxes have a common mount point, having your pipeline files shared there should work as well
To kick off a worker, you need to set up Airflow and run the worker subcommand:
airflow worker
Your worker should start picking up tasks as soon as they get fired in its direction.
Note that you can also run Celery Flower, a web UI built on top of Celery, to monitor your workers. You can use the shortcut command airflow flower to start a Flower web server.
Some caveats:
· Make sure to use a database-backed result backend
· Make sure to set a visibility timeout in [celery_broker_transport_options] that exceeds the ETA of your longest-running task
· Tasks can consume resources; make sure your worker has enough resources to run worker_concurrency tasks
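Putting the executor switch and the caveats above together, an airflow.cfg fragment might look like the sketch below; the broker and result-backend URLs are placeholders for your own infrastructure, not required values:

```ini
[core]
executor = CeleryExecutor

[celery]
# Placeholder broker/backend URLs - point these at your own Redis/DB
broker_url = redis://localhost:6379/0
result_backend = db+postgresql://airflow:airflow@localhost/airflow
worker_concurrency = 16

[celery_broker_transport_options]
# Must exceed the ETA of your longest-running task (seconds)
visibility_timeout = 21600
```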
Scaling Out with Dask
DaskExecutor allows you to run Airflow tasks in a Dask Distributed cluster.
Dask clusters can be run on a single machine or on remote networks. For complete details, consult the Distributed documentation.
To create a cluster, first start a Scheduler:
# default settings for a local cluster
DASK_HOST=127.0.0.1
DASK_PORT=8786

dask-scheduler --host $DASK_HOST --port $DASK_PORT
Next, start at least one Worker on any machine that can connect to the host:
dask-worker $DASK_HOST:$DASK_PORT
Edit your airflow.cfg to set your executor to DaskExecutor and provide the Dask Scheduler address in the [dask] section.
Please note:
· Each Dask worker must be able to import Airflow and any dependencies you require
· Dask does not support queues. If an Airflow task was created with a queue, a warning will be raised but the task will be submitted to the cluster
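An airflow.cfg fragment matching the local cluster started above might look like this (the address assumes the default host/port shown earlier):

```ini
[core]
executor = DaskExecutor

[dask]
# Address of the dask-scheduler started above
cluster_address = 127.0.0.1:8786
```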
Scaling Out with Mesos (community contributed)
There are two ways you can run Airflow as a Mesos framework:
1. Running Airflow tasks directly on Mesos slaves, requiring each Mesos slave to have Airflow installed and configured.
2. Running Airflow tasks inside a Docker container that has Airflow installed, which is run on a Mesos slave.
Tasks executed directly on mesos slaves
MesosExecutor allows you to schedule Airflow tasks on a Mesos cluster. For this to work, you need a running Mesos cluster and you must perform the following steps:
1. Install Airflow on a Mesos slave where the web server and scheduler will run; let's refer to this as the Airflow server.
2. On the Airflow server, install the Mesos python eggs from the Mesos downloads.
3. On the Airflow server, use a database (such as MySQL) which can be accessed from all Mesos slaves, and add the configuration in airflow.cfg.
4. Change your airflow.cfg to point the executor parameter to MesosExecutor and provide the related Mesos settings.
5. On all Mesos slaves, install Airflow. Copy the airflow.cfg from the Airflow server (so that it uses the same SQLAlchemy connection).
6. On all Mesos slaves, run the following to serve logs:
airflow serve_logs
7. On the Airflow server, to start processing/scheduling DAGs on Mesos, run:
airflow scheduler -p
Note: We need the -p parameter to pickle the DAGs.
You can now see the Airflow framework and corresponding tasks in the Mesos UI. The logs for Airflow tasks can be seen in the Airflow UI as usual.
For more information about Mesos, refer to the Mesos documentation. For any queries/bugs on MesosExecutor, please contact @kapilmalik.
Tasks executed in containers on mesos slaves
This gist contains all files and configuration changes necessary to achieve the following:
1. Create a dockerized version of Airflow with Mesos python eggs installed.
We recommend taking advantage of Docker's multi-stage builds in order to achieve this. We have one Dockerfile that defines building a specific version of Mesos from source (Dockerfile-mesos), in order to create the python eggs. In the Airflow Dockerfile (Dockerfile-airflow) we copy the python eggs from the Mesos image.
2. Create a Mesos configuration block within airflow.cfg.
The configuration block remains the same as the default Airflow configuration (default_airflow.cfg), but has the addition of an option docker_image_slave. This should be set to the name of the image you would like Mesos to use when running Airflow tasks. Make sure you have the proper configuration of the DNS record for your Mesos master, and any sort of authorization, if any exists.
3. Change your airflow.cfg to point the executor parameter to MesosExecutor (executor = MesosExecutor).
4. Make sure your Mesos slave has access to the Docker repository you are using for your docker_image_slave.
Instructions are available in the Mesos docs.
The rest is up to you and how you want to work with a dockerized Airflow configuration.
Running Airflow with systemd
Airflow can integrate with systemd-based systems. This makes watching your daemons easy, as systemd can take care of restarting a daemon on failure. In the scripts/systemd directory, you can find unit files that have been tested on Red Hat-based systems. You can copy those to /usr/lib/systemd/system. It is assumed that Airflow will run under airflow:airflow. If not (or if you are running on a non-Red Hat-based system), you probably need to adjust the unit files.
Environment configuration is picked up from /etc/sysconfig/airflow. An example file is supplied. Make sure to specify the SCHEDULER_RUNS variable in this file when you run the scheduler. You can also define, for example, AIRFLOW_HOME or AIRFLOW_CONFIG here.
Running Airflow with upstart
Airflow can integrate with upstart-based systems. Upstart automatically starts all Airflow services for which you have a corresponding *.conf file in /etc/init upon system boot. On failure, upstart automatically restarts the process (until it reaches the respawn limit set in a *.conf file).
You can find sample upstart job files in the scripts/upstart directory. These files have been tested on Ubuntu 14.04 LTS. You may have to adjust the 'start on' and 'stop on' stanzas to make them work on other upstart systems. Some of the possible options are listed in scripts/upstart/README.
Modify the *.conf files as needed and copy them to the /etc/init directory. It is assumed that Airflow will run under airflow:airflow. Change setuid and setgid in the *.conf files if you use another user/group.
You can use initctl to manually start, stop, or view the status of the Airflow process that has been integrated with upstart:
initctl airflow-webserver status
Using the Test Mode Configuration
Airflow has a fixed set of "test mode" configuration options. You can load these at any time by calling airflow.configuration.load_test_config() (note that this operation is not reversible). However, some options (like the DAG_FOLDER) are loaded before you have a chance to call load_test_config(). In order to eagerly load the test configuration, set test_mode in airflow.cfg:
[tests]
unit_test_mode = True
Due to Airflow's automatic environment variable expansion (see Setting Configuration Options), you can also set the env var AIRFLOW__CORE__UNIT_TEST_MODE to temporarily overwrite airflow.cfg.
UI Screenshots
The Airflow UI makes it easy to monitor and troubleshoot your data pipelines. Here's a quick overview of some of the features and visualizations you can find in the Airflow UI.
DAGs View
A list of the DAGs in your environment, and a set of shortcuts to useful pages. You can see exactly how many tasks succeeded, failed, or are currently running, at a glance.



Tree View
A tree representation of the DAG that spans across time. If a pipeline is late, you can quickly see where the different steps are and identify the blocking ones.



Graph View
The graph view is perhaps the most comprehensive. Visualize your DAG's dependencies and their current status for a specific run.



Variable View
The variable view allows you to list, create, edit, or delete the key-value pairs of variables used during jobs. The value of a variable will be hidden if the key contains any of the words ('password', 'secret', 'passwd', 'authorization', 'api_key', 'apikey', 'access_token') by default, but it can be configured to show in cleartext.



Gantt Chart
The Gantt chart lets you analyse task duration and overlap. You can quickly identify bottlenecks and see where the bulk of the time is spent for specific DAG runs.



Task Duration
The duration of your different tasks over the past N runs. This view lets you find outliers and quickly understand where time is spent in your DAG over many runs.



Code View
Transparency is everything. While the code for your pipeline is in source control, this is a quick way to get to the code that generates the DAG and provides yet more context.



Task Instance Context Menu
From the pages seen above (tree view, graph view, Gantt chart, …), it is always possible to click on a task instance and get to this rich context menu, which can take you to more detailed metadata and let you perform some actions.


Concepts
The Airflow platform is a tool for describing, executing, and monitoring workflows.
Core Ideas
DAGs
In Airflow, a DAG (Directed Acyclic Graph) is a collection of the tasks you want to run, organized in a way that reflects their relationships and dependencies.
For example, a simple DAG could consist of three tasks: A, B, and C. It could say that A has to run successfully before B can run, but C can run anytime. It could say that task A times out after 5 minutes, and B can be restarted up to 5 times in case it fails. It might also say that the workflow runs every night at 10pm, but shouldn't start until a certain date.
In this way, a DAG describes how you want to carry out your workflow; but notice that we haven't said anything about what we actually want to do! A, B, and C could be anything. Maybe A prepares data for B to analyze while C sends an email. Or perhaps A monitors your location so B can open your garage door while C turns on your house lights. The important thing is that the DAG isn't concerned with what its constituent tasks do; its job is to make sure that whatever they do happens at the right time, in the right order, and with the right handling of any unexpected issues.
DAGs are defined in standard Python files that are placed in Airflow's DAG_FOLDER. Airflow will execute the code in each file to dynamically build the DAG objects. You can have as many DAGs as you want, each describing an arbitrary number of tasks. In general, each one should correspond to a single logical workflow.
Note
When searching for DAGs, Airflow will only consider files where the strings 'airflow' and 'DAG' both appear in the contents of the .py file.
Scope
Airflow will load any DAG object it can import from a DAG file. Critically, that means the DAG must appear in globals(). Consider the following two DAGs: only dag_1 will be loaded; the other one only appears in a local scope.
dag_1 = DAG('this_dag_will_be_discovered')

def my_function():
    dag_2 = DAG('but_this_dag_will_not')

my_function()
Sometimes this can be put to good use. For example, a common pattern with SubDagOperator is to define the subdag inside a function so that Airflow doesn't try to load it as a standalone DAG.
Default Arguments
If a dictionary of default_args is passed to a DAG, it will apply them to any of its operators. This makes it easy to apply a common parameter to many operators without having to type it many times.
default_args = {
    'start_date': datetime(2016, 1, 1),
    'owner': 'Airflow'
}

dag = DAG('my_dag', default_args=default_args)
op = DummyOperator(task_id='dummy', dag=dag)
print(op.owner)  # Airflow
Context Manager
Added in Airflow 1.8
DAGs can be used as context managers to automatically assign new operators to that DAG.
with DAG('my_dag', start_date=datetime(2016, 1, 1)) as dag:
    op = DummyOperator('op')

op.dag is dag  # True
Operator
While DAGs describe how to run a workflow, Operators determine what actually gets done.
An operator describes a single task in a workflow. Operators are usually (but not always) atomic, meaning they can stand on their own and don't need to share resources with any other operators. The DAG will make sure that operators run in the correct order; other than those dependencies, operators generally run independently. In fact, they may run on two completely different machines.
This is a subtle but very important point: in general, if two operators need to share information, like a filename or small amount of data, you should consider combining them into a single operator. If it absolutely can't be avoided, Airflow does have a feature for operator cross-communication called XCom that is described elsewhere in this document.
Airflow provides operators for many common tasks, including:
· BashOperator executes a bash command
· PythonOperator calls an arbitrary Python function
· EmailOperator sends an email
· SimpleHttpOperator sends an HTTP request
· MySqlOperator, SqliteOperator, PostgresOperator, MsSqlOperator, OracleOperator, JdbcOperator, etc. executes a SQL command
· Sensor waits for a certain time, file, database row, S3 key, etc.
In addition to these basic building blocks, there are many more specific operators: DockerOperator, HiveOperator, S3FileTransformOperator, PrestoToMysqlOperator, SlackOperator… you get the idea!
The airflow/contrib/ directory contains yet more operators built by the community. These operators aren't always as complete or well-tested as those in the main distribution, but allow users to more easily add new functionality to the platform.
Operators are only loaded by Airflow if they are assigned to a DAG.
See Using Operators for how to use Airflow operators.
DAG Assignment
Added in Airflow 1.8
Operators do not have to be assigned to DAGs immediately (previously dag was a required argument). However, once an operator is assigned to a DAG, it cannot be transferred or unassigned. DAG assignment can be done explicitly when the operator is created, through deferred assignment, or even inferred from other operators.
dag = DAG('my_dag', start_date=datetime(2016, 1, 1))

# sets the DAG explicitly
explicit_op = DummyOperator(task_id='op1', dag=dag)

# deferred DAG assignment
deferred_op = DummyOperator(task_id='op2')
deferred_op.dag = dag

# inferred DAG assignment (linked operators must be in the same DAG)
inferred_op = DummyOperator(task_id='op3')
inferred_op.set_upstream(deferred_op)
Bitshift Composition
Added in Airflow 1.8
Traditionally, operator relationships are set with the set_upstream() and set_downstream() methods. In Airflow 1.8, this can be done with the Python bitshift operators >> and <<. The following four statements are all functionally equivalent:
op1 >> op2
op1.set_downstream(op2)

op2 << op1
op2.set_upstream(op1)
When using the bitshift to compose operators, the relationship is set in the direction that the bitshift operator points. For example, op1 >> op2 means that op1 runs first and op2 runs second. Multiple operators can be composed – keep in mind the chain is executed left-to-right and the rightmost object is always returned. For example:
op1 >> op2 >> op3 << op4
is equivalent to:
op1.set_downstream(op2)
op2.set_downstream(op3)
op3.set_upstream(op4)
For convenience, the bitshift operators can also be used with DAGs. For example:
dag >> op1 >> op2
is equivalent to:
op1.dag = dag
op1.set_downstream(op2)
We can put this all together to build a simple pipeline:
with DAG('my_dag', start_date=datetime(2016, 1, 1)) as dag:
    (
        DummyOperator(task_id='dummy_1')
        >> BashOperator(
            task_id='bash_1',
            bash_command='echo HELLO')
        >> PythonOperator(
            task_id='python_1',
            python_callable=lambda: print("GOODBYE"))
    )
Task
Once an operator is instantiated, it is referred to as a "task". The instantiation defines specific values when calling the abstract operator, and the parameterized task becomes a node in a DAG.
Task Instance
A task instance represents a specific run of a task and is characterized as the combination of a dag, a task, and a point in time. Task instances also have an indicative state, which could be "running", "success", "failed", "skipped", "up for retry", etc.
Workflow
You're now familiar with the core building blocks of Airflow. Some of the concepts may sound very similar, but the vocabulary can be conceptualized like this:
· DAG: a description of the order in which work should take place
· Operator: a class that acts as a template for carrying out some work
· Task: a parameterized instance of an operator
· Task Instance: a task that 1) has been assigned to a DAG and 2) has a state associated with a specific run of the DAG
By combining DAGs and Operators to create TaskInstances, you can build complex workflows.
Additional Functionality
In addition to the core Airflow objects, there are a number of more complex features that enable behaviors like limiting simultaneous access to resources, cross-communication, conditional execution, and more.
Hook
Hooks are interfaces to external platforms and databases like Hive, S3, MySQL, Postgres, HDFS, and Pig. Hooks implement a common interface when possible, and act as a building block for operators. They also use the airflow.models.Connection model to retrieve hostnames and authentication information. Hooks keep authentication code and information out of pipelines, centralized in the metadata database.
Hooks are also very useful on their own to use in Python scripts, in Airflow airflow.operators.PythonOperator, and in interactive environments like iPython or Jupyter Notebook.
Pool
Some systems can get overwhelmed when too many processes hit them at the same time. Airflow pools can be used to limit the execution parallelism on arbitrary sets of tasks. The list of pools is managed in the UI (Menu -> Admin -> Pools) by giving the pools a name and assigning it a number of worker slots. Tasks can then be associated with one of the existing pools by using the pool parameter when creating tasks (i.e., instantiating operators).
aggregate_db_message_job = BashOperator(
    task_id='aggregate_db_message_job',
    execution_timeout=timedelta(hours=3),
    pool='ep_data_pipeline_db_msg_agg',
    bash_command=aggregate_db_message_job_cmd,
    dag=dag)
aggregate_db_message_job.set_upstream(wait_for_empty_queue)
The pool parameter can be used in conjunction with priority_weight to define priorities in the queue, and which tasks get executed first as slots open up in the pool. The default priority_weight is 1, and can be bumped to any number. When sorting the queue to evaluate which task should be executed next, we use the priority_weight, summed up with all of the priority_weight values from tasks downstream from this task. You can use this to bump a specific important task, and the whole path to that task gets prioritized accordingly.
Tasks will be scheduled as usual while the slots fill up. Once capacity is reached, runnable tasks get queued and their state will show as such in the UI. As slots free up, queued tasks start running based on the priority_weight (of the task and its descendants).
Note that by default tasks aren't assigned to any pool and their execution parallelism is only limited to the executor's setting.
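The summing rule above can be sketched in plain Python. This is a toy model, not Airflow internals; the task names, weights, and graph below are made up for illustration:

```python
# Hypothetical sketch of the default priority weighting: a task's effective
# priority is its own priority_weight plus the priority_weight of every task
# downstream of it.
def effective_priority(task_id, weights, downstream):
    """weights: {task_id: priority_weight}; downstream: {task_id: [task_ids]}."""
    seen = set()

    def collect(tid):
        # Walk the downstream graph, counting each task at most once.
        for child in downstream.get(tid, []):
            if child not in seen:
                seen.add(child)
                collect(child)

    collect(task_id)
    return weights[task_id] + sum(weights[t] for t in seen)

weights = {'extract': 1, 'transform': 1, 'load': 10}
downstream = {'extract': ['transform'], 'transform': ['load']}
print(effective_priority('extract', weights, downstream))  # 12
```

Bumping the weight of the final 'load' task raises the effective priority of everything upstream of it, which is exactly the "whole path gets prioritized" behaviour described above.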
Connection
The connection information to external systems is stored in the Airflow metadata database and managed in the UI (Menu -> Admin -> Connections). A conn_id is defined there, with hostname / login / password / schema information attached to it. Airflow pipelines can simply refer to the centrally managed conn_id without having to hard code any of this information anywhere.
Many connections with the same conn_id can be defined; when that is the case, and when the hooks use the get_connection method from BaseHook, Airflow will choose one connection randomly, allowing for some basic load balancing and fault tolerance when used in conjunction with retries.
Airflow also has the ability to reference connections via environment variables from the operating system, but it only supports URI format. If you need to specify extra for your connection, please use the web UI.
If connections with the same conn_id are defined in both the Airflow metadata database and environment variables, only the one in environment variables will be referenced by Airflow (for example, given conn_id postgres_master, Airflow will search for AIRFLOW_CONN_POSTGRES_MASTER in environment variables first and directly reference it if found, before it starts to search in the metadata database).
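The lookup order described above can be illustrated with a small stand-in. The resolve_connection helper and the metadata_db dict below are hypothetical, not the Airflow API; only the AIRFLOW_CONN_&lt;CONN_ID&gt; naming convention comes from the text:

```python
import os

# Illustrative sketch (not Airflow's actual code) of the precedence rule:
# an AIRFLOW_CONN_<CONN_ID> environment variable wins over a metadata-DB
# entry with the same conn_id.
def resolve_connection(conn_id, metadata_db):
    env_key = 'AIRFLOW_CONN_' + conn_id.upper()
    if env_key in os.environ:
        return os.environ[env_key]   # URI-format string from the environment
    return metadata_db.get(conn_id)  # fall back to the metadata database

metadata_db = {'postgres_master': 'postgres://db-host:5432/prod'}
os.environ['AIRFLOW_CONN_POSTGRES_MASTER'] = 'postgres://env-host:5432/prod'
print(resolve_connection('postgres_master', metadata_db))  # postgres://env-host:5432/prod
```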
Many hooks have a default conn_id, where operators using that hook do not need to supply an explicit connection ID. For example, the default conn_id for the PostgresHook is postgres_default.
See Managing Connections for how to create and manage connections.
Queue
When using the CeleryExecutor, the Celery queues that tasks are sent to can be specified. queue is an attribute of BaseOperator, so any task can be assigned to any queue. The default queue for the environment is defined in the airflow.cfg's celery -> default_queue. This defines the queue that tasks get assigned to when not specified, as well as which queue Airflow workers listen to when started.
Workers can listen to one or multiple queues of tasks. When a worker is started (using the command airflow worker), a set of comma-delimited queue names can be specified (e.g. airflow worker -q spark). This worker will then only pick up tasks wired to the specified queue(s).
This can be useful if you need specialized workers, either from a resource perspective (for say very lightweight tasks where one worker could take thousands of tasks without a problem), or from an environment perspective (you want a worker running from within the Spark cluster itself because it needs a very specific environment and security rights).
XCom
XComs let tasks exchange messages, allowing more nuanced forms of control and shared state. The name is an abbreviation of "cross-communication". XComs are principally defined by a key, value, and timestamp, but also track attributes like the task/DAG that created the XCom and when it should become visible. Any object that can be pickled can be used as an XCom value, so users should make sure to use objects of appropriate size.
XComs can be "pushed" (sent) or "pulled" (received). When a task pushes an XCom, it makes it generally available to other tasks. Tasks can push XComs at any time by calling the xcom_push() method. In addition, if a task returns a value (either from its Operator's execute() method, or from a PythonOperator's python_callable function), then an XCom containing that value is automatically pushed.
Tasks call xcom_pull() to retrieve XComs, optionally applying filters based on criteria like key, source task_ids, and source dag_id. By default, xcom_pull() filters for the keys that are automatically given to XComs when they are pushed by being returned from execute functions (as opposed to XComs that are pushed manually).
If xcom_pull is passed a single string for task_ids, then the most recent XCom value from that task is returned; if a list of task_ids is passed, then a corresponding list of XCom values is returned.
# inside a PythonOperator called 'pushing_task'
def push_function():
    return value

# inside another PythonOperator where provide_context=True
def pull_function(**context):
    value = context['task_instance'].xcom_pull(task_ids='pushing_task')
It is also possible to pull XCom directly in a template. Here's an example of what this may look like:
SELECT * FROM {{ task_instance.xcom_pull(task_ids='foo', key='table_name') }}
Note that XComs are similar to Variables, but are specifically designed for inter-task communication rather than global settings.
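The push/pull semantics described above can be mimicked with a toy in-memory store. The XComStore class below is illustrative only, not Airflow's implementation:

```python
# Toy model of xcom_push / xcom_pull: a single task_id returns the most
# recent value; a list of task_ids returns a corresponding list of values.
class XComStore:
    def __init__(self):
        self._values = {}  # task_id -> list of pushed values, oldest first

    def xcom_push(self, task_id, value):
        self._values.setdefault(task_id, []).append(value)

    def xcom_pull(self, task_ids):
        if isinstance(task_ids, str):
            return self._values[task_ids][-1]           # most recent XCom value
        return [self._values[t][-1] for t in task_ids]  # one value per task_id

store = XComStore()
store.xcom_push('pushing_task', 'v1')
store.xcom_push('pushing_task', 'v2')
print(store.xcom_pull('pushing_task'))    # v2
print(store.xcom_pull(['pushing_task']))  # ['v2']
```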
Variables
Variables are a generic way to store and retrieve arbitrary content or settings as a simple key value store within Airflow. Variables can be listed, created, updated, and deleted from the UI (Admin -> Variables), code, or CLI. In addition, json settings files can be bulk uploaded through the UI. While your pipeline code definition and most of your constants and variables should be defined in code and stored in source control, it can be useful to have some variables or configuration items accessible and modifiable through the UI.
from airflow.models import Variable
foo = Variable.get("foo")
bar = Variable.get("bar", deserialize_json=True)
The second call assumes json content and will be deserialized into bar. Note that Variable is a sqlalchemy model and can be used as such.
You can use a variable from a jinja template with the syntax:
echo {{ var.value.<variable_name> }}
or if you need to deserialize a json object from the variable:
echo {{ var.json.<variable_name> }}
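The Variable behaviour above can be approximated with a plain key/value store. VariableStore below is a hypothetical stand-in, not airflow.models.Variable; it only shows how deserialize_json changes the returned type:

```python
import json

# Minimal stand-in for Variable.get(): a key/value store whose get() can
# optionally deserialize JSON content.
class VariableStore:
    def __init__(self):
        self._store = {}

    def set(self, key, value):
        self._store[key] = value

    def get(self, key, deserialize_json=False):
        raw = self._store[key]
        return json.loads(raw) if deserialize_json else raw

v = VariableStore()
v.set('foo', 'plain text')
v.set('bar', '{"threshold": 10}')
print(v.get('foo'))                         # plain text
print(v.get('bar', deserialize_json=True))  # {'threshold': 10}
```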
Branching
Sometimes you need a workflow to branch, or only go down a certain path based on an arbitrary condition, which is typically related to something that happened in an upstream task. One way to do this is by using the BranchPythonOperator.
The BranchPythonOperator is much like the PythonOperator except that it expects a python_callable that returns a task_id. The task_id returned is followed, and all of the other paths are skipped. The task_id returned by the Python function has to be referencing a task directly downstream from the BranchPythonOperator task.
Note that using tasks with depends_on_past=True downstream from BranchPythonOperator is logically unsound, as a skipped status will invariably lead to blocked tasks that depend on their past successes. Skipped states propagate where all directly upstream tasks are skipped.
If you want to skip some tasks, keep in mind that you can't have an empty path; if so, make a dummy task.
Like this, where the dummy task "branch_false" is skipped:

Not like this, where the join task is skipped:
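The branching semantics can be sketched as follows. run_branch is a hypothetical helper, not the BranchPythonOperator API; it only models "follow the returned task_id, skip the other direct downstream tasks":

```python
# Hypothetical sketch of BranchPythonOperator semantics: the python_callable
# returns one task_id; that branch is followed and every other direct
# downstream task is marked skipped.
def run_branch(python_callable, downstream_task_ids):
    chosen = python_callable()
    return {t: ('follow' if t == chosen else 'skipped')
            for t in downstream_task_ids}

states = run_branch(lambda: 'branch_true', ['branch_true', 'branch_false'])
print(states)  # {'branch_true': 'follow', 'branch_false': 'skipped'}
```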

SubDAGs
SubDAGs are perfect for repeating patterns. Defining a function that returns a DAG object is a nice design pattern when using Airflow.
Airbnb uses the stage-check-exchange pattern when loading data. Data is staged in a temporary table, after which data quality checks are performed against that table. Once the checks all pass, the partition is moved into the production table.
As an example, consider the following DAG:

We can combine all of the parallel task-* operators into a single SubDAG, so that the resulting DAG resembles the following:

Note that SubDAG operators should contain a factory method that returns a DAG object. This will prevent the SubDAG from being treated like a separate DAG in the main UI. For example:
# dags/subdag.py
from airflow.models import DAG
from airflow.operators.dummy_operator import DummyOperator


# Dag is returned by a factory method
def sub_dag(parent_dag_name, child_dag_name, start_date, schedule_interval):
    dag = DAG(
        '%s.%s' % (parent_dag_name, child_dag_name),
        schedule_interval=schedule_interval,
        start_date=start_date,
    )

    dummy_operator = DummyOperator(
        task_id='dummy_task',
        dag=dag,
    )

    return dag
You can then reference the SubDAG in your main DAG file:
# main_dag.py
from datetime import datetime, timedelta
from airflow.models import DAG
from airflow.operators.subdag_operator import SubDagOperator
from dags.subdag import sub_dag


PARENT_DAG_NAME = 'parent_dag'
CHILD_DAG_NAME = 'child_dag'

main_dag = DAG(
    dag_id=PARENT_DAG_NAME,
    schedule_interval=timedelta(hours=1),
    start_date=datetime(2016, 1, 1)
)

sub_dag = SubDagOperator(
    subdag=sub_dag(PARENT_DAG_NAME, CHILD_DAG_NAME, main_dag.start_date,
                   main_dag.schedule_interval),
    task_id=CHILD_DAG_NAME,
    dag=main_dag,
)
You can zoom into a SubDagOperator from the graph view of the main DAG to show the tasks contained within the SubDAG:

Some other tips when using SubDAGs:
· by convention, a SubDAG's dag_id should be prefixed by its parent and a dot, as in parent.child
· share arguments between the main DAG and the SubDAG by passing arguments to the SubDAG operator (as demonstrated above)
· SubDAGs must have a schedule and be enabled. If the SubDAG's schedule is set to None or @once, the SubDAG will succeed without having done anything
· clearing a SubDagOperator also clears the state of the tasks within
· marking success on a SubDagOperator does not affect the state of the tasks within
· refrain from using depends_on_past=True in tasks within the SubDAG, as this can be confusing
· it is possible to specify an executor for the SubDAG. It is common to use the SequentialExecutor if you want to run the SubDAG in-process and effectively limit its parallelism to one. Using LocalExecutor can be problematic as it may over-subscribe your worker, running multiple tasks in a single slot
See airflow/example_dags for a demonstration.
SLAs
Service Level Agreements, or time by which a task or DAG should have succeeded, can be set at a task level as a timedelta. If one or many instances have not succeeded by that time, an alert email is sent detailing the list of tasks that missed their SLA. The event is also recorded in the database and made available in the web UI under Browse -> Missed SLAs, where events can be analyzed and documented.
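The SLA check described above amounts to a simple timedelta comparison. missed_sla below is a hypothetical helper, not Airflow code, under the assumption that a task misses its SLA when it has not succeeded within `sla` of its schedule time:

```python
from datetime import datetime, timedelta

# Illustrative SLA check: a task misses its SLA if it never succeeded, or
# succeeded later than scheduled_at + sla.
def missed_sla(scheduled_at, succeeded_at, sla):
    return succeeded_at is None or succeeded_at > scheduled_at + sla

scheduled = datetime(2021, 1, 1, 22, 0)
print(missed_sla(scheduled, datetime(2021, 1, 1, 22, 30), timedelta(hours=1)))  # False
print(missed_sla(scheduled, datetime(2021, 1, 2, 0, 0), timedelta(hours=1)))    # True
```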
Trigger Rules
Though the normal workflow behavior is to trigger tasks when all their directly upstream tasks have succeeded, Airflow allows for more complex dependency settings.
All operators have a trigger_rule argument which defines the rule by which the generated task gets triggered. The default value for trigger_rule is all_success and can be defined as "trigger this task when all directly upstream tasks have succeeded". All other rules described here are based on direct parent tasks, and are values that can be passed to any operator while creating tasks:
· all_success: (default) all parents have succeeded
· all_failed: all parents are in a failed or upstream_failed state
· all_done: all parents are done with their execution
· one_failed: fires as soon as at least one parent has failed; it does not wait for all parents to be done
· one_success: fires as soon as at least one parent succeeds; it does not wait for all parents to be done
· dummy: dependencies are just for show, trigger at will
Note that these can be used in conjunction with depends_on_past (boolean) that, when set to True, keeps a task from getting triggered if the previous schedule for the task hasn't succeeded.
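The rules above can be captured in a small evaluator. This is a toy model, not Airflow's dependency engine; it only maps each rule name to a predicate over the states of a task's direct parents:

```python
# Toy evaluator for the trigger rules listed above, given parent states.
def should_trigger(rule, parent_states):
    done = {'success', 'failed', 'upstream_failed', 'skipped'}
    if rule == 'all_success':
        return all(s == 'success' for s in parent_states)
    if rule == 'all_failed':
        return all(s in ('failed', 'upstream_failed') for s in parent_states)
    if rule == 'all_done':
        return all(s in done for s in parent_states)
    if rule == 'one_failed':
        return any(s == 'failed' for s in parent_states)
    if rule == 'one_success':
        return any(s == 'success' for s in parent_states)
    if rule == 'dummy':
        return True
    raise ValueError('unknown rule: %s' % rule)

print(should_trigger('all_success', ['success', 'skipped']))  # False
print(should_trigger('all_done', ['success', 'failed']))      # True
```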
Latest Run Only
Standard workflow behavior involves running a series of tasks for a particular date/time range. Some workflows, however, perform tasks that are independent of run time but need to be run on a schedule, much like a standard cron job. In these cases, backfills or running jobs missed during a pause just wastes CPU cycles.
For situations like this, you can use the LatestOnlyOperator to skip tasks that are not being run during the most recent scheduled run for a DAG. The LatestOnlyOperator skips all immediate downstream tasks, and itself, if the time right now is not between its execution_time and the next scheduled execution_time.
One must be aware of the interaction between skipped tasks and trigger rules. Skipped tasks will cascade through trigger rules all_success and all_failed, but not all_done, one_failed, one_success, and dummy. If you would like to use the LatestOnlyOperator with trigger rules that do not cascade skips, you will need to ensure that the LatestOnlyOperator is directly upstream of the task you would like to skip.
It is possible, through use of trigger rules, to mix tasks that should run in the typical date/time dependent mode and those using the LatestOnlyOperator.
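The window test performed by the LatestOnlyOperator can be sketched like this. is_latest_run is a hypothetical helper written under the assumption of a fixed schedule_interval, so "next scheduled execution_time" is just execution_time + interval:

```python
from datetime import datetime, timedelta

# Sketch of the LatestOnlyOperator condition: downstream tasks run only when
# "now" falls inside [execution_time, next scheduled execution_time).
def is_latest_run(now, execution_time, schedule_interval):
    return execution_time <= now < execution_time + schedule_interval

interval = timedelta(hours=4)
run = datetime(2021, 1, 1, 8, 0)
print(is_latest_run(datetime(2021, 1, 1, 9, 0), run, interval))  # True
print(is_latest_run(datetime(2021, 1, 2, 9, 0), run, interval))  # False
```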
For example, consider the following DAG:
# dags/latest_only_with_trigger.py
import datetime as dt

from airflow.models import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.latest_only_operator import LatestOnlyOperator
from airflow.utils.trigger_rule import TriggerRule


dag = DAG(
    dag_id='latest_only_with_trigger',
    schedule_interval=dt.timedelta(hours=4),
    start_date=dt.datetime(2016, 9, 20),
)

latest_only = LatestOnlyOperator(task_id='latest_only', dag=dag)

task1 = DummyOperator(task_id='task1', dag=dag)
task1.set_upstream(latest_only)

task2 = DummyOperator(task_id='task2', dag=dag)

task3 = DummyOperator(task_id='task3', dag=dag)
task3.set_upstream([task1, task2])

task4 = DummyOperator(task_id='task4', dag=dag,
                      trigger_rule=TriggerRule.ALL_DONE)
task4.set_upstream([task1, task2])
In the case of this dag, the latest_only task will show up as skipped for all runs except the latest run. task1 is directly downstream of latest_only and will also skip for all runs except the latest. task2 is entirely independent of latest_only and will run in all scheduled periods. task3 is downstream of task1 and task2, and because of the default trigger_rule being all_success will receive a cascaded skip from task1. task4 is downstream of task1 and task2, but since its trigger_rule is set to all_done, it will trigger as soon as task1 has been skipped (a valid completion state) and task2 has succeeded.

Zombies & Undeads
Task instances die all the time, usually as part of their normal life cycle, but sometimes unexpectedly.
Zombie tasks are characterized by the absence of a heartbeat (emitted by the job periodically) and a "running" status in the database. They can occur when a worker node can't reach the database, when Airflow processes are killed externally, or when a node gets rebooted, for instance. Zombie killing is performed periodically by the scheduler's process.
Undead processes are characterized by the existence of a process and a matching heartbeat, but Airflow isn't aware of this task as running in the database. This mismatch typically occurs as the state of the database is altered, most likely by deleting rows in the "Task Instances" view in the UI. Tasks are instructed to verify their state as part of the heartbeat routine, and terminate themselves upon figuring out that they are in this "undead" state.
Cluster Policy
Your local airflow settings file can define a policy function that has the ability to mutate task attributes based on other task or DAG attributes. It receives a single argument as a reference to task objects, and is expected to alter its attributes.
For example, this function could apply a specific queue property when using a specific operator, or enforce a task timeout policy, making sure that no tasks run for more than 48 hours. Here's an example of what this may look like inside your airflow_settings.py:
def policy(task):
    if task.__class__.__name__ == 'HivePartitionSensor':
        task.queue = "sensor_queue"
    if task.timeout > timedelta(hours=48):
        task.timeout = timedelta(hours=48)
Documentation & Notes
It's possible to add documentation or notes to your dags & task objects that become visible in the web interface ("Graph View" for dags, "Task Details" for tasks). There are a set of special task attributes that get rendered as rich content if defined:
attribute    rendered to
doc          monospace
doc_json     json
doc_yaml     yaml
doc_md       markdown
doc_rst      reStructuredText
Please note that for DAGs, doc_md is the only attribute interpreted.
This is especially useful if your tasks are built dynamically from configuration files; it allows you to expose the configuration that led to the related tasks in Airflow.

"""
### My great DAG
"""

dag = DAG('my_dag', default_args=default_args)
dag.doc_md = __doc__

t = BashOperator("foo", dag=dag)
t.doc_md = """\
# Title
Here's a [url](www.airbnb.com)
"""
This content will get rendered as markdown, respectively, in the Graph View and Task Details pages.
Jinja Templating
Airflow leverages the power of Jinja Templating, and this can be a powerful tool to use in combination with macros (see the Macros section).
For example, say you want to pass the execution date as an environment variable to a Bash script using the BashOperator:
# The execution date as YYYY-MM-DD
date = "{{ ds }}"
t = BashOperator(
    task_id='test_env',
    bash_command='/tmp/test.sh ',
    dag=dag,
    env={'EXECUTION_DATE': date})
Here, {{ ds }} is a macro, and because the env parameter of the BashOperator is templated with Jinja, the execution date will be available as an environment variable named EXECUTION_DATE in your Bash script.
You can use Jinja templating with every parameter that is marked as "templated" in the documentation. Template substitution occurs just before the pre_execute function of your operator is called.
Packaged dags
While often you will specify dags in a single .py file, it might sometimes be required to combine dags and their dependencies. For example, you might want to combine several dags together to version them together, or you might want to manage them together, or you might need an extra module that is not available by default on the system you are running airflow on. To allow this you can create a zip file that contains the dag(s) in the root of the zip file and have the extra modules unpacked in directories.
For instance you can create a zip file that looks like this:
my_dag1.py
my_dag2.py
package1/__init__.py
package1/functions.py
Airflow will scan the zip file and try to load my_dag1.py and my_dag2.py. It will not go into subdirectories, as these are considered to be potential packages.
In case you would like to add module dependencies to your DAG you basically would do the same, but then it is more suitable to use a virtualenv and pip:
virtualenv zip_dag
source zip_dag/bin/activate

mkdir zip_dag_contents
cd zip_dag_contents

pip install --install-option="--install-lib=$PWD" my_useful_package
cp ~/my_dag.py .

zip -r zip_dag.zip *
Note
The zip file will be inserted at the beginning of the module search list (sys.path), and as such it will be available to any other code that resides within the same interpreter.
Note
Packaged dags cannot be used with pickling turned on.
Note
Packaged dags cannot contain dynamic libraries (e.g. libz.so); these need to be available on the system if a module needs those. In other words, only pure python modules can be packaged.
.airflowignore
A .airflowignore file specifies the directories or files in DAG_FOLDER that Airflow should intentionally ignore. Each line in .airflowignore specifies a regular expression pattern, and directories or files whose names (not DAG id) match any of the patterns would be ignored (under the hood, re.findall() is used to match the pattern). Overall it works like a .gitignore file.
The .airflowignore file should be put in your DAG_FOLDER. For example, you can prepare a .airflowignore file with these contents:
project_a
tenant_[\d]
Then files like project_a_dag_1.py, TESTING_project_a.py, tenant_1.py, project_a/dag_1.py, and tenant_1/dag_1.py in your DAG_FOLDER would be ignored. (If a directory's name matches any of the patterns, this directory and all its subfolders would not be scanned by Airflow at all. This improves efficiency of DAG finding.)
The scope of a .airflowignore file is the directory it is in plus all its subfolders. You can also prepare a .airflowignore file for a subfolder in DAG_FOLDER and it would only be applicable for that subfolder.
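The matching behaviour can be demonstrated with re.findall() directly. is_ignored below is a hypothetical helper mirroring the rule described above, using the two example patterns:

```python
import re

# Each .airflowignore line is a regular expression; a file is ignored when
# re.findall() matches its name against any pattern.
def is_ignored(filename, patterns):
    return any(re.findall(p, filename) for p in patterns)

patterns = ['project_a', 'tenant_[\\d]']
print(is_ignored('project_a_dag_1.py', patterns))  # True
print(is_ignored('tenant_1.py', patterns))         # True
print(is_ignored('my_dag.py', patterns))           # False
```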
Data Profiling
Note
Adhoc Queries and Charts are no longer supported in the new FAB-based webserver and UI, due to security concerns.
Part of being productive with data is having the right weapons to profile the data you are working with. Airflow provides a simple query interface to write SQL and get results quickly, and a charting application letting you visualize data.
Ad Hoc Query
The adhoc query UI allows for simple SQL interactions with the database connections registered in Airflow.

Charts
A simple UI built on top of flask-admin and highcharts allows building data visualizations and charts easily. Fill in a form with a label, SQL, chart type, pick a source database from your environment's connections, select a few other options, and save it for later use.
You can even use the same templating and macros available when writing airflow pipelines, parameterizing your queries and modifying parameters directly in the URL.
These charts are basic, but they're easy to create, modify and share.

Chart Screenshot


Chart Form Screenshot

Command Line Interface
Airflow has a very rich command line interface that allows for many types of operation on a DAG, starting services, and supporting development and testing.
usage: airflow [-h]
{resetdb,render,variables,connections,users,pause,sync_perm,task_failed_deps,version,trigger_dag,initdb,test,unpause,list_dag_runs,dag_state,run,list_tasks,backfill,list_dags,kerberos,worker,webserver,flower,scheduler,task_state,pool,serve_logs,clear,next_execution,upgradedb,delete_dag}
...

Positional Arguments
subcommand
Possible choices: resetdb, render, variables, connections, users, pause, sync_perm, task_failed_deps, version, trigger_dag, initdb, test, unpause, list_dag_runs, dag_state, run, list_tasks, backfill, list_dags, kerberos, worker, webserver, flower, scheduler, task_state, pool, serve_logs, clear, next_execution, upgradedb, delete_dag
subcommand help
Sub-commands:
resetdb
Burn down and rebuild the metadata database
airflow resetdb [-h] [-y]
Named Arguments
-y, --yes
Do not prompt to confirm reset. Use with care!
Default: False
render
Render a task instance's template(s)
airflow render [-h] [-sd SUBDIR] dag_id task_id execution_date
Positional Arguments
dag_id
The id of the dag
task_id
The id of the task
execution_date
The execution date of the DAG
Named Arguments
-sd, --subdir
File location or directory from which to look for the dag. Defaults to '[AIRFLOW_HOME]/dags' where [AIRFLOW_HOME] is the value you set for 'AIRFLOW_HOME' config you set in 'airflow.cfg'
Default: [AIRFLOW_HOME]/dags
variables
CRUD operations on variables
airflow variables [-h] [-s KEY VAL] [-g KEY] [-j] [-d VAL] [-i FILEPATH]
[-e FILEPATH] [-x KEY]
Named Arguments
-s, --set
Set a variable
-g, --get
Get value of a variable
-j, --json
Deserialize JSON variable
Default: False
-d, --default
Default value returned if variable does not exist
-i, --import
Import variables from JSON file
-e, --export
Export variables to JSON file
-x, --delete
Delete a variable
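As a sketch of typical usage of the flags above (the variable name and paths are made up for illustration):

```shell
# set, read, and export variables in the metadata database
airflow variables -s data_dir /tmp/data        # set a variable
airflow variables -g data_dir                  # print its value
airflow variables -e /tmp/all_variables.json   # dump all variables to a JSON file
```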
connections
List/Add/Delete connections
airflow connections [-h] [-l] [-a] [-d] [--conn_id CONN_ID]
[--conn_uri CONN_URI] [--conn_extra CONN_EXTRA]
[--conn_type CONN_TYPE] [--conn_host CONN_HOST]
[--conn_login CONN_LOGIN] [--conn_password CONN_PASSWORD]
[--conn_schema CONN_SCHEMA] [--conn_port CONN_PORT]
Named Arguments
-l, --list
List all connections
Default: False
-a, --add
Add a connection
Default: False
-d, --delete
Delete a connection
Default: False
--conn_id
Connection id, required to add/delete a connection
--conn_uri
Connection URI, required to add a connection without conn_type
--conn_extra
Connection Extra field, optional when adding a connection
--conn_type
Connection type, required to add a connection without conn_uri
--conn_host
Connection host, optional when adding a connection
--conn_login
Connection login, optional when adding a connection
--conn_password
Connection password, optional when adding a connection
--conn_schema
Connection schema, optional when adding a connection
--conn_port
Connection port, optional when adding a connection
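A sketch of adding a connection by URI (the connection id, host and credentials are placeholders):

```shell
# list existing connections, then register a Postgres connection by URI
airflow connections -l
airflow connections -a --conn_id my_postgres \
    --conn_uri postgresql://user:pass@dbhost:5432/mydb
```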
users
List/Create/Delete users
airflow users [-h] [-l] [-c] [-d] [--username USERNAME] [--email EMAIL]
[--firstname FIRSTNAME] [--lastname LASTNAME] [--role ROLE]
[--password PASSWORD] [--use_random_password]
Named Arguments
-l, --list
List all users
Default: False
-c, --create
Create a user
Default: False
-d, --delete
Delete a user
Default: False
--username
Username of the user, required to create/delete a user
--email
Email of the user, required to create a user
--firstname
First name of the user, required to create a user
--lastname
Last name of the user, required to create a user
--role
Role of the user. Existing roles include Admin, User, Op, Viewer, and Public. Required to create a user
--password
Password of the user, required to create a user without --use_random_password
--use_random_password
Do not prompt for password. Use random string instead. Required to create a user without --password
Default: False
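For example, creating an Admin account for the RBAC UI might look like the following (all values are placeholders to adapt):

```shell
# create an Admin user; prefer --use_random_password in scripts
airflow users -c --username admin --email admin@example.com \
    --firstname Ad --lastname Min --role Admin --password changeme
```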
pause
Pause a DAG
airflow pause [-h] [-sd SUBDIR] dag_id
Positional Arguments
dag_id
The id of the dag
Named Arguments
-sd, --subdir
File location or directory from which to look for the dag. Defaults to '[AIRFLOW_HOME]/dags' where [AIRFLOW_HOME] is the value you set for 'AIRFLOW_HOME' config you set in 'airflow.cfg'
Default: [AIRFLOW_HOME]/dags
sync_perm
Update permissions for existing roles
airflow sync_perm [-h]
task_failed_deps
Returns the unmet dependencies for a task instance from the perspective of the scheduler. In other words, why a task instance doesn't get scheduled and then queued by the scheduler, and then run by an executor.
airflow task_failed_deps [-h] [-sd SUBDIR] dag_id task_id execution_date
Positional Arguments
dag_id
The id of the dag
task_id
The id of the task
execution_date
The execution date of the DAG
Named Arguments
-sd, --subdir
File location or directory from which to look for the dag. Defaults to '[AIRFLOW_HOME]/dags' where [AIRFLOW_HOME] is the value you set for 'AIRFLOW_HOME' config you set in 'airflow.cfg'
Default: [AIRFLOW_HOME]/dags
version
Show the version
airflow version [-h]
trigger_dag
Trigger a DAG run
airflow trigger_dag [-h] [-sd SUBDIR] [-r RUN_ID] [-c CONF] [-e EXEC_DATE]
dag_id
Positional Arguments
dag_id
The id of the dag
Named Arguments
-sd, --subdir
File location or directory from which to look for the dag. Defaults to '[AIRFLOW_HOME]/dags' where [AIRFLOW_HOME] is the value you set for 'AIRFLOW_HOME' config you set in 'airflow.cfg'
Default: [AIRFLOW_HOME]/dags
-r, --run_id
Helps to identify this run
-c, --conf
JSON string that gets pickled into the DagRun's conf attribute
-e, --exec_date
The execution date of the DAG
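A sketch of triggering a run manually with a run id and a conf payload the tasks can read (the dag id "tutorial" matches the tutorial DAG defined later in this chapter; the run id and conf keys are illustrative):

```shell
# trigger a manual run, tagging it and passing parameters via conf
airflow trigger_dag tutorial -r manual_2016_01_01 \
    -c '{"key": "value"}'
```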
initdb
Initialize the metadata database
airflow initdb [-h]
test
Test a task instance. This will run a task without checking for dependencies or recording its state in the database.
airflow test [-h] [-sd SUBDIR] [-dr] [-tp TASK_PARAMS]
dag_id task_id execution_date
Positional Arguments
dag_id
The id of the dag
task_id
The id of the task
execution_date
The execution date of the DAG
Named Arguments
-sd, --subdir
File location or directory from which to look for the dag. Defaults to '[AIRFLOW_HOME]/dags' where [AIRFLOW_HOME] is the value you set for 'AIRFLOW_HOME' config you set in 'airflow.cfg'
Default: [AIRFLOW_HOME]/dags
-dr, --dry_run
Perform a dry run
Default: False
-tp, --task_params
Sends a JSON params dict to the task
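A sketch of running one task in isolation while iterating on DAG code (the task id "print_date" is illustrative, not defined in this chapter's example DAG):

```shell
# run a single task for one execution date; nothing is recorded in the metadata DB
airflow test tutorial print_date 2015-12-01
```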
unpause
Resume a paused DAG
airflow unpause [-h] [-sd SUBDIR] dag_id
Positional Arguments
dag_id
The id of the dag
Named Arguments
-sd, --subdir
File location or directory from which to look for the dag. Defaults to '[AIRFLOW_HOME]/dags' where [AIRFLOW_HOME] is the value you set for 'AIRFLOW_HOME' config you set in 'airflow.cfg'
Default: [AIRFLOW_HOME]/dags
list_dag_runs
List dag runs given a DAG id. If state option is given, it will only search for all the dagruns with the given state. If no_backfill option is given, it will filter out all backfill dagruns for given dag id.
airflow list_dag_runs [-h] [--no_backfill] [--state STATE] dag_id
Positional Arguments
dag_id
The id of the dag
Named Arguments
--no_backfill
filter all the backfill dagruns given the dag id
Default: False
--state
Only list the dag runs corresponding to the state
dag_state
Get the status of a dag run
airflow dag_state [-h] [-sd SUBDIR] dag_id execution_date
Positional Arguments
dag_id
The id of the dag
execution_date
The execution date of the DAG
Named Arguments
-sd, --subdir
File location or directory from which to look for the dag. Defaults to '[AIRFLOW_HOME]/dags' where [AIRFLOW_HOME] is the value you set for 'AIRFLOW_HOME' config you set in 'airflow.cfg'
Default: [AIRFLOW_HOME]/dags
run
Run a single task instance
airflow run [-h] [-sd SUBDIR] [-m] [-f] [--pool POOL] [--cfg_path CFG_PATH]
[-l] [-A] [-i] [-I] [--ship_dag] [-p PICKLE] [-int]
dag_id task_id execution_date
Positional Arguments
dag_id
The id of the dag
task_id
The id of the task
execution_date
The execution date of the DAG
Named Arguments
-sd, --subdir
File location or directory from which to look for the dag. Defaults to '[AIRFLOW_HOME]/dags' where [AIRFLOW_HOME] is the value you set for 'AIRFLOW_HOME' config you set in 'airflow.cfg'
Default: [AIRFLOW_HOME]/dags
-m, --mark_success
Mark jobs as succeeded without running them
Default: False
-f, --force
Ignore previous task instance state, rerun regardless if task already succeeded/failed
Default: False
--pool
Resource pool to use
--cfg_path
Path to config file to use instead of airflow.cfg
-l, --local
Run the task using the LocalExecutor
Default: False
-A, --ignore_all_dependencies
Ignores all non-critical dependencies, including ignore_ti_state and ignore_task_deps
Default: False
-i, --ignore_dependencies
Ignore task-specific dependencies, e.g. upstream, depends_on_past, and retry delay dependencies
Default: False
-I, --ignore_depends_on_past
Ignore depends_on_past dependencies (but respect upstream dependencies)
Default: False
--ship_dag
Pickles (serializes) the DAG and ships it to the worker
Default: False
-p, --pickle
Serialized pickle object of the entire dag (used internally)
-int, --interactive
Do not capture standard output and error streams (useful for interactive debugging)
Default: False
list_tasks
List the tasks within a DAG
airflow list_tasks [-h] [-t] [-sd SUBDIR] dag_id
Positional Arguments
dag_id
The id of the dag
Named Arguments
-t, --tree
Tree view
Default: False
-sd, --subdir
File location or directory from which to look for the dag. Defaults to '[AIRFLOW_HOME]/dags' where [AIRFLOW_HOME] is the value you set for 'AIRFLOW_HOME' config you set in 'airflow.cfg'
Default: [AIRFLOW_HOME]/dags
backfill
Run subsections of a DAG for a specified date range. If reset_dag_run option is used, backfill will first prompt users whether airflow should clear all the previous dag_run and task_instances within the backfill date range. If rerun_failed_tasks is used, backfill will auto re-run the previous failed task instances within the backfill date range.
airflow backfill [-h] [-t TASK_REGEX] [-s START_DATE] [-e END_DATE] [-m] [-l]
[-x] [-i] [-I] [-sd SUBDIR] [--pool POOL]
[--delay_on_limit DELAY_ON_LIMIT] [-dr] [-v] [-c CONF]
[--reset_dagruns] [--rerun_failed_tasks]
dag_id
Positional Arguments
dag_id
The id of the dag
Named Arguments
-t, --task_regex
The regex to filter specific task_ids to backfill (optional)
-s, --start_date
Override start_date YYYY-MM-DD
-e, --end_date
Override end_date YYYY-MM-DD
-m, --mark_success
Mark jobs as succeeded without running them
Default: False
-l, --local
Run the task using the LocalExecutor
Default: False
-x, --donot_pickle
Do not attempt to pickle the DAG object to send over to the workers, just tell the workers to run their version of the code
Default: False
-i, --ignore_dependencies
Skip upstream tasks, run only the tasks matching the regexp. Only works in conjunction with task_regex
Default: False
-I, --ignore_first_depends_on_past
Ignores depends_on_past dependencies for the first set of tasks only (subsequent executions in the backfill DO respect depends_on_past)
Default: False
-sd, --subdir
File location or directory from which to look for the dag. Defaults to '[AIRFLOW_HOME]/dags' where [AIRFLOW_HOME] is the value you set for 'AIRFLOW_HOME' config you set in 'airflow.cfg'
Default: [AIRFLOW_HOME]/dags
--pool
Resource pool to use
--delay_on_limit
Amount of time in seconds to wait when the limit on maximum active dag runs (max_active_runs) has been reached before trying to execute a dag run again
Default: 1.0
-dr, --dry_run
Perform a dry run
Default: False
-v, --verbose
Make logging output more verbose
Default: False
-c, --conf
JSON string that gets pickled into the DagRun's conf attribute
--reset_dagruns
if set, the backfill will delete existing backfill-related DAG runs and start anew with fresh, running DAG runs
Default: False
--rerun_failed_tasks
if set, the backfill will auto-rerun all the failed tasks for the backfill date range instead of throwing exceptions
Default: False
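A sketch of re-running part of a DAG over a week, restricted to task ids matching a regex (the dag id and regex are illustrative):

```shell
# backfill matching tasks over a date range, re-running previously failed ones
airflow backfill tutorial -t "extract.*" -s 2015-12-01 -e 2015-12-07 \
    --rerun_failed_tasks
```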
list_dags
List all the DAGs
airflow list_dags [-h] [-sd SUBDIR] [-r]
Named Arguments
-sd, --subdir
File location or directory from which to look for the dag. Defaults to '[AIRFLOW_HOME]/dags' where [AIRFLOW_HOME] is the value you set for 'AIRFLOW_HOME' config you set in 'airflow.cfg'
Default: [AIRFLOW_HOME]/dags
-r, --report
Show DagBag loading report
Default: False
kerberos
Start a kerberos ticket renewer
airflow kerberos [-h] [-kt [KEYTAB]] [--pid [PID]] [-D] [--stdout STDOUT]
[--stderr STDERR] [-l LOG_FILE]
[principal]
Positional Arguments
principal
kerberos principal
Default: airflow
Named Arguments
-kt, --keytab
keytab
Default: airflow.keytab
--pid
PID file location
-D, --daemon
Daemonize instead of running in the foreground
Default: False
--stdout
Redirect stdout to this file
--stderr
Redirect stderr to this file
-l, --log-file
Location of the log file
worker
Start a Celery worker node
airflow worker [-h] [-p] [-q QUEUES] [-c CONCURRENCY] [-cn CELERY_HOSTNAME]
[--pid [PID]] [-D] [--stdout STDOUT] [--stderr STDERR]
[-l LOG_FILE] [-a AUTOSCALE]
Named Arguments
-p, --do_pickle
Attempt to pickle the DAG object to send over to the workers, instead of letting workers run their version of the code
Default: False
-q, --queues
Comma delimited list of queues to serve
Default: default
-c, --concurrency
The number of worker processes
Default: 16
-cn, --celery_hostname
Set the hostname of celery worker if you have multiple workers on a single machine
--pid
PID file location
-D, --daemon
Daemonize instead of running in the foreground
Default: False
--stdout
Redirect stdout to this file
--stderr
Redirect stderr to this file
-l, --log-file
Location of the log file
-a, --autoscale
Minimum and Maximum number of worker to autoscale
webserver
Start a Airflow webserver instance
airflow webserver [-h] [-p PORT] [-w WORKERS]
[-k {sync,eventlet,gevent,tornado}] [-t WORKER_TIMEOUT]
[-hn HOSTNAME] [--pid [PID]] [-D] [--stdout STDOUT]
[--stderr STDERR] [-A ACCESS_LOGFILE] [-E ERROR_LOGFILE]
[-l LOG_FILE] [--ssl_cert SSL_CERT] [--ssl_key SSL_KEY] [-d]
Named Arguments
-p, --port
The port on which to run the server
Default: 8080
-w, --workers
Number of workers to run the webserver on
Default: 4
-k, --workerclass
Possible choices: sync, eventlet, gevent, tornado
The worker class to use for Gunicorn
Default: sync
-t, --worker_timeout
The timeout for waiting on webserver workers
Default: 120
-hn, --hostname
Set the hostname on which to run the web server
Default: 0.0.0.0
--pid
PID file location
-D, --daemon
Daemonize instead of running in the foreground
Default: False
--stdout
Redirect stdout to this file
--stderr
Redirect stderr to this file
-A, --access_logfile
The logfile to store the webserver access log. Use '-' to print to stderr.
Default: -
-E, --error_logfile
The logfile to store the webserver error log. Use '-' to print to stderr.
Default: -
-l, --log-file
Location of the log file
--ssl_cert
Path to the SSL certificate for the webserver
--ssl_key
Path to the key to use with the SSL certificate
-d, --debug
Use the server that ships with Flask in debug mode
Default: False
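A sketch of starting the webserver as a background daemon on the default port:

```shell
# serve the UI on port 8080, daemonized
airflow webserver -p 8080 -D
```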
flower
Start a Celery Flower
airflow flower [-h] [-hn HOSTNAME] [-p PORT] [-fc FLOWER_CONF] [-u URL_PREFIX]
[-a BROKER_API] [--pid [PID]] [-D] [--stdout STDOUT]
[--stderr STDERR] [-l LOG_FILE]
Named Arguments
-hn, --hostname
Set the hostname on which to run the server
Default: 0.0.0.0
-p, --port
The port on which to run the server
Default: 5555
-fc, --flower_conf
Configuration file for flower
-u, --url_prefix
URL prefix for Flower
-a, --broker_api
Broker api
--pid
PID file location
-D, --daemon
Daemonize instead of running in the foreground
Default: False
--stdout
Redirect stdout to this file
--stderr
Redirect stderr to this file
-l, --log-file
Location of the log file
scheduler
Start a scheduler instance
airflow scheduler [-h] [-d DAG_ID] [-sd SUBDIR] [-r RUN_DURATION]
[-n NUM_RUNS] [-p] [--pid [PID]] [-D] [--stdout STDOUT]
[--stderr STDERR] [-l LOG_FILE]
Named Arguments
-d, --dag_id
The id of the dag to run
-sd, --subdir
File location or directory from which to look for the dag. Defaults to '[AIRFLOW_HOME]/dags' where [AIRFLOW_HOME] is the value you set for 'AIRFLOW_HOME' config you set in 'airflow.cfg'
Default: [AIRFLOW_HOME]/dags
-r, --run-duration
Set number of seconds to execute before exiting
-n, --num_runs
Set the number of runs to execute before exiting
Default: -1
-p, --do_pickle
Attempt to pickle the DAG object to send over to the workers, instead of letting workers run their version of the code
Default: False
--pid
PID file location
-D, --daemon
Daemonize instead of running in the foreground
Default: False
--stdout
Redirect stdout to this file
--stderr
Redirect stderr to this file
-l, --log-file
Location of the log file
task_state
Get the status of a task instance
airflow task_state [-h] [-sd SUBDIR] dag_id task_id execution_date
Positional Arguments
dag_id
The id of the dag
task_id
The id of the task
execution_date
The execution date of the DAG
Named Arguments
-sd, --subdir
File location or directory from which to look for the dag. Defaults to '[AIRFLOW_HOME]/dags' where [AIRFLOW_HOME] is the value you set for 'AIRFLOW_HOME' config you set in 'airflow.cfg'
Default: [AIRFLOW_HOME]/dags
pool
CRUD operations on pools
airflow pool [-h] [-s NAME SLOT_COUNT POOL_DESCRIPTION] [-g NAME] [-x NAME]
[-i FILEPATH] [-e FILEPATH]
Named Arguments
-s, --set
Set pool slot count and description, respectively
-g, --get
Get pool info
-x, --delete
Delete a pool
-i, --import
Import pool from JSON file
-e, --export
Export pool to JSON file
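A sketch of creating and inspecting a pool (the pool name, slot count and description are illustrative):

```shell
# create a pool with 5 slots, inspect it, then export all pools to JSON
airflow pool -s hive_pool 5 "Limits concurrent Hive queries"
airflow pool -g hive_pool
airflow pool -e /tmp/pools.json
```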
serve_logs
Serve logs generated by worker
airflow serve_logs [-h]
clear
Clear a set of task instance, as if they never ran
airflow clear [-h] [-t TASK_REGEX] [-s START_DATE] [-e END_DATE] [-sd SUBDIR]
[-u] [-d] [-c] [-f] [-r] [-x] [-xp] [-dx]
dag_id
Positional Arguments
dag_id
The id of the dag
Named Arguments
-t, --task_regex
The regex to filter specific task_ids to backfill (optional)
-s, --start_date
Override start_date YYYY-MM-DD
-e, --end_date
Override end_date YYYY-MM-DD
-sd, --subdir
File location or directory from which to look for the dag. Defaults to '[AIRFLOW_HOME]/dags' where [AIRFLOW_HOME] is the value you set for 'AIRFLOW_HOME' config you set in 'airflow.cfg'
Default: [AIRFLOW_HOME]/dags
-u, --upstream
Include upstream tasks
Default: False
-d, --downstream
Include downstream tasks
Default: False
-c, --no_confirm
Do not request confirmation
Default: False
-f, --only_failed
Only failed jobs
Default: False
-r, --only_running
Only running jobs
Default: False
-x, --exclude_subdags
Exclude subdags
Default: False
-xp, --exclude_parentdag
Exclude ParentDAGS if the task cleared is a part of a SubDAG
Default: False
-dx, --dag_regex
Search dag_id as regex instead of exact string
Default: False
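A sketch of wiping the state of failed task instances over a date range, including their downstream tasks, without a confirmation prompt (dag id and dates are illustrative):

```shell
# clear failed tasks (and downstream) for one week so the scheduler re-runs them
airflow clear tutorial -s 2015-12-01 -e 2015-12-07 -f -d -c
```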
next_execution
Get the next execution datetime of a DAG
airflow next_execution [-h] [-sd SUBDIR] dag_id
Positional Arguments
dag_id
The id of the dag
Named Arguments
-sd, --subdir
File location or directory from which to look for the dag. Defaults to '[AIRFLOW_HOME]/dags' where [AIRFLOW_HOME] is the value you set for 'AIRFLOW_HOME' config you set in 'airflow.cfg'
Default: [AIRFLOW_HOME]/dags
upgradedb
Upgrade the metadata database to latest version
airflow upgradedb [-h]
delete_dag
Delete all DB records related to the specified DAG
airflow delete_dag [-h] [-y] dag_id
Positional Arguments
dag_id
The id of the dag
Named Arguments
-y, --yes
Do not prompt to confirm reset. Use with care!
Default: False
Scheduling & Triggers
The Airflow scheduler monitors all tasks and all DAGs, and triggers the task instances whose dependencies have been met. Behind the scenes, it stays in sync with a folder for all DAG objects it may contain, and periodically (every minute or so) inspects active tasks to see whether they can be triggered.
The Airflow scheduler is designed to run as a persistent service in an Airflow production environment. To kick it off, all you need to do is execute airflow scheduler. It will use the configuration specified in airflow.cfg.
Note that if you run a DAG on a schedule_interval of one day, the run stamped 2016-01-01 will be triggered soon after 2016-01-01T23:59. In other words, the job instance is started once the period it covers has ended.
Let's Repeat That: the scheduler runs your job one schedule_interval AFTER the start date, at the END of the period.
The scheduler starts an instance of the executor specified in your airflow.cfg. If it happens to be the LocalExecutor, tasks will be executed as subprocesses; in the case of CeleryExecutor and MesosExecutor, tasks are executed remotely.
To start a scheduler, simply run the command:
airflow scheduler
DAG Runs
A DAG Run is an object representing an instantiation of the DAG in time.
Each DAG may or may not have a schedule, which informs how DAG Runs are created. schedule_interval is defined as a DAG argument, and receives preferably a cron expression as a str, or a datetime.timedelta object. Alternatively, you can also use one of these cron presets:
preset
meaning
cron
None
Don't schedule; use for exclusively "externally triggered" DAGs
 
@once
Schedule once and only once
 
@hourly
Run once an hour at the beginning of the hour
0 * * * *
@daily
Run once a day at midnight
0 0 * * *
@weekly
Run once a week at midnight on Sunday morning
0 0 * * 0
@monthly
Run once a month at midnight of the first day of the month
0 0 1 * *
@yearly
Run once a year at midnight of January 1
0 0 1 1 *
Note: Use schedule_interval=None and not schedule_interval='None' when you don't want to schedule your DAG.
Your DAG will be instantiated for each schedule, while creating a DAG Run entry for each schedule.
DAG Runs have a state associated to them (running, failed, success) and inform the scheduler on which set of schedules should be evaluated for task submissions. Without the metadata at the DAG run level, the Airflow scheduler would have much more work to do in order to figure out what tasks should be triggered and come to a crawl. It might also create undesired processing when changing the shape of your DAG, by say adding in new tasks.
Backfill and Catchup
An Airflow DAG with a start_date, possibly an end_date, and a schedule_interval defines a series of intervals which the scheduler turns into individual Dag Runs and executes. A key capability of Airflow is that these DAG Runs are atomic, idempotent items, and the scheduler, by default, will examine the lifetime of the DAG (from start to end/now, one interval at a time) and kick off a DAG Run for any interval that has not been run (or has been cleared). This concept is called Catchup.
If your DAG is written to handle its own catchup (i.e. not limited to the interval, but instead to "Now" for instance), then you will want to turn catchup off (either on the DAG itself with dag.catchup = False, or by default at the configuration file level with catchup_by_default = False). What this will do is instruct the scheduler to only create a DAG Run for the most current instance of the DAG interval series.

"""
Code that goes along with the Airflow tutorial located at:
https://github.com/apache/incubator-airflow/blob/master/airflow/example_dags/tutorial.py
"""
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from datetime import datetime, timedelta


default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2015, 12, 1),
    'email': ['airflow@example.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

dag = DAG(
    'tutorial',
    default_args=default_args,
    description='A simple tutorial DAG',
    schedule_interval='@hourly',
    catchup=False)
In the example above, if the DAG is picked up by the scheduler daemon on 2016-01-02 at 6 AM (or from the command line), a single DAG Run will be created, with an execution_date of 2016-01-01, and the next one will be created just after midnight on the morning of 2016-01-03, with an execution date of 2016-01-02.
If the dag.catchup value had been True instead, the scheduler would have created a DAG Run for each completed interval between 2015-12-01 and 2016-01-02 (but not yet one for 2016-01-02, as that interval hasn't completed), and the scheduler will execute them sequentially. This behavior is great for atomic datasets that can easily be split into periods. Turning catchup off is great if your DAG Runs perform backfill internally.
External Triggers
Note that DAG Runs can also be created manually through the CLI while running an airflow trigger_dag command, where you can define a specific run_id. The DAG Runs created externally to the scheduler get associated to the trigger's timestamp, and will be displayed in the UI alongside scheduled DAG runs.
In addition, you can also manually trigger a DAG Run using the web UI (tab "DAGs" -> column "Links" -> button "Trigger Dag").
To Keep in Mind
· The first DAG Run is created based on the minimum start_date for the tasks in your DAG.
· Subsequent DAG Runs are created by the scheduler process, based on your DAG's schedule_interval, sequentially.
· When clearing a set of tasks' state in hope of getting them to re-run, it is important to keep in mind the DAG Run's state too, as it defines whether the scheduler should look into triggering tasks for that run.
Here are some of the ways you can unblock tasks:
· From the UI, you can clear (as in delete the status of) individual task instances from the task instances dialog, while defining whether you want to include the past/future and the upstream/downstream dependencies. Note that a confirmation window comes next and allows you to see the set you are about to clear. You can also clear all task instances associated with the dag.
· The CLI command airflow clear -h has lots of options when it comes to clearing task instance states, including specifying date ranges, targeting task_ids by specifying a regular expression, flags for including upstream and downstream relatives, and targeting task instances in specific states (failed, or success).
· Clearing a task instance will no longer delete the task instance record. Instead, it updates max_tries and sets the current task instance state to be None.
· Marking task instances as failed can be done through the UI. This can be used to stop running task instances.
· Marking task instances as successful can be done through the UI. This is mostly to fix false negatives, or for instance, when the fix has been applied outside of Airflow.
· The airflow backfill CLI subcommand has a flag to --mark_success and allows selecting subsections of the DAG, as well as specifying date ranges.
Plugins
Airflow has a simple plugin manager built-in that can integrate external features to its core by simply dropping files in your $AIRFLOW_HOME/plugins folder.
The python modules in the plugins folder get imported, and hooks, operators, sensors, macros, executors and web views get integrated to Airflow's main collections and become available for use.
What for?
Airflow offers a generic toolbox for working with data. Different organizations have different stacks and different needs. Using Airflow plugins can be a way for companies to customize their Airflow installation to reflect their ecosystem.
Plugins can be used as an easy way to write, share and activate new sets of features.
There's also a need for a set of more complex applications to interact with different flavors of data and metadata.
Examples:
· A set of tools to parse Hive logs and expose Hive metadata (CPU / IO / phases / skew / ...)
· An anomaly detection framework, allowing people to collect metrics, set thresholds and alerts
· An auditing tool, helping understand who accesses what
· A config-driven SLA monitoring tool, allowing you to set monitored tables and at what time they should land, alert people, and expose visualizations of outages
· ...
Why build on top of Airflow?
Airflow has many components that can be reused when building an application:
· A web server you can use to render your views
· A metadata database to store your models
· Access to your databases, and knowledge of how to connect to them
· An array of workers that your application can push workload to
· Airflow is deployed, you can just piggy back on its deployment logistics
· Basic charting capabilities, underlying libraries and abstractions
Interface
To create a plugin you will need to derive the airflow.plugins_manager.AirflowPlugin class and reference the objects you want to plug into Airflow. Here's what the class you need to derive looks like:
class AirflowPlugin(object):
    # The name of your plugin (str)
    name = None
    # A list of class(es) derived from BaseOperator
    operators = []
    # A list of class(es) derived from BaseSensorOperator
    sensors = []
    # A list of class(es) derived from BaseHook
    hooks = []
    # A list of class(es) derived from BaseExecutor
    executors = []
    # A list of references to inject into the macros namespace
    macros = []
    # A list of objects created from a class derived
    # from flask_admin.BaseView
    admin_views = []
    # A list of Blueprint object created from flask.Blueprint
    flask_blueprints = []
    # A list of menu links (flask_admin.base.MenuLink)
    menu_links = []
You can derive it by inheritance (please refer to the example below). Please note, name inside this class must be specified.
After the plugin is imported into Airflow, you can invoke it using a statement like:
from airflow.{type, like "operators", "sensors"}.{name specified inside the plugin class} import *
When you write your own plugins, make sure you understand them well. There are some essential properties for each type of plugin. For example,
· For Operator plugin, an execute method is compulsory.
· For Sensor plugin, a poke method returning a Boolean value is compulsory.
Example
The code below defines a plugin that injects a set of dummy object definitions in Airflow.
# This is the class you derive to create a plugin
from airflow.plugins_manager import AirflowPlugin

from flask import Blueprint
from flask_admin import BaseView, expose
from flask_admin.base import MenuLink

# Importing base classes that we need to derive
from airflow.hooks.base_hook import BaseHook
from airflow.models import BaseOperator
from airflow.sensors.base_sensor_operator import BaseSensorOperator
from airflow.executors.base_executor import BaseExecutor

# Will show up under airflow.hooks.test_plugin.PluginHook
class PluginHook(BaseHook):
    pass

# Will show up under airflow.operators.test_plugin.PluginOperator
class PluginOperator(BaseOperator):
    pass

# Will show up under airflow.sensors.test_plugin.PluginSensorOperator
class PluginSensorOperator(BaseSensorOperator):
    pass

# Will show up under airflow.executors.test_plugin.PluginExecutor
class PluginExecutor(BaseExecutor):
    pass

# Will show up under airflow.macros.test_plugin.plugin_macro
def plugin_macro():
    pass

# Creating a flask admin BaseView
class TestView(BaseView):
    @expose('/')
    def test(self):
        # in this example, put your test_plugin/test.html template at airflow/plugins/templates/test_plugin/test.html
        return self.render("test_plugin/test.html", content="Hello galaxy!")
v = TestView(category="Test Plugin", name="Test View")

# Creating a flask blueprint to integrate the templates and static folder
bp = Blueprint(
    "test_plugin", __name__,
    template_folder='templates',  # registers airflow/plugins/templates as a Jinja template folder
    static_folder='static',
    static_url_path='/static/test_plugin')

ml = MenuLink(
    category='Test Plugin',
    name='Test Menu Link',
    url='https://airflow.incubator.apache.org/')

# Defining the plugin class
class AirflowTestPlugin(AirflowPlugin):
    name = "test_plugin"
    operators = [PluginOperator]
    sensors = [PluginSensorOperator]
    hooks = [PluginHook]
    executors = [PluginExecutor]
    macros = [plugin_macro]
    admin_views = [v]
    flask_blueprints = [bp]
    menu_links = [ml]
Security
By default, all gates are opened. An easy way to restrict access to the web application is to do it at the network level, or by using SSH tunnels.
It is however possible to switch on authentication by either using one of the supplied backends or creating your own.
Be sure to checkout Experimental Rest API for securing the API.
Note
Airflow uses the config parser of Python. This config parser interpolates '%'-signs. Make sure escape any % signs in your config file (but not environment variables) as %%, otherwise Airflow might leak these passwords on a config parser exception to a log.
Web Authentication
Password
Note
This is for flask-admin based web UI only. If you are using FAB-based web UI with RBAC feature, please use command line interface airflow users create to create accounts, or do that in the FAB-based UI itself.
One of the simplest mechanisms for authentication is requiring users to specify a password before logging in. Password authentication requires the use of the password subpackage in your requirements file. Password hashing uses bcrypt before storing passwords.
[webserver]
authenticate = True
auth_backend = airflow.contrib.auth.backends.password_auth
When password auth is enabled, an initial user credential will need to be created before anyone can login. An initial user was not created in the migrations for this authentication backend to prevent default Airflow installations from attack. Creating a new user has to be done via a Python REPL on the same machine Airflow is installed.
# navigate to the airflow installation directory
cd ~/airflow
python
Python 2.7.9 (default, Feb 10 2015, 03:28:08)
Type "help", "copyright", "credits" or "license" for more information.
>>> import airflow
>>> from airflow import models, settings
>>> from airflow.contrib.auth.backends.password_auth import PasswordUser
>>> user = PasswordUser(models.User())
>>> user.username = 'new_user_name'
>>> user.email = 'new_user_email@example.com'
>>> user.password = 'set_the_password'
>>> session = settings.Session()
>>> session.add(user)
>>> session.commit()
>>> session.close()
>>> exit()
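The backend stores a bcrypt hash rather than the password itself. bcrypt is a third-party package, so as a stdlib-only sketch of the same idea (salted, slow key derivation; only the salt and digest are persisted), here is a hypothetical helper pair — the function names and parameters are illustrative, not Airflow's actual implementation:

```python
import hashlib
import hmac
import os
from typing import Optional, Tuple

def hash_password(password: str, salt: Optional[bytes] = None) -> Tuple[bytes, bytes]:
    # Hypothetical helper: derive a salted digest; only (salt, digest) is stored.
    salt = os.urandom(16) if salt is None else salt
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)
    return salt, digest

def verify_password(password: str, salt: bytes, digest: bytes) -> bool:
    # Re-derive with the stored salt and compare in constant time.
    candidate = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)
    return hmac.compare_digest(candidate, digest)

salt, digest = hash_password("set_the_password")
print(verify_password("set_the_password", salt, digest))  # True
print(verify_password("wrong_password", salt, digest))    # False
```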
LDAP
To turn on LDAP authentication, configure your airflow.cfg as follows. Please note that the example uses an encrypted connection to the ldap server, as you probably do not want passwords to be readable on the network level. It is however possible to configure without encryption if you really want to.
Additionally, if you are using Active Directory and are not explicitly specifying an OU that your users are in, you will need to change search_scope to SUBTREE.
Valid search_scope options can be found in the ldap3 Documentation.
[webserver]
authenticate = True
auth_backend = airflow.contrib.auth.backends.ldap_auth

[ldap]
# set a connection without encryption: uri = ldap://<your.ldap.server>:<port>
uri = ldaps://<your.ldap.server>:<port>
user_filter = objectClass=*
# in case of Active Directory you would use: user_name_attr = sAMAccountName
user_name_attr = uid
# group_member_attr should be set accordingly with *_filter
# eg
#     group_member_attr = groupMembership
#     superuser_filter = groupMembership=CN=airflow-super-users...
group_member_attr = memberOf
superuser_filter = memberOf=CN=airflow-super-users,OU=Groups,OU=RWC,OU=US,OU=NORAM,DC=example,DC=com
data_profiler_filter = memberOf=CN=airflow-data-profilers,OU=Groups,OU=RWC,OU=US,OU=NORAM,DC=example,DC=com
bind_user = cn=Manager,dc=example,dc=com
bind_password = insecure
basedn = dc=example,dc=com
cacert = /etc/ca/ldap_ca.crt
# Set search_scope to one of them: BASE, LEVEL, SUBTREE
# Set search_scope to SUBTREE if using Active Directory, and not specifying an Organizational Unit
search_scope = LEVEL
The superuser_filter and data_profiler_filter are optional. If defined, these configurations allow you to specify LDAP groups that users must belong to in order to have superuser (admin) and data-profiler permissions. If undefined, all users will be superusers and data profilers.
Roll your own
Airflow uses flask_login and exposes a set of hooks in the airflow.default_login module. You can alter the content, make it part of the PYTHONPATH, and configure it as a backend in airflow.cfg.
[webserver]
authenticate = True
auth_backend = mypackage.auth
Multi-tenancy
You can filter the list of dags in webserver by owner name, when authentication is turned on, by setting webserver:filter_by_owner in your config. With this, a user will see only the dags which it is owner of, unless it is a superuser.
[webserver]
filter_by_owner = True
Kerberos
Airflow has initial support for Kerberos. This means that airflow can renew kerberos tickets for itself and store them in the ticket cache. The hooks and dags can make use of the ticket to authenticate against kerberized services.
Limitations
Please note that at this time, not all hooks have been adjusted to make use of this functionality. Also, it does not integrate kerberos into the web interface, and you will have to rely on network level security for now to make sure your service remains secure.
Celery integration has not been tried and tested yet. However, if you generate a key tab for every host and launch a ticket renewer next to every worker, it will most likely work.
Enabling kerberos
Airflow
To enable kerberos you will need to generate a (service) key tab.
# in the kadmin.local or kadmin shell, create the airflow principal
kadmin: addprinc -randkey airflow/fully.qualified.domain.name@YOUR-REALM.COM

# Create the airflow keytab file that will contain the airflow principal
kadmin: xst -norandkey -k airflow.keytab airflow/fully.qualified.domain.name
Now store this file in a location where the airflow user can read it (chmod 600), and then add the following to your airflow.cfg:
[core]
security = kerberos

[kerberos]
keytab = /etc/airflow/airflow.keytab
reinit_frequency = 3600
principal = airflow
Launch the ticket renewer by
# run ticket renewer
airflow kerberos
Hadoop
If you want to use impersonation, this needs to be enabled in the core-site.xml of your hadoop config.

<property>
  <name>hadoop.proxyuser.airflow.groups</name>
  <value>*</value>
</property>

<property>
  <name>hadoop.proxyuser.airflow.users</name>
  <value>*</value>
</property>

<property>
  <name>hadoop.proxyuser.airflow.hosts</name>
  <value>*</value>
</property>
Of course, if you need to tighten your security, replace the asterisk with something more appropriate.
Using kerberos authentication
The hive hook has been updated to take advantage of kerberos authentication. To allow your DAGs to use it, simply update the connection details with, for example:
{ "use_beeline": true, "principal": "hive/_HOST@EXAMPLE.COM"}
Adjust the principal to your settings. The _HOST part will be replaced by the fully qualified domain name of the server.
You can specify if you would like to use the dag owner as the user for the connection, or the user specified in the login section of the connection. For the login user, specify the following as extra:
{ "use_beeline": true, "principal": "hive/_HOST@EXAMPLE.COM", "proxy_user": "login"}
For the DAG owner use:
{ "use_beeline": true, "principal": "hive/_HOST@EXAMPLE.COM", "proxy_user": "owner"}
and in your DAG, when initializing the HiveOperator, specify:
run_as_owner=True
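The _HOST placeholder substitution described above amounts to a simple string replacement with the server's fully qualified domain name. A minimal sketch, assuming a hypothetical helper name (this is not Airflow's or the Kerberos client's actual code):

```python
def expand_principal(principal: str, fqdn: str) -> str:
    # Hypothetical helper: replace the _HOST placeholder with the
    # fully qualified domain name of the server, as Kerberos clients do.
    return principal.replace("_HOST", fqdn)

print(expand_principal("hive/_HOST@EXAMPLE.COM", "worker1.example.com"))
# hive/worker1.example.com@EXAMPLE.COM
```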
To use kerberos authentication, you must install Airflow with the kerberos extras group:
pip install airflow[kerberos]
OAuth Authentication
GitHub Enterprise (GHE) Authentication
The GitHub Enterprise authentication backend can be used to authenticate users against an installation of GitHub Enterprise using OAuth2. You can optionally specify a team whitelist (composed of slug cased team names) to restrict login to only members of those teams.
[webserver]
authenticate = True
auth_backend = airflow.contrib.auth.backends.github_enterprise_auth

[github_enterprise]
host = github.example.com
client_id = oauth_key_from_github_enterprise
client_secret = oauth_secret_from_github_enterprise
oauth_callback_route = /example/ghe_oauth/callback
allowed_teams = 1, 345, 23
Note
If you do not specify a team whitelist, anyone with a valid account on your GHE installation will be able to login to Airflow.
To use GHE authentication, you must install Airflow with the github_enterprise extras group:
pip install airflow[github_enterprise]
Setting up GHE Authentication
An application must be setup in GHE before you can use the GHE authentication backend. In order to setup an application:
1. Navigate to your GHE profile
2. Select 'Applications' from the left hand nav
3. Select the 'Developer Applications' tab
4. Click 'Register new application'
5. Fill in the required information (the 'Authorization callback URL' must be fully qualified, e.g. http://airflow.example.com/example/ghe_oauth/callback)
6. Click 'Register application'
7. Copy 'Client ID', 'Client Secret', and your callback route to your airflow.cfg according to the above example
Using GHE Authentication with github.com
It is possible to use GHE authentication with github.com:
1. Create an OAuth App
2. Copy 'Client ID', 'Client Secret' to your airflow.cfg according to the above example
3. Set host = github.com and oauth_callback_route = /oauth/callback in airflow.cfg
Google Authentication
The Google authentication backend can be used to authenticate users against Google using OAuth2. You must specify the email domains to restrict login, separated with a comma, to only members of those domains.
[webserver]
authenticate = True
auth_backend = airflow.contrib.auth.backends.google_auth

[google]
client_id = google_client_id
client_secret = google_client_secret
oauth_callback_route = /oauth2callback
domain = example1.com,example2.com
To use Google authentication, you must install Airflow with the google_auth extras group:
pip install airflow[google_auth]
Setting up Google Authentication
An application must be setup in the Google API Console before you can use the Google authentication backend. In order to setup an application:
1. Navigate to https://console.developers.google.com/apis/
2. Select 'Credentials' from the left hand nav
3. Click 'Create credentials' and choose 'OAuth client ID'
4. Choose 'Web application'
5. Fill in the required information (the 'Authorized redirect URIs' must be fully qualified, e.g. http://airflow.example.com/oauth2callback)
6. Click 'Create'
7. Copy 'Client ID', 'Client Secret', and your redirect URI to your airflow.cfg according to the above example
SSL
SSL can be enabled by providing a certificate and key. Once enabled, be sure to use "https://" in your browser.
[webserver]
web_server_ssl_cert = <path to cert>
web_server_ssl_key = <path to key>
Enabling SSL will not automatically change the web server port. If you want to use the standard port 443, you'll need to configure that too. Be aware that super user privileges (or cap_net_bind_service on Linux) are required to listen on port 443.
# Optionally, set the server to listen on the standard SSL port.
web_server_port = 443
base_url = http://<hostname or IP>:443
Enable CeleryExecutor with SSL. Ensure you properly generate client and server certs and keys.
[celery]
ssl_active = True
ssl_key = <path to key>
ssl_cert = <path to cert>
ssl_cacert = <path to cacert>
Impersonation
Airflow has the ability to impersonate a unix user while running task instances, based on the task's run_as_user parameter, which takes a user's name.
NOTE: For impersonations to work, Airflow must be run with sudo, as subtasks are run with sudo -u and permissions of files are changed. Furthermore, the unix user needs to exist on the worker. Here is what a simple sudoers file entry could look like to achieve this, assuming airflow is running as the airflow user. Note that this means that the airflow user must be trusted and treated the same way as the root user.
airflow ALL=(ALL) NOPASSWD: ALL
Subtasks with impersonation will still log to the same folder, except that the files they log to will have permissions changed such that only the unix user can write to them.
Default Impersonation
To prevent tasks that don't use impersonation from being run with sudo privileges, you can set the core:default_impersonation config, which sets a default user to impersonate if run_as_user is not set.
[core]
default_impersonation = airflow
Time zones
Support for time zones is enabled by default. Airflow stores datetime information in UTC internally and in the database. It allows you to run your DAGs with time zone dependent schedules. At the moment, Airflow does not convert them to the end user's time zone in the user interface; there it will always be displayed in UTC. Also, templates used in Operators are not converted. Time zone information is exposed, and it is up to the writer of the DAG to decide what to do with it.
This is handy if your users live in more than one time zone and you want to display datetime information according to each user's wall clock.
Even if you are running Airflow in only one time zone, it is still good practice to store data in UTC in your database (this was also the recommendation, or even a requirement, before Airflow became time zone aware). The main reason is Daylight Saving Time (DST). Many countries have a system of DST, where clocks are moved forward in spring and backward in autumn. If you're working in local time, you're likely to encounter errors twice a year, when the transitions happen. (The pendulum and pytz documentation discusses these issues in greater detail.) This probably doesn't matter for a simple DAG, but it's a problem if you are, for example, in financial services where you have end of day deadlines to meet.
The time zone is set in airflow.cfg. By default it is set to utc, but you can change it to use the system's settings or an arbitrary IANA time zone, e.g. Europe/Amsterdam. It depends on pendulum, which is more accurate than pytz. Pendulum is installed automatically when you install Airflow.
Please note that the Airflow Web UI currently only displays UTC, and jobs are scheduled in UTC.
Concepts
Naive and aware datetime objects
Python's datetime.datetime objects have a tzinfo attribute that can be used to store time zone information, represented as an instance of a subclass of datetime.tzinfo. When this attribute is set and describes an offset, a datetime object is aware; otherwise, it is naive.
You can use timezone.is_aware() and timezone.is_naive() to determine whether datetimes are aware or naive.
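The aware/naive distinction can be shown with the standard library alone (no Airflow install needed); the is_aware helper below is an illustrative sketch of the check that airflow.utils.timezone performs:

```python
from datetime import datetime, timezone

naive = datetime(2017, 1, 1)                       # no tzinfo attached
aware = datetime(2017, 1, 1, tzinfo=timezone.utc)  # tzinfo set to UTC

def is_aware(dt: datetime) -> bool:
    # A datetime is aware when tzinfo is set and yields a UTC offset.
    return dt.tzinfo is not None and dt.tzinfo.utcoffset(dt) is not None

print(is_aware(naive))  # False
print(is_aware(aware))  # True
```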
Because Airflow uses time zone aware datetime objects, if your code creates datetime objects, they need to be aware too.
from airflow.utils import timezone

now = timezone.utcnow()
a_date = timezone.datetime(2017, 1, 1)
Interpretation of naive datetime objects
Although Airflow operates fully time zone aware, it still accepts naive date time objects for start_dates and end_dates in your DAG definitions. This is mostly in order to preserve backwards compatibility. In case a naive start_date or end_date is encountered, the default time zone is applied. It is applied in such a way that it is assumed that the naive date time is already in the default time zone. In other words, if you have a default time zone setting of Europe/Amsterdam and create a naive datetime start_date of datetime(2017, 1, 1), it is assumed to be a start_date of Jan 1, 2017 Amsterdam time.
default_args = dict(
    start_date=datetime(2016, 1, 1),
    owner='Airflow'
)

dag = DAG('my_dag', default_args=default_args)
op = DummyOperator(task_id='dummy', dag=dag)
print(op.owner)  # Airflow
Unfortunately, during DST transitions some datetimes don't exist or are ambiguous. In such situations, pendulum raises an exception. That's why you should always create aware datetime objects when time zone support is enabled.
In practice, this is rarely an issue. Airflow gives you aware datetime objects in the models and DAGs, and most often new datetime objects are created from existing ones through timedelta arithmetic. The only datetime that's often created in application code is the current time, and timezone.utcnow() automatically does the right thing.
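The "naive datetime is assumed to be in the default time zone" rule amounts to attaching the zone without shifting the wall-clock value. A minimal sketch using a fixed +01:00 offset as a stand-in for Europe/Amsterdam's winter offset (in real Airflow, pendulum handles the full DST rules):

```python
from datetime import datetime, timezone, timedelta

DEFAULT_TZ = timezone(timedelta(hours=1))  # stand-in for Europe/Amsterdam (winter offset)

def make_aware(dt: datetime) -> datetime:
    # Attach the default zone without shifting the wall-clock value.
    return dt.replace(tzinfo=DEFAULT_TZ) if dt.tzinfo is None else dt

start_date = make_aware(datetime(2017, 1, 1))
print(start_date.isoformat())                           # 2017-01-01T00:00:00+01:00
print(start_date.astimezone(timezone.utc).isoformat())  # 2016-12-31T23:00:00+00:00
```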
Default time zone
The default time zone is the time zone defined by the default_timezone setting under [core]. If you just installed Airflow, it will be set to utc, which is recommended. You can also set it to system or an IANA time zone (e.g. Europe/Amsterdam). DAGs are also evaluated on Airflow workers; it is therefore important to make sure this setting is equal on all Airflow nodes.
[core]
default_timezone = utc
Time zone aware DAGs
Creating a time zone aware DAG is quite simple. Just make sure to supply a time zone aware start_date. It is recommended to use pendulum for this, but pytz (to be installed manually) can also be used.
import pendulum

local_tz = pendulum.timezone("Europe/Amsterdam")

default_args = dict(
    start_date=datetime(2016, 1, 1, tzinfo=local_tz),
    owner='Airflow'
)

dag = DAG('my_tz_dag', default_args=default_args)
op = DummyOperator(task_id='dummy', dag=dag)
print(dag.timezone)  # <Timezone [Europe/Amsterdam]>
Please note that while it is possible to set a start_date and end_date for Tasks, the DAG timezone or global timezone (in that order) will always be used to calculate the next execution date. Upon first encounter, the start date or end date will be converted to UTC using the timezone associated with start_date or end_date; then for calculations this timezone information will be disregarded.
Templates
Airflow returns time zone aware datetimes in templates, but does not convert them to local time, so they remain in UTC. It is left up to the DAG to handle this.
import pendulum

local_tz = pendulum.timezone("Europe/Amsterdam")
local_tz.convert(execution_date)
Cron schedules
In case you set a cron schedule, Airflow assumes you will always want to run at the exact same time. It will then ignore daylight savings time. Thus, if you have a schedule that says run at the end of interval every day at 08:00 GMT+1, it will always run at the end of interval 08:00 GMT+1, regardless of whether daylight savings time is in place.
Time deltas
For schedules with time deltas, Airflow assumes you always will want to run with the specified interval. So if you specify a timedelta(hours=2), you will always want to run two hours later. In this case, daylight savings time will be taken into account.
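The two behaviors above can be contrasted in a short sketch: a cron schedule stays pinned to the same clock time (here modeled as a fixed GMT+1 offset, per the 08:00 GMT+1 example), so the UTC instant never shifts with the seasons, while a timedelta schedule advances by exactly the interval in absolute time. The helper name and dates are illustrative assumptions:

```python
from datetime import datetime, timedelta, timezone

FIXED = timezone(timedelta(hours=1))  # the cron schedule stays pinned at GMT+1 year-round

def cron_run_utc(day: datetime) -> datetime:
    # Cron behavior per the text above: always 08:00 GMT+1, ignoring DST,
    # so the corresponding UTC instant is the same in winter and summer.
    return datetime(day.year, day.month, day.day, 8, tzinfo=FIXED).astimezone(timezone.utc)

print(cron_run_utc(datetime(2018, 1, 15)).hour)  # 7 (winter)
print(cron_run_utc(datetime(2018, 7, 15)).hour)  # 7 (summer, unchanged)

# Timedelta behavior: the next run is exactly the interval later in absolute
# time, so across a DST transition the local wall-clock time may shift instead.
first = datetime(2018, 3, 24, 7, 0, tzinfo=timezone.utc)
second = first + timedelta(hours=24)
print(second - first == timedelta(hours=24))  # True
```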
Experimental Rest API
Airflow exposes an experimental Rest API. It is available through the webserver. Endpoints are available at /api/experimental/. Please note that we expect the endpoint definitions to change.
Endpoints
This is a place holder until the swagger definitions are active.
· /api/experimental/dags/<DAG_ID>/tasks/<TASK_ID> returns info for a task (GET)
· /api/experimental/dags/<DAG_ID>/dag_runs creates a dag_run for a given dag id (POST)
CLI
For some functions the CLI can use the API. To configure the CLI to use the API when available, configure as follows:
[cli]
api_client = airflow.api.client.json_client
endpoint_url = http://<WEBSERVER>:<PORT>
Authentication
Authentication for the API is handled separately to the Web Authentication. The default is to not require any authentication on the API, i.e. wide open by default. This is not recommended if your Airflow webserver is publicly accessible, and you should probably use the deny all backend:
[api]
auth_backend = airflow.api.auth.backend.deny_all
Two "real" methods for authentication are currently supported for the API.
To enable password authentication, set the following in the configuration:
[api]
auth_backend = airflow.contrib.auth.backends.password_auth
Its usage is similar to the Password Authentication used for the Web interface.
To enable Kerberos authentication, set the following in the configuration:
[api]
auth_backend = airflow.api.auth.backend.kerberos_auth

[kerberos]
keytab = <KEYTAB>
The Kerberos service is configured as airflow/fully.qualified.domainname@REALM. Make sure this principal exists in the keytab file.
Integration
· Reverse Proxy
· Azure: Microsoft Azure
· AWS: Amazon Web Services
· Databricks
· GCP: Google Cloud Platform
· Qubole
Reverse Proxy
Airflow can be set up behind a reverse proxy, with the ability to set its endpoint with great flexibility.
For example, you can configure your reverse proxy to get:
https://lab.mycompany.com/myorg/airflow/
To do so, you need to set the following setting in your airflow.cfg:
base_url = http://my_host/myorg/airflow
Additionally, if you use Celery Executor, you can get Flower in /myorg/flower/ with:
flower_url_prefix = /myorg/flower
Your reverse proxy (ex: nginx) should be configured as follows:
· pass the url and http header as-is to the Airflow webserver, without any rewrite, for example:
server {
    listen 80;
    server_name lab.mycompany.com;

    location /myorg/airflow/ {
        proxy_pass http://localhost:8080;
        proxy_set_header Host $host;
        proxy_redirect off;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
    }
}
· rewrite the url for the flower endpoint:
server {
    listen 80;
    server_name lab.mycompany.com;

    location /myorg/flower/ {
        rewrite ^/myorg/flower/(.*)$ /$1 break;  # remove prefix from http header
        proxy_pass http://localhost:5555;
        proxy_set_header Host $host;
        proxy_redirect off;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
    }
}
To ensure that Airflow generates URLs with the correct scheme when running behind a TLS-terminating proxy, you should configure the proxy to set the X-Forwarded-Proto header, and enable the ProxyFix middleware in your airflow.cfg:
enable_proxy_fix = True
Note: you should only enable the ProxyFix middleware when running Airflow behind a trusted proxy (AWS ELB, nginx, etc.).
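What ProxyFix does can be sketched as trusting the proxy-supplied header when deciding the effective scheme. This is a simplified stand-in for werkzeug's ProxyFix middleware, with an illustrative function name, not the middleware's actual code:

```python
def effective_scheme(environ: dict) -> str:
    # Behind a TLS-terminating proxy the app sees plain HTTP, but the proxy
    # reports the original scheme in X-Forwarded-Proto; trust it when present.
    return environ.get("HTTP_X_FORWARDED_PROTO", environ.get("wsgi.url_scheme", "http"))

direct = {"wsgi.url_scheme": "http"}
proxied = {"wsgi.url_scheme": "http", "HTTP_X_FORWARDED_PROTO": "https"}
print(effective_scheme(direct))   # http
print(effective_scheme(proxied))  # https
```

This also shows why the header must only be trusted behind a trusted proxy: any client could send it directly otherwise.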
Azure Microsoft Azure
Airflow has limited support for Microsoft Azure: interfaces exist only for Azure Blob Storage and Azure Data Lake. The Hook, Sensor and Operator for Blob Storage and the Azure Data Lake Hook are in the contrib section.
Azure Blob Storage
All classes communicate via the Window Azure Storage Blob protocol. Make sure that an Airflow connection of type wasb exists. Authorization can be done by supplying a login (=Storage account name) and password (=KEY), or login and SAS token in the extra field (see connection wasb_default for an example).
· WasbBlobSensor: Checks if a blob is present on Azure Blob storage.
· WasbPrefixSensor: Checks if blobs matching a prefix are present on Azure Blob storage.
· FileToWasbOperator: Uploads a local file to a container as a blob.
· WasbHook: Interface with Azure Blob Storage.
WasbBlobSensor
WasbPrefixSensor
FileToWasbOperator
WasbHook
Azure File Share
Cloud variant of a SMB file share. Make sure that an Airflow connection of type wasb exists. Authorization can be done by supplying a login (=Storage account name) and password (=Storage account key), or login and SAS token in the extra field (see connection wasb_default for an example).
AzureFileShareHook
Logging
Airflow can be configured to read and write task logs in Azure Blob Storage. See Writing Logs to Azure Blob Storage.
Azure Data Lake
AzureDataLakeHook communicates via a REST API compatible with WebHDFS. Make sure that an Airflow connection of type azure_data_lake exists. Authorization can be done by supplying a login (=Client ID), password (=Client Secret), and extra fields tenant (Tenant) and account_name (Account Name)
(see connection azure_data_lake_default for an example).
· AzureDataLakeHook: Interface with Azure Data Lake.
AzureDataLakeHook
AWS Amazon Web Services
Airflow has extensive support for Amazon Web Services. But note that the Hooks, Sensors and Operators are in the contrib section.
AWS EMR
· EmrAddStepsOperator: Adds steps to an existing EMR JobFlow.
· EmrCreateJobFlowOperator: Creates an EMR JobFlow, reading the config from the EMR connection.
· EmrTerminateJobFlowOperator: Terminates an EMR JobFlow.
· EmrHook: Interact with AWS EMR.
EmrAddStepsOperator
class airflow.contrib.operators.emr_add_steps_operator.EmrAddStepsOperator(job_flow_id, aws_conn_id='s3_default', steps=None, *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
An operator that adds steps to an existing EMR job_flow.
Parameters
· job_flow_id (str) – id of the JobFlow to add steps to (templated)
· aws_conn_id (str) – aws connection to use
· steps (list) – boto3 style steps to be added to the jobflow (templated)
EmrCreateJobFlowOperator
class airflow.contrib.operators.emr_create_job_flow_operator.EmrCreateJobFlowOperator(aws_conn_id='s3_default', emr_conn_id='emr_default', job_flow_overrides=None, *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Creates an EMR JobFlow, reading the config from the EMR connection. A dictionary of JobFlow overrides can be passed that override the config from the connection.
Parameters
· aws_conn_id (str) – aws connection to use
· emr_conn_id (str) – emr connection to use
· job_flow_overrides – boto3 style arguments to override emr_connection extra (templated)
EmrTerminateJobFlowOperator
class airflow.contrib.operators.emr_terminate_job_flow_operator.EmrTerminateJobFlowOperator(job_flow_id, aws_conn_id='s3_default', *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Operator to terminate EMR JobFlows.
Parameters
· job_flow_id (str) – id of the JobFlow to terminate (templated)
· aws_conn_id (str) – aws connection to use
EmrHook
class airflow.contrib.hooks.emr_hook.EmrHook(emr_conn_id=None, *args, **kwargs)[source]
Bases: airflow.contrib.hooks.aws_hook.AwsHook
Interact with AWS EMR. emr_conn_id is only necessary for using the create_job_flow method.
create_job_flow(job_flow_overrides)[source]
Creates a job flow using the config from the EMR connection. Keys of the json extra hash may have the arguments of the boto3 run_job_flow method. Overrides for this config may be passed as the job_flow_overrides.
AWS S3
· S3Hook: Interact with AWS S3.
· S3FileTransformOperator: Copies data from a source S3 location to a temporary location on the local filesystem.
· S3ListOperator: Lists the files matching a key prefix from a S3 location.
· S3ToGoogleCloudStorageOperator: Syncs an S3 location with a Google Cloud Storage bucket.
· S3ToHiveTransfer: Moves data from S3 to Hive. The operator downloads a file from S3 and stores the file locally before loading it into a Hive table.
S3Hook
class airflow.hooks.S3_hook.S3Hook(aws_conn_id='aws_default', verify=None)[source]
Bases: airflow.contrib.hooks.aws_hook.AwsHook
Interact with AWS S3 using the boto3 library
check_for_bucket(bucket_name)[source]
Check if bucket_name exists
Parameters
bucket_name (str) – the name of the bucket
check_for_key(key, bucket_name=None)[source]
Checks if a key exists in a bucket.
Parameters
· key (str) – S3 key that will point to the file
· bucket_name (str) – Name of the bucket in which the file is stored
check_for_prefix(bucket_name, prefix, delimiter)[source]
Checks that a prefix exists in a bucket.
Parameters
· bucket_name (str) – the name of the bucket
· prefix (str) – a key prefix
· delimiter (str) – the delimiter marks key hierarchy
check_for_wildcard_key(wildcard_key, bucket_name=None, delimiter='')[source]
Checks that a key matching a wildcard expression exists in a bucket.
Parameters
· wildcard_key (str) – the path to the key
· bucket_name (str) – the name of the bucket
· delimiter (str) – the delimiter marks key hierarchy
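The wildcard semantics of check_for_wildcard_key can be illustrated with fnmatch-style globbing over key names. This is a sketch of the matching idea only, not the hook's actual listing logic; the key names and helper are illustrative:

```python
from fnmatch import fnmatch

keys = [
    "logs/2021/07/08/task.log",
    "logs/2021/07/09/task.log",
    "data/input.csv",
]

def matching_keys(keys, wildcard_key):
    # Return keys that match the wildcard expression, as the hook's
    # wildcard check conceptually does over a bucket listing.
    return [k for k in keys if fnmatch(k, wildcard_key)]

print(matching_keys(keys, "logs/2021/07/*/task.log"))
# ['logs/2021/07/08/task.log', 'logs/2021/07/09/task.log']
```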
copy_object(source_bucket_key, dest_bucket_key, source_bucket_name=None, dest_bucket_name=None, source_version_id=None)[source]
Creates a copy of an object that is already stored in S3.
Note: the S3 connection used here needs to have access to both source and destination bucket/key.
Parameters
· source_bucket_key (str) –
The key of the source object.
It can be either a full s3:// style url or a relative path from the root level.
When it's specified as a full s3:// url, please omit source_bucket_name.
· dest_bucket_key (str) –
The key of the object to copy to.
The convention to specify dest_bucket_key is the same as source_bucket_key.
· source_bucket_name (str) –
Name of the S3 bucket where the source object is in.
It should be omitted when source_bucket_key is provided as a full s3:// url.
· dest_bucket_name (str) –
Name of the S3 bucket to where the object is copied.
It should be omitted when dest_bucket_key is provided as a full s3:// url.
· source_version_id (str) – Version ID of the source object (OPTIONAL)
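The full-url vs bucket+key convention above can be normalized with a small parser. This is a hypothetical helper for illustration; the hook and boto3 do the equivalent internally:

```python
from urllib.parse import urlparse

def parse_s3_url(s3url: str) -> tuple:
    # Split a full "s3://bucket/key" url into (bucket, key).
    parsed = urlparse(s3url)
    if parsed.scheme != "s3":
        raise ValueError("not an s3:// url: " + s3url)
    return parsed.netloc, parsed.path.lstrip("/")

print(parse_s3_url("s3://my-bucket/path/to/object.txt"))
# ('my-bucket', 'path/to/object.txt')
```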
delete_objects(bucket keys)[source]
Parameters
· bucket (str) – Name of the bucket in which you are going to delete object(s)
· keys (str or list) –
The key(s) to delete from S3 bucket
When keys is a string it’s supposed to be the key name of the single object to delete
When keys is a list it’s supposed to be the list of the keys to delete
get_bucket(bucket_name)[source]
Returns a boto3S3Bucket object
Parameters
bucket_name (str) – the name of the bucket
get_key(key, bucket_name=None)[source]
Returns a boto3s3Object
Parameters
· key (str) – the path to the key
· bucket_name (str) – the name of the bucket
get_wildcard_key(wildcard_key, bucket_name=None, delimiter='')[source]
Returns a boto3s3Object object matching the wildcard expression
Parameters
· wildcard_key (str) – the path to the key
· bucket_name (str) – the name of the bucket
· delimiter (str) – the delimiter marks key hierarchy
list_keys(bucket_name, prefix='', delimiter='', page_size=None, max_items=None)[source]
Lists keys in a bucket under prefix and not containing delimiter
Parameters
· bucket_name (str) – the name of the bucket
· prefix (str) – a key prefix
· delimiter (str) – the delimiter marks key hierarchy
· page_size (int) – pagination size
· max_items (int) – maximum items to return
list_prefixes(bucket_name, prefix='', delimiter='', page_size=None, max_items=None)[source]
Lists prefixes in a bucket under prefix
Parameters
· bucket_name (str) – the name of the bucket
· prefix (str) – a key prefix
· delimiter (str) – the delimiter marks key hierarchy
· page_size (int) – pagination size
· max_items (int) – maximum items to return
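The difference between list_keys (keys under a prefix) and list_prefixes ("directory-like" common prefixes cut at the delimiter) can be sketched in pure Python. This is illustrative only; S3 performs this grouping server-side:

```python
def common_prefixes(keys, prefix="", delimiter="/"):
    # Mimic S3's delimiter handling: everything up to and including the first
    # delimiter after the prefix becomes a common prefix ("subdirectory").
    out = set()
    for key in keys:
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        if delimiter in rest:
            out.add(prefix + rest.split(delimiter)[0] + delimiter)
    return sorted(out)

keys = ["logs/2021/a.log", "logs/2022/b.log", "data/c.csv"]
print(common_prefixes(keys, prefix="logs/"))  # ['logs/2021/', 'logs/2022/']
```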
load_bytes(bytes_data, key, bucket_name=None, replace=False, encrypt=False)[source]
Loads bytes to S3
This is provided as a convenience to drop a string in S3 It uses the boto infrastructure to ship a file to s3
Parameters
· bytes_data (bytes) – bytes to set as content for the key
· key (str) – S3 key that will point to the file
· bucket_name (str) – Name of the bucket in which to store the file
· replace (bool) – A flag to decide whether or not to overwrite the key if it already exists
· encrypt (bool) – If True the file will be encrypted on the serverside by S3 and will be stored in an encrypted form while at rest in S3
load_file(filename, key, bucket_name=None, replace=False, encrypt=False)[source]
Loads a local file to S3
Parameters
· filename (str) – name of the file to load
· key (str) – S3 key that will point to the file
· bucket_name (str) – Name of the bucket in which to store the file
· replace (bool) – A flag to decide whether or not to overwrite the key if it already exists If replace is False and the key exists an error will be raised
· encrypt (bool) – If True the file will be encrypted on the serverside by S3 and will be stored in an encrypted form while at rest in S3
load_string(string_data, key, bucket_name=None, replace=False, encrypt=False, encoding='utf-8')[source]
Loads a string to S3
This is provided as a convenience to drop a string in S3 It uses the boto infrastructure to ship a file to s3
Parameters
· string_data (str) – str to set as content for the key
· key (str) – S3 key that will point to the file
· bucket_name (str) – Name of the bucket in which to store the file
· replace (bool) – A flag to decide whether or not to overwrite the key if it already exists
· encrypt (bool) – If True the file will be encrypted on the serverside by S3 and will be stored in an encrypted form while at rest in S3
read_key(key, bucket_name=None)[source]
Reads a key from S3
Parameters
· key (str) – S3 key that will point to the file
· bucket_name (str) – Name of the bucket in which the file is stored
select_key(key, bucket_name=None, expression='SELECT * FROM S3Object', expression_type='SQL', input_serialization=None, output_serialization=None)[source]
Reads a key with S3 Select.
Parameters
· key (str) – S3 key that will point to the file
· bucket_name (str) – Name of the bucket in which the file is stored
· expression (str) – S3 Select expression
· expression_type (str) – S3 Select expression type
· input_serialization (dict) – S3 Select input data serialization format
· output_serialization (dict) – S3 Select output data serialization format
Returns
retrieved subset of original data by S3 Select
Return type
str
See also
For more details about S3 Select parameters: http://boto3.readthedocs.io/en/latest/reference/services/s3.html#S3.Client.select_object_content
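The serialization dicts follow boto3's select_object_content shapes. For example, to query a CSV whose first line is a header and get CSV rows back (the bucket and key names in the commented call are hypothetical):

```python
# Input/output serialization dicts for select_key, following the boto3
# select_object_content API: treat the first CSV line as a header and
# return plain CSV rows.
input_serialization = {'CSV': {'FileHeaderInfo': 'Use'}}
output_serialization = {'CSV': {}}

# hook.select_key('data/customers.csv', bucket_name='my-bucket',
#                 expression="SELECT s.name FROM S3Object s",
#                 input_serialization=input_serialization,
#                 output_serialization=output_serialization)
```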
S3FileTransformOperator
class airflow.operators.s3_file_transform_operator.S3FileTransformOperator(source_s3_key, dest_s3_key, transform_script=None, select_expression=None, source_aws_conn_id='aws_default', source_verify=None, dest_aws_conn_id='aws_default', dest_verify=None, replace=False, *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Copies data from a source S3 location to a temporary location on the local filesystem. Runs a transformation on this file as specified by the transformation script and uploads the output to a destination S3 location.
The locations of the source and the destination files in the local filesystem are provided as the first and second arguments to the transformation script. The transformation script is expected to read the data from source, transform it and write the output to the local destination file. The operator then takes over control and uploads the local destination file to S3.
S3 Select is also available to filter the source contents. Users can omit the transformation script if an S3 Select expression is specified.
Parameters
· source_s3_key (str) – The key to be retrieved from S3 (templated)
· source_aws_conn_id (str) – source s3 connection
· source_verify (bool or str) –
Whether or not to verify SSL certificates for S3 connection. By default SSL certificates are verified. You can provide the following values:
o False: do not validate SSL certificates. SSL will still be used
(unless use_ssl is False), but SSL certificates will not be verified.
o path/to/cert/bundle.pem: A filename of the CA cert bundle to use.
You can specify this argument if you want to use a different CA cert bundle than the one used by botocore.
This is also applicable to dest_verify.
· dest_s3_key (str) – The key to be written from S3 (templated)
· dest_aws_conn_id (str) – destination s3 connection
· replace (bool) – Replace dest S3 key if it already exists
· transform_script (str) – location of the executable transformation script
· select_expression (str) – S3 Select expression
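Per the contract described above, the transform script simply receives the local source and destination paths as its first two command-line arguments. A minimal sketch of such a script (the uppercasing step is only a placeholder transformation):

```python
#!/usr/bin/env python
# Minimal transform script for S3FileTransformOperator: argv[1] is the
# local copy of the source S3 object, argv[2] is the local file whose
# contents will be uploaded to the destination S3 key.
import sys

def transform(source_path, dest_path):
    with open(source_path) as src, open(dest_path, 'w') as dst:
        for line in src:
            dst.write(line.upper())  # placeholder transformation

if __name__ == '__main__' and len(sys.argv) >= 3:
    transform(sys.argv[1], sys.argv[2])
```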
S3ListOperator
class airflow.contrib.operators.s3_list_operator.S3ListOperator(bucket, prefix='', delimiter='', aws_conn_id='aws_default', verify=None, *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
List all objects from the bucket with the given string prefix in name.
This operator returns a python list with the names of objects, which can be used by xcom in the downstream task.
Parameters
· bucket (str) – The S3 bucket where to find the objects (templated)
· prefix (str) – Prefix string to filter the objects whose name begins with this prefix (templated)
· delimiter (str) – the delimiter marks key hierarchy (templated)
· aws_conn_id (str) – The connection ID to use when connecting to S3 storage
Param verify:
Whether or not to verify SSL certificates for S3 connection. By default SSL certificates are verified. You can provide the following values:
· False: do not validate SSL certificates. SSL will still be used
(unless use_ssl is False), but SSL certificates will not be verified.
· path/to/cert/bundle.pem: A filename of the CA cert bundle to use.
You can specify this argument if you want to use a different CA cert bundle than the one used by botocore.
Example
The following operator would list all the files (excluding subfolders) from the S3 customers/2018/04/ key in the data bucket:
s3_file = S3ListOperator(
    task_id='list_3s_files',
    bucket='data',
    prefix='customers/2018/04/',
    delimiter='/',
    aws_conn_id='aws_customers_conn'
)
S3ToGoogleCloudStorageOperator
class airflow.contrib.operators.s3_to_gcs_operator.S3ToGoogleCloudStorageOperator(bucket, prefix='', delimiter='', aws_conn_id='aws_default', verify=None, dest_gcs_conn_id=None, dest_gcs=None, delegate_to=None, replace=False, *args, **kwargs)[source]
Bases: airflow.contrib.operators.s3_list_operator.S3ListOperator
Synchronizes an S3 key, possibly a prefix, with a Google Cloud Storage destination path.
Parameters
· bucket (str) – The S3 bucket where to find the objects (templated)
· prefix (str) – Prefix string which filters objects whose name begins with this prefix (templated)
· delimiter (str) – the delimiter marks key hierarchy (templated)
· aws_conn_id (str) – The source S3 connection
· dest_gcs_conn_id (str) – The destination connection ID to use when connecting to Google Cloud Storage
· dest_gcs (str) – The destination Google Cloud Storage bucket and prefix where you want to store the files (templated)
· delegate_to (str) – The account to impersonate, if any. For this to work, the service account making the request must have domain-wide delegation enabled.
· replace (bool) – Whether you want to replace existing destination files or not
Param verify:
Whether or not to verify SSL certificates for S3 connection. By default SSL certificates are verified. You can provide the following values:
· False: do not validate SSL certificates. SSL will still be used
(unless use_ssl is False), but SSL certificates will not be verified.
· path/to/cert/bundle.pem: A filename of the CA cert bundle to use.
You can specify this argument if you want to use a different CA cert bundle than the one used by botocore.
Example
s3_to_gcs_op = S3ToGoogleCloudStorageOperator(
    task_id='s3_to_gcs_example',
    bucket='my-s3-bucket',
    prefix='data/customers-201804',
    dest_gcs_conn_id='google_cloud_default',
    dest_gcs='gs://my-gcs-bucket/some/customers/',
    replace=False,
    dag=my_dag)
Note that bucket prefix delimiter and dest_gcs are templated so you can use variables in them if you wish
S3ToHiveTransfer
class airflow.operators.s3_to_hive_operator.S3ToHiveTransfer(s3_key, field_dict, hive_table, delimiter=',', create=True, recreate=False, partition=None, headers=False, check_headers=False, wildcard_match=False, aws_conn_id='aws_default', verify=None, hive_cli_conn_id='hive_cli_default', input_compressed=False, tblproperties=None, select_expression=None, *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Moves data from S3 to Hive. The operator downloads a file from S3, stores the file locally before loading it into a Hive table. If the create or recreate arguments are set to True, a CREATE TABLE and DROP TABLE statements are generated. Hive data types are inferred from the cursor's metadata.
Note that the table generated in Hive uses STORED AS textfile, which isn't the most efficient serialization format. If a large amount of data is loaded and/or if the table gets queried considerably, you may want to use this operator only to stage the data into a temporary table before loading it into its final destination using a HiveOperator.
Parameters
· s3_key (str) – The key to be retrieved from S3 (templated)
· field_dict (dict) – A dictionary of the fields name in the file as keys and their Hive types as values
· hive_table (str) – target Hive table, use dot notation to target a specific database (templated)
· create (bool) – whether to create the table if it doesn't exist
· recreate (bool) – whether to drop and recreate the table at every execution
· partition (dict) – target partition as a dict of partition columns and values (templated)
· headers (bool) – whether the file contains column names on the first line
· check_headers (bool) – whether the column names on the first line should be checked against the keys of field_dict
· wildcard_match (bool) – whether the s3_key should be interpreted as a Unix wildcard pattern
· delimiter (str) – field delimiter in the file
· aws_conn_id (str) – source s3 connection
· hive_cli_conn_id (str) – destination hive connection
· input_compressed (bool) – Boolean to determine if file decompression is required to process headers
· tblproperties (dict) – TBLPROPERTIES of the hive table being created
· select_expression (str) – S3 Select expression
Param verify:
Whether or not to verify SSL certificates for S3 connection. By default SSL certificates are verified. You can provide the following values:
· False: do not validate SSL certificates. SSL will still be used
(unless use_ssl is False), but SSL certificates will not be verified.
· path/to/cert/bundle.pem: A filename of the CA cert bundle to use.
You can specify this argument if you want to use a different CA cert bundle than the one used by botocore.
AWS EC2 Container Service
· ECSOperator Execute a task on AWS EC2 Container Service
ECSOperator
class airflow.contrib.operators.ecs_operator.ECSOperator(task_definition, cluster, overrides, aws_conn_id=None, region_name=None, launch_type='EC2', group=None, placement_constraints=None, platform_version='LATEST', network_configuration=None, **kwargs)[source]
Bases: airflow.models.BaseOperator
Execute a task on AWS EC2 Container Service
Parameters
· task_definition (str) – the task definition name on EC2 Container Service
· cluster (str) – the cluster name on EC2 Container Service
· overrides (dict) – the same parameter that boto3 will receive (templated): http://boto3.readthedocs.org/en/latest/reference/services/ecs.html#ECS.Client.run_task
· aws_conn_id (str) – connection id of AWS credentials / region name. If None, credential boto3 strategy will be used (http://boto3.readthedocs.io/en/latest/guide/configuration.html).
· region_name (str) – region name to use in AWS Hook. Overrides the region_name in connection (if provided)
· launch_type (str) – the launch type on which to run your task ('EC2' or 'FARGATE')
· group (str) – the name of the task group associated with the task
· placement_constraints (list) – an array of placement constraint objects to use for the task
· platform_version (str) – the platform version on which your task is running
· network_configuration (dict) – the network configuration for the task
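The overrides dict follows the containerOverrides shape that boto3's run_task accepts; for instance, a command override might look like the following (the container name and command here are hypothetical):

```python
# A hypothetical `overrides` payload for ECSOperator, matching the
# containerOverrides structure that boto3's run_task expects.
overrides = {
    'containerOverrides': [
        {
            'name': 'my-container',           # must match the task definition
            'command': ['python', 'job.py'],  # replaces the container command
        },
    ],
}
```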
AWS Batch Service
· AWSBatchOperator Execute a task on AWS Batch Service
AWSBatchOperator
class airflow.contrib.operators.awsbatch_operator.AWSBatchOperator(job_name, job_definition, job_queue, overrides, max_retries=4200, aws_conn_id=None, region_name=None, **kwargs)[source]
Bases: airflow.models.BaseOperator
Execute a job on AWS Batch Service
Parameters
· job_name (str) – the name for the job that will run on AWS Batch (templated)
· job_definition (str) – the job definition name on AWS Batch
· job_queue (str) – the queue name on AWS Batch
· overrides (dict) – the same parameter that boto3 will receive on containerOverrides (templated): http://boto3.readthedocs.io/en/latest/reference/services/batch.html#submit_job
· max_retries (int) – exponential backoff retries while waiter is not merged; 4200 = 48 hours
· aws_conn_id (str) – connection id of AWS credentials / region name. If None, credential boto3 strategy will be used (http://boto3.readthedocs.io/en/latest/guide/configuration.html).
· region_name (str) – region name to use in AWS Hook. Overrides the region_name in connection (if provided)
AWS RedShift
· AwsRedshiftClusterSensor Waits for a Redshift cluster to reach a specific status
· RedshiftHook Interact with AWS Redshift using the boto3 library
· RedshiftToS3Transfer Executes an UNLOAD command to S3 as CSV with or without headers
· S3ToRedshiftTransfer Executes a COPY command from S3 as CSV with or without headers
AwsRedshiftClusterSensor
class airflow.contrib.sensors.aws_redshift_cluster_sensor.AwsRedshiftClusterSensor(cluster_identifier, target_status='available', aws_conn_id='aws_default', *args, **kwargs)[source]
Bases: airflow.sensors.base_sensor_operator.BaseSensorOperator
Waits for a Redshift cluster to reach a specific status
Parameters
· cluster_identifier (str) – The identifier for the cluster being pinged
· target_status (str) – The cluster status desired
poke(context)[source]
Function that the sensors defined while deriving this class should override
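The poke semantics are simple: return True once the cluster reports the target status; otherwise the scheduler reschedules the check. A rough pure-Python sketch of that comparison (the real sensor fetches the status via RedshiftHook.cluster_status; redshift_poke is illustrative only):

```python
def redshift_poke(current_status, target_status='available'):
    # Sketch of AwsRedshiftClusterSensor.poke: the sensor succeeds once
    # the cluster's reported status equals the desired target_status.
    return current_status == target_status

redshift_poke('creating')   # not ready yet
redshift_poke('available')  # sensor succeeds
```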
RedshiftHook
class airflow.contrib.hooks.redshift_hook.RedshiftHook(aws_conn_id='aws_default', verify=None)[source]
Bases: airflow.contrib.hooks.aws_hook.AwsHook
Interact with AWS Redshift using the boto3 library
cluster_status(cluster_identifier)[source]
Return status of a cluster
Parameters
cluster_identifier (str) – unique identifier of a cluster
create_cluster_snapshot(snapshot_identifier cluster_identifier)[source]
Creates a snapshot of a cluster
Parameters
· snapshot_identifier (str) – unique identifier for a snapshot of a cluster
· cluster_identifier (str) – unique identifier of a cluster
delete_cluster(cluster_identifier, skip_final_cluster_snapshot=True, final_cluster_snapshot_identifier='')[source]
Delete a cluster and optionally create a snapshot
Parameters
· cluster_identifier (str) – unique identifier of a cluster
· skip_final_cluster_snapshot (bool) – determines cluster snapshot creation
· final_cluster_snapshot_identifier (str) – name of final cluster snapshot
describe_cluster_snapshots(cluster_identifier)[source]
Gets a list of snapshots for a cluster
Parameters
cluster_identifier (str) – unique identifier of a cluster
restore_from_cluster_snapshot(cluster_identifier snapshot_identifier)[source]
Restores a cluster from its snapshot
Parameters
· cluster_identifier (str) – unique identifier of a cluster
· snapshot_identifier (str) – unique identifier for a snapshot of a cluster
RedshiftToS3Transfer
class airflow.operators.redshift_to_s3_operator.RedshiftToS3Transfer(schema, table, s3_bucket, s3_key, redshift_conn_id='redshift_default', aws_conn_id='aws_default', verify=None, unload_options=(), autocommit=False, include_header=False, *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Executes an UNLOAD command to S3 as a CSV with headers.
Parameters
· schema (str) – reference to a specific schema in redshift database
· table (str) – reference to a specific table in redshift database
· s3_bucket (str) – reference to a specific S3 bucket
· s3_key (str) – reference to a specific S3 key
· redshift_conn_id (str) – reference to a specific redshift database
· aws_conn_id (str) – reference to a specific S3 connection
· unload_options (list) – reference to a list of UNLOAD options
Param verify:
Whether or not to verify SSL certificates for S3 connection. By default SSL certificates are verified. You can provide the following values:
· False: do not validate SSL certificates. SSL will still be used
(unless use_ssl is False), but SSL certificates will not be verified.
· path/to/cert/bundle.pem: A filename of the CA cert bundle to use.
You can specify this argument if you want to use a different CA cert bundle than the one used by botocore.
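The operator ultimately issues a Redshift UNLOAD statement against the table and S3 destination. A rough sketch of the statement shape (build_unload_query is a hypothetical helper, not the operator's internals; the real operator also injects a CREDENTIALS clause built from aws_conn_id):

```python
def build_unload_query(schema, table, s3_bucket, s3_key, unload_options=()):
    # Approximate shape of the UNLOAD statement RedshiftToS3Transfer issues;
    # unload_options (e.g. 'HEADER', "DELIMITER ','") are appended verbatim.
    options = ' '.join(unload_options)
    query = (f"UNLOAD ('SELECT * FROM {schema}.{table}') "
             f"TO 's3://{s3_bucket}/{s3_key}' {options}")
    return query.strip()

query = build_unload_query('public', 'users', 'my-bucket', 'exports/users_',
                           unload_options=('CSV', 'HEADER'))
```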
S3ToRedshiftTransfer
class airflow.operators.s3_to_redshift_operator.S3ToRedshiftTransfer(schema, table, s3_bucket, s3_key, redshift_conn_id='redshift_default', aws_conn_id='aws_default', verify=None, copy_options=(), autocommit=False, parameters=None, *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Executes a COPY command to load files from S3 to Redshift.
Parameters
· schema (str) – reference to a specific schema in redshift database
· table (str) – reference to a specific table in redshift database
· s3_bucket (str) – reference to a specific S3 bucket
· s3_key (str) – reference to a specific S3 key
· redshift_conn_id (str) – reference to a specific redshift database
· aws_conn_id (str) – reference to a specific S3 connection
· copy_options (list) – reference to a list of COPY options
Param verify:
Whether or not to verify SSL certificates for S3 connection. By default SSL certificates are verified. You can provide the following values:
· False: do not validate SSL certificates. SSL will still be used
(unless use_ssl is False), but SSL certificates will not be verified.
· path/to/cert/bundle.pem: A filename of the CA cert bundle to use.
You can specify this argument if you want to use a different CA cert bundle than the one used by botocore.
AWS DynamoDB
· HiveToDynamoDBTransferOperator Moves data from Hive to DynamoDB
· AwsDynamoDBHook Interact with AWS DynamoDB
HiveToDynamoDBTransferOperator
class airflow.contrib.operators.hive_to_dynamodb.HiveToDynamoDBTransferOperator(sql, table_name, table_keys, pre_process=None, pre_process_args=None, pre_process_kwargs=None, region_name=None, schema='default', hiveserver2_conn_id='hiveserver2_default', aws_conn_id='aws_default', *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Moves data from Hive to DynamoDB. Note that for now the data is loaded into memory before being pushed to DynamoDB, so this operator should be used for small amounts of data.
Parameters
· sql (str) – SQL query to execute against the hive database (templated)
· table_name (str) – target DynamoDB table
· table_keys (list) – partition key and sort key
· pre_process (function) – implement preprocessing of source data
· pre_process_args (list) – list of pre_process function arguments
· pre_process_kwargs (dict) – dict of pre_process function arguments
· region_name (str) – aws region name (example: us-east-1)
· schema (str) – hive database schema
· hiveserver2_conn_id (str) – source hive connection
· aws_conn_id (str) – aws connection
AwsDynamoDBHook
class airflow.contrib.hooks.aws_dynamodb_hook.AwsDynamoDBHook(table_keys=None, table_name=None, region_name=None, *args, **kwargs)[source]
Bases: airflow.contrib.hooks.aws_hook.AwsHook
Interact with AWS DynamoDB
Parameters
· table_keys (list) – partition key and sort key
· table_name (str) – target DynamoDB table
· region_name (str) – aws region name (example: us-east-1)
write_batch_data(items)[source]
Write batch items to a DynamoDB table with provisioned throughput capacity.
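DynamoDB's BatchWriteItem API caps each request at 25 items, so batch writers chunk their input accordingly (boto3's batch_writer handles this automatically; chunk_items below is a hypothetical illustration of the constraint, not the hook's code):

```python
def chunk_items(items, batch_size=25):
    # BatchWriteItem accepts at most 25 items per request, so a batch
    # writer splits the input into request-sized chunks like this.
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]

batches = chunk_items([{'id': str(n)} for n in range(60)])
```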
AWS Lambda
· AwsLambdaHook Interact with AWS Lambda
AwsLambdaHook
class airflow.contrib.hooks.aws_lambda_hook.AwsLambdaHook(function_name, region_name=None, log_type='None', qualifier='$LATEST', invocation_type='RequestResponse', *args, **kwargs)[source]
Bases: airflow.contrib.hooks.aws_hook.AwsHook
Interact with AWS Lambda
Parameters
· function_name (str) – AWS Lambda Function Name
· region_name (str) – AWS Region Name (example: us-west-2)
· log_type (str) – Tail Invocation Request
· qualifier (str) – AWS Lambda Function Version or Alias Name
· invocation_type (str) – AWS Lambda Invocation Type (RequestResponse, Event etc)
invoke_lambda(payload)[source]
Invoke Lambda Function
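invoke_lambda forwards a payload to the Lambda Invoke API; Lambda payloads are JSON documents serialized to a string. A small sketch of building one (build_lambda_payload and the event keys are hypothetical, for illustration):

```python
import json

def build_lambda_payload(event):
    # Lambda's Invoke API takes the payload as a JSON string; the hook
    # passes it through to the underlying boto3 invoke call.
    return json.dumps(event)

payload = build_lambda_payload({'action': 'reprocess', 'batch_id': 42})
# hook.invoke_lambda(payload=payload)
```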
AWS Kinesis
· AwsFirehoseHook Interact with AWS Kinesis Firehose
AwsFirehoseHook
class airflow.contrib.hooks.aws_firehose_hook.AwsFirehoseHook(delivery_stream, region_name=None, *args, **kwargs)[source]
Bases: airflow.contrib.hooks.aws_hook.AwsHook
Interact with AWS Kinesis Firehose.
Parameters
· delivery_stream (str) – Name of the delivery stream
· region_name (str) – AWS region name (example: us-east-1)
get_conn()[source]
Returns AwsHook connection object
put_records(records)[source]
Write batch records to Kinesis Firehose
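Firehose batch puts take records of the form {'Data': <bytes>} (the boto3 put_record_batch shape); a common pattern is to newline-delimit text records so downstream consumers can split the stream. to_firehose_records is a hypothetical helper:

```python
def to_firehose_records(lines):
    # Build the {'Data': <bytes>} record dicts expected by Kinesis Firehose
    # batch puts; a trailing newline keeps records separable downstream.
    return [{'Data': (line + '\n').encode('utf-8')} for line in lines]

records = to_firehose_records(['{"id": 1}', '{"id": 2}'])
# hook.put_records(records)
```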
Databricks
Databricks has contributed an Airflow operator which enables submitting runs to the Databricks platform. Internally the operator talks to the api/2.0/jobs/runs/submit endpoint.
DatabricksSubmitRunOperator
class airflow.contrib.operators.databricks_operator.DatabricksSubmitRunOperator(json=None, spark_jar_task=None, notebook_task=None, new_cluster=None, existing_cluster_id=None, libraries=None, run_name=None, timeout_seconds=None, databricks_conn_id='databricks_default', polling_period_seconds=30, databricks_retry_limit=3, databricks_retry_delay=1, do_xcom_push=False, **kwargs)[source]
Bases: airflow.models.BaseOperator
Submits a Spark job run to Databricks using the api/2.0/jobs/runs/submit API endpoint.
There are two ways to instantiate this operator
In the first way, you can take the JSON payload that you typically use to call the api/2.0/jobs/runs/submit endpoint and pass it directly to our DatabricksSubmitRunOperator through the json parameter. For example:
json = {
    'new_cluster': {
        'spark_version': '2.1.0-db3-scala2.11',
        'num_workers': 2
    },
    'notebook_task': {
        'notebook_path': '/Users/airflow@example.com/PrepareData'
    }
}
notebook_run = DatabricksSubmitRunOperator(task_id='notebook_run', json=json)
Another way to accomplish the same thing is to use the named parameters of the DatabricksSubmitRunOperator directly. Note that there is exactly one named parameter for each top level parameter in the runs/submit endpoint. In this method your code would look like this:
new_cluster = {
    'spark_version': '2.1.0-db3-scala2.11',
    'num_workers': 2
}
notebook_task = {
    'notebook_path': '/Users/airflow@example.com/PrepareData'
}
notebook_run = DatabricksSubmitRunOperator(
    task_id='notebook_run',
    new_cluster=new_cluster,
    notebook_task=notebook_task)
In the case where both the json parameter AND the named parameters are provided, they will be merged together. If there are conflicts during the merge, the named parameters will take precedence and override the top level json keys.
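The merge rule described above can be sketched in plain Python: start from the json payload and let any explicitly provided named parameter win on conflict (merge_run_params is an illustrative helper, not the operator's actual code):

```python
def merge_run_params(json_params, **named_params):
    # Named parameters override top-level keys of the json payload,
    # mirroring the precedence rule DatabricksSubmitRunOperator applies.
    merged = dict(json_params or {})
    merged.update({k: v for k, v in named_params.items() if v is not None})
    return merged

params = merge_run_params({'run_name': 'from-json', 'timeout_seconds': 600},
                          run_name='from-named-param')
```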
Currently the named parameters that DatabricksSubmitRunOperator supports are
· spark_jar_task
· notebook_task
· new_cluster
· existing_cluster_id
· libraries
· run_name
· timeout_seconds
Parameters
· json (dict) –
A JSON object containing API parameters which will be passed directly to the api/2.0/jobs/runs/submit endpoint. The other named parameters (i.e. spark_jar_task, notebook_task) to this operator will be merged with this json dictionary if they are provided. If there are conflicts during the merge, the named parameters will take precedence and override the top level json keys. (templated)
See also
For more information about templating see Jinja Templating. https://docs.databricks.com/api/latest/jobs.html#runs-submit
· spark_jar_task (dict) –
The main class and parameters for the JAR task Note that the actual JAR is specified in the libraries EITHER spark_jar_task OR notebook_task should be specified This field will be templated
See also
https://docs.databricks.com/api/latest/jobs.html#jobssparkjartask
· notebook_task (dict) –
The notebook path and parameters for the notebook task EITHER spark_jar_task OR notebook_task should be specified This field will be templated
See also
https://docs.databricks.com/api/latest/jobs.html#jobsnotebooktask
· new_cluster (dict) –
Specs for a new cluster on which this task will be run EITHER new_cluster OR existing_cluster_id should be specified This field will be templated
See also
https://docs.databricks.com/api/latest/jobs.html#jobsclusterspecnewcluster
· existing_cluster_id (str) – ID for existing cluster on which to run this task EITHER new_cluster OR existing_cluster_id should be specified This field will be templated
· libraries (list of dicts) –
Libraries which this run will use This field will be templated
See also
https://docs.databricks.com/api/latest/libraries.html#managedlibrarieslibrary
· run_name (str) – The run name used for this task. By default this will be set to the Airflow task_id. This task_id is a required parameter of the superclass BaseOperator. This field will be templated.
· timeout_seconds (int32) – The timeout for this run. By default a value of 0 is used, which means to have no timeout. This field will be templated.
· databricks_conn_id (str) – The name of the Airflow connection to use. By default and in the common case this will be databricks_default. To use token based authentication, provide the key token in the extra field for the connection.
· polling_period_seconds (int) – Controls the rate which we poll for the result of this run. By default the operator will poll every 30 seconds.
· databricks_retry_limit (int) – Amount of times to retry if the Databricks backend is unreachable. Its value must be greater than or equal to 1.
· databricks_retry_delay (float) – Number of seconds to wait between retries (it might be a floating point number)
· do_xcom_push (bool) – Whether we should push run_id and run_page_url to xcom
GCP Google Cloud Platform
Airflow has extensive support for the Google Cloud Platform. But note that most Hooks and Operators are in the contrib section, which means they have a beta status: they can have breaking changes between minor releases.
See the GCP connection type documentation to configure connections to GCP
Logging
Airflow can be configured to read and write task logs in Google Cloud Storage See Writing Logs to Google Cloud Storage
BigQuery
BigQuery Operators
· BigQueryCheckOperator Performs checks against a SQL query that will return a single row with different values
· BigQueryValueCheckOperator Performs a simple value check using SQL code
· BigQueryIntervalCheckOperator Checks that the values of metrics given as SQL expressions are within a certain tolerance of the ones from days_back before
· BigQueryGetDataOperator Fetches the data from a BigQuery table and returns data in a python list
· BigQueryCreateEmptyTableOperator Creates a new empty table in the specified BigQuery dataset optionally with schema
· BigQueryCreateExternalTableOperator Creates a new external table in the dataset with the data in Google Cloud Storage
· BigQueryDeleteDatasetOperator Deletes an existing BigQuery dataset
· BigQueryCreateEmptyDatasetOperator Creates an empty BigQuery dataset
· BigQueryOperator Executes BigQuery SQL queries in a specific BigQuery database
· BigQueryToBigQueryOperator Copy a BigQuery table to another BigQuery table
· BigQueryToCloudStorageOperator Transfers a BigQuery table to a Google Cloud Storage bucket
BigQueryCheckOperator
class airflow.contrib.operators.bigquery_check_operator.BigQueryCheckOperator(sql, bigquery_conn_id='bigquery_default', use_legacy_sql=True, *args, **kwargs)[source]
Bases: airflow.operators.check_operator.CheckOperator
Performs checks against BigQuery. The BigQueryCheckOperator expects a sql query that will return a single row. Each value on that first row is evaluated using python bool casting. If any of the values return False, the check is failed and errors out.
Note that Python bool casting evals the following as False
· False
· 0
· Empty string ('')
· Empty list ([])
· Empty dictionary or set ({})
Given a query like SELECT COUNT(*) FROM foo, it will fail only if the count == 0. You can craft much more complex queries that could, for instance, check that the table has the same number of rows as the source table upstream, or that the count of today's partition is greater than yesterday's partition, or that a set of metrics are less than 3 standard deviations from the 7 day average.
This operator can be used as a data quality check in your pipeline, and depending on where you put it in your DAG, you have the choice to stop the critical path, preventing from publishing dubious data, or on the side and receive email alerts without stopping the progress of the DAG.
Parameters
· sql (str) – the sql to be executed
· bigquery_conn_id (str) – reference to the BigQuery database
· use_legacy_sql (bool) – Whether to use legacy SQL (true) or standard SQL (false)
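The bool-casting semantics above are easy to verify in plain Python; a sketch of the per-row evaluation (row_passes_check is illustrative, not the operator's actual code):

```python
def row_passes_check(row):
    # Every value of the first returned row is cast with bool();
    # any falsy value (False, 0, '', [], {}) fails the check.
    return all(bool(value) for value in row)

row_passes_check([42, 'ok'])  # a non-zero count passes
row_passes_check([0])         # SELECT COUNT(*) returning 0 fails
```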
BigQueryValueCheckOperator
class airflow.contrib.operators.bigquery_check_operator.BigQueryValueCheckOperator(sql, pass_value, tolerance=None, bigquery_conn_id='bigquery_default', use_legacy_sql=True, *args, **kwargs)[source]
Bases: airflow.operators.check_operator.ValueCheckOperator
Performs a simple value check using sql code.
Parameters
· sql (str) – the sql to be executed
· use_legacy_sql (bool) – Whether to use legacy SQL (true) or standard SQL (false)
BigQueryIntervalCheckOperator
class airflow.contrib.operators.bigquery_check_operator.BigQueryIntervalCheckOperator(table, metrics_thresholds, date_filter_column='ds', days_back=-7, bigquery_conn_id='bigquery_default', use_legacy_sql=True, *args, **kwargs)[source]
Bases: airflow.operators.check_operator.IntervalCheckOperator
Checks that the values of metrics given as SQL expressions are within a certain tolerance of the ones from days_back before
This method constructs a query like so
SELECT {metrics_threshold_dict_key} FROM {table}
WHERE {date_filter_column}=<date>
Parameters
· table (str) – the table name
· days_back (int) – number of days between ds and the ds we want to check against. Defaults to 7 days.
· metrics_thresholds (dict) – a dictionary of ratios indexed by metrics; for example 'COUNT(*)': 1.5 would require a 50 percent or less difference between the current day and the prior days_back
· use_legacy_sql (bool) – Whether to use legacy SQL (true) or standard SQL (false)
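The tolerance check can be sketched in plain Python: with a threshold of 1.5 for COUNT(*), the larger of the two values may exceed the smaller by at most 50 percent in either direction (within_threshold is illustrative, not the operator's internals):

```python
def within_threshold(current_value, reference_value, max_ratio):
    # A max_ratio of 1.5 tolerates up to a 50% difference between today's
    # metric and the one from days_back ago, in either direction.
    ratio = max(current_value, reference_value) / min(current_value, reference_value)
    return ratio < max_ratio

within_threshold(120, 100, 1.5)  # within tolerance
within_threshold(200, 100, 1.5)  # a 2x jump breaches the threshold
```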
BigQueryGetDataOperator
class airflow.contrib.operators.bigquery_get_data.BigQueryGetDataOperator(dataset_id, table_id, max_results='100', selected_fields=None, bigquery_conn_id='bigquery_default', delegate_to=None, *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Fetches the data from a BigQuery table (alternatively fetch data for selected columns) and returns data in a python list. The number of elements in the returned list will be equal to the number of rows fetched. Each element in the list will again be a list, where each element would represent the column values for that row.
Example Result [['Tony' '10'] ['Mike' '20'] ['Steve' '15']]
Note
If you pass fields to selected_fields which are in a different order than the order of columns already in the BQ table, the data will still be in the order of the BQ table. For example if the BQ table has 3 columns as [A,B,C] and you pass 'B,A' in selected_fields, the data would still be of the form 'A,B'.
Example
get_data = BigQueryGetDataOperator(
    task_id='get_data_from_bq',
    dataset_id='test_dataset',
    table_id='Transaction_partitions',
    max_results='100',
    selected_fields='DATE',
    bigquery_conn_id='airflow-service-account'
)
Parameters
· dataset_id (str) – The dataset ID of the requested table (templated)
· table_id (str) – The table ID of the requested table (templated)
· max_results (str) – The maximum number of records (rows) to be fetched from the table (templated)
· selected_fields (str) – List of fields to return (commaseparated) If unspecified all fields are returned
· bigquery_conn_id (str) – reference to a specific BigQuery hook
· delegate_to (str) – The account to impersonate if any For this to work the service account making the request must have domainwide delegation enabled
BigQueryCreateEmptyTableOperator
class airflow.contrib.operators.bigquery_operator.BigQueryCreateEmptyTableOperator(dataset_id, table_id, project_id=None, schema_fields=None, gcs_schema_object=None, time_partitioning=None, bigquery_conn_id='bigquery_default', google_cloud_storage_conn_id='google_cloud_default', delegate_to=None, labels=None, *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Creates a new empty table in the specified BigQuery dataset optionally with schema
The schema to be used for the BigQuery table may be specified in one of two ways You may either directly pass the schema fields in or you may point the operator to a Google cloud storage object name The object in Google cloud storage must be a JSON file with the schema fields in it You can also create a table without schema
Parameters
· project_id (str) – The project to create the table into (templated)
· dataset_id (str) – The dataset to create the table into (templated)
· table_id (str) – The Name of the table to be created (templated)
· schema_fields (list) –
If set, the schema field list as defined here: https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs#configuration.load.schema
Example:
schema_fields=[{"name": "emp_name", "type": "STRING", "mode": "REQUIRED"},
               {"name": "salary", "type": "INTEGER", "mode": "NULLABLE"}]
· gcs_schema_object (str) – Full path to the JSON file containing schema (templated). For example: gs://test-bucket/dir1/dir2/employee_schema.json
· time_partitioning (dict) –
configure optional time partitioning fields ie partition by field type and expiration as per API specifications
See also
https://cloud.google.com/bigquery/docs/reference/rest/v2/tables#timePartitioning
· bigquery_conn_id (str) – Reference to a specific BigQuery hook
· google_cloud_storage_conn_id (str) – Reference to a specific Google cloud storage hook
· delegate_to (str) – The account to impersonate if any For this to work the service account making the request must have domainwide delegation enabled
· labels (dict) –
a dictionary containing labels for the table passed to BigQuery
Example (with schema JSON in GCS)
CreateTable = BigQueryCreateEmptyTableOperator(
    task_id='BigQueryCreateEmptyTableOperator_task',
    dataset_id='ODS',
    table_id='Employees',
    project_id='internal-gcp-project',
    gcs_schema_object='gs://schema-bucket/employee_schema.json',
    bigquery_conn_id='airflow-service-account',
    google_cloud_storage_conn_id='airflow-service-account'
)
Corresponding schema file (employee_schema.json):
[
  {
    "mode": "NULLABLE",
    "name": "emp_name",
    "type": "STRING"
  },
  {
    "mode": "REQUIRED",
    "name": "salary",
    "type": "INTEGER"
  }
]
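A schema file in this shape is plain JSON, so it can be sanity-checked locally before it is uploaded to GCS. The helper below is a generic sketch, not part of Airflow:

```python
import json

# The employee schema above, as it would sit in employee_schema.json.
schema_json = """
[
  {"mode": "NULLABLE", "name": "emp_name", "type": "STRING"},
  {"mode": "REQUIRED", "name": "salary", "type": "INTEGER"}
]
"""

def valid_schema_fields(text):
    """Check that each field dict carries the keys shown in the example."""
    fields = json.loads(text)
    return all({"name", "type", "mode"} <= set(f) for f in fields)

print(valid_schema_fields(schema_json))  # True for the schema above
```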
Example (with schema in the DAG):
CreateTable = BigQueryCreateEmptyTableOperator(
    task_id='BigQueryCreateEmptyTableOperator_task',
    dataset_id='ODS',
    table_id='Employees',
    project_id='internal-gcp-project',
    schema_fields=[{"name": "emp_name", "type": "STRING", "mode": "REQUIRED"},
                   {"name": "salary", "type": "INTEGER", "mode": "NULLABLE"}],
    bigquery_conn_id='airflow-service-account',
    google_cloud_storage_conn_id='airflow-service-account'
)
BigQueryCreateExternalTableOperator
class airflow.contrib.operators.bigquery_operator.BigQueryCreateExternalTableOperator(bucket, source_objects, destination_project_dataset_table, schema_fields=None, schema_object=None, source_format='CSV', compression='NONE', skip_leading_rows=0, field_delimiter=',', max_bad_records=0, quote_character=None, allow_quoted_newlines=False, allow_jagged_rows=False, bigquery_conn_id='bigquery_default', google_cloud_storage_conn_id='google_cloud_default', delegate_to=None, src_fmt_configs={}, labels=None, *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Creates a new external table in the dataset with the data in Google Cloud Storage.
The schema to be used for the BigQuery table may be specified in one of two ways. You may either directly pass the schema fields in, or you may point the operator to a Google Cloud Storage object name. The object in Google Cloud Storage must be a JSON file with the schema fields in it.
Parameters
· bucket (str) – The bucket to point the external table to. (templated)
· source_objects (list) – List of Google Cloud Storage URIs to point the table to. (templated) If source_format is 'DATASTORE_BACKUP', the list must only contain a single URI.
· destination_project_dataset_table (str) – The dotted (<project>.)<dataset>.<table> BigQuery table to load data into (templated). If <project> is not included, project will be the project defined in the connection json.
· schema_fields (list) –
If set, the schema field list as defined here: https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs#configuration.load.schema
Example:
schema_fields=[{"name": "emp_name", "type": "STRING", "mode": "REQUIRED"},
               {"name": "salary", "type": "INTEGER", "mode": "NULLABLE"}]
Should not be set when source_format is 'DATASTORE_BACKUP'.
· schema_object (str) – If set, a GCS object path pointing to a .json file that contains the schema for the table. (templated)
· source_format (str) – File format of the data.
· compression (str) – [Optional] The compression type of the data source. Possible values include GZIP and NONE. The default value is NONE. This setting is ignored for Google Cloud Bigtable, Google Cloud Datastore backups and Avro formats.
· skip_leading_rows (int) – Number of rows to skip when loading from a CSV.
· field_delimiter (str) – The delimiter to use for the CSV.
· max_bad_records (int) – The maximum number of bad records that BigQuery can ignore when running the job.
· quote_character (str) – The value that is used to quote data sections in a CSV file.
· allow_quoted_newlines (bool) – Whether to allow quoted newlines (true) or not (false).
· allow_jagged_rows (bool) – Accept rows that are missing trailing optional columns. The missing values are treated as nulls. If false, records with missing trailing columns are treated as bad records, and if there are too many bad records, an invalid error is returned in the job result. Only applicable to CSV; ignored for other formats.
· bigquery_conn_id (str) – Reference to a specific BigQuery hook.
· google_cloud_storage_conn_id (str) – Reference to a specific Google Cloud Storage hook.
· delegate_to (str) – The account to impersonate, if any. For this to work, the service account making the request must have domain-wide delegation enabled.
· src_fmt_configs (dict) – configure optional fields specific to the source format
· labels (dict) – a dictionary containing labels for the table, passed to BigQuery
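The operator points the external table at gs:// URIs derived from bucket and source_objects. As a rough sketch of that convention (gcs_uris here is a hypothetical helper, not an Airflow API):

```python
# Hypothetical helper: build the gs:// URIs that the operator derives
# from its `bucket` and `source_objects` arguments.
def gcs_uris(bucket, source_objects):
    return ["gs://{}/{}".format(bucket, obj) for obj in source_objects]

uris = gcs_uris("test-bucket", ["data/part-0.csv", "data/part-1.csv"])
print(uris)
```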
BigQueryDeleteDatasetOperator
class airflow.contrib.operators.bigquery_operator.BigQueryDeleteDatasetOperator(dataset_id, project_id=None, bigquery_conn_id='bigquery_default', delegate_to=None, *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
This operator deletes an existing dataset from your project in BigQuery: https://cloud.google.com/bigquery/docs/reference/rest/v2/datasets/delete
Parameters
· project_id (str) – The project id of the dataset.
· dataset_id (str) – The dataset to be deleted.
Example:
delete_temp_data = BigQueryDeleteDatasetOperator(dataset_id='temp-dataset',
                                                 project_id='temp-project',
                                                 bigquery_conn_id='_my_gcp_conn_',
                                                 task_id='Deletetemp',
                                                 dag=dag)
BigQueryCreateEmptyDatasetOperator
class airflow.contrib.operators.bigquery_operator.BigQueryCreateEmptyDatasetOperator(dataset_id, project_id=None, dataset_reference=None, bigquery_conn_id='bigquery_default', delegate_to=None, *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
This operator is used to create a new dataset for your project in BigQuery: https://cloud.google.com/bigquery/docs/reference/rest/v2/datasets#resource
Parameters
· project_id (str) – The name of the project where we want to create the dataset. Don't need to provide if projectId is in dataset_reference.
· dataset_id (str) – The id of the dataset. Don't need to provide if datasetId is in dataset_reference.
· dataset_reference – Dataset reference that could be provided with the request body. More info: https://cloud.google.com/bigquery/docs/reference/rest/v2/datasets#resource
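If you pass dataset_reference instead of project_id/dataset_id, a minimal request body following the linked Datasets resource might look like the sketch below (all values are hypothetical):

```python
# Minimal Dataset resource body: projectId/datasetId inside
# "datasetReference" take the place of the project_id/dataset_id arguments.
dataset_reference = {
    "datasetReference": {
        "projectId": "my-gcp-project",
        "datasetId": "my_temp_dataset",
    }
}
print(dataset_reference["datasetReference"]["datasetId"])
```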
BigQueryOperator
class airflow.contrib.operators.bigquery_operator.BigQueryOperator(sql=None, destination_dataset_table=False, write_disposition='WRITE_EMPTY', allow_large_results=False, flatten_results=None, bigquery_conn_id='bigquery_default', delegate_to=None, udf_config=False, use_legacy_sql=True, maximum_billing_tier=None, maximum_bytes_billed=None, create_disposition='CREATE_IF_NEEDED', schema_update_options=(), query_params=None, labels=None, priority='INTERACTIVE', time_partitioning=None, api_resource_configs=None, cluster_fields=None, *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Executes BigQuery SQL queries in a specific BigQuery database.
Parameters
· sql (Can receive a str representing a sql statement, a list of str (sql statements), or a reference to a template file. Template references are recognized by str ending in '.sql') – the sql code to be executed (templated)
· destination_dataset_table (str) – A dotted (<project>.|<project>:)<dataset>.<table> that, if set, will store the results of the query. (templated)
· write_disposition (str) – Specifies the action that occurs if the destination table already exists. (default: 'WRITE_EMPTY')
· create_disposition (str) – Specifies whether the job is allowed to create new tables. (default: 'CREATE_IF_NEEDED')
· allow_large_results (bool) – Whether to allow large results.
· flatten_results (bool) – If true and the query uses the legacy SQL dialect, flattens all nested and repeated fields in the query results. allow_large_results must be true if this is set to false. For standard SQL queries, this flag is ignored and results are never flattened.
· bigquery_conn_id (str) – reference to a specific BigQuery hook
· delegate_to (str) – The account to impersonate, if any. For this to work, the service account making the request must have domain-wide delegation enabled.
· udf_config (list) – The User Defined Function configuration for the query. See https://cloud.google.com/bigquery/user-defined-functions for details.
· use_legacy_sql (bool) – Whether to use legacy SQL (true) or standard SQL (false).
· maximum_billing_tier (int) – Positive integer that serves as a multiplier of the basic price. Defaults to None, in which case it uses the value set in the project.
· maximum_bytes_billed (float) – Limits the bytes billed for this job. Queries that will have bytes billed beyond this limit will fail (without incurring a charge). If unspecified, this will be set to your project default.
· api_resource_configs (dict) – a dictionary that contains params 'configuration' applied for the Google BigQuery Jobs API: https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs for example, {'query': {'useQueryCache': False}}. You could use it if you need to provide some params that are not supported by BigQueryOperator, like args.
· schema_update_options (tuple) – Allows the schema of the destination table to be updated as a side effect of the load job.
· query_params (dict) – a dictionary containing query parameter types and values, passed to BigQuery
· labels (dict) – a dictionary containing labels for the job/query, passed to BigQuery
· priority (str) – Specifies a priority for the query. Possible values include INTERACTIVE and BATCH. The default value is INTERACTIVE.
· time_partitioning (dict) – configure optional time partitioning fields, i.e. partition by field, type and expiration as per API specifications.
· cluster_fields (list of str) – Request that the result of this query be stored sorted by one or more columns. This is only available in conjunction with time_partitioning. The order of columns given determines the sort order.
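Several parameters above use the dotted (<project>.|<project>:)<dataset>.<table> convention, with the connection's project as the fallback when <project> is omitted. A rough illustration of how such a string decomposes (this helper is illustrative only, not the hook's actual implementation):

```python
def split_tablename(table_input, default_project_id):
    """Split '[project(:|.)]dataset.table' into its three parts."""
    # Accept both 'project:dataset.table' and 'project.dataset.table'.
    parts = table_input.replace(":", ".").split(".")
    if len(parts) == 3:
        project_id, dataset_id, table_id = parts
    elif len(parts) == 2:
        project_id = default_project_id  # fall back to the connection's project
        dataset_id, table_id = parts
    else:
        raise ValueError("expected [project.]dataset.table, got %r" % table_input)
    return project_id, dataset_id, table_id

print(split_tablename("my-project:ODS.Employees", "fallback-project"))
print(split_tablename("ODS.Employees", "fallback-project"))
```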
BigQueryTableDeleteOperator
class airflow.contrib.operators.bigquery_table_delete_operator.BigQueryTableDeleteOperator(deletion_dataset_table, bigquery_conn_id='bigquery_default', delegate_to=None, ignore_if_missing=False, *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Deletes BigQuery tables.
Parameters
· deletion_dataset_table (str) – A dotted (<project>.|<project>:)<dataset>.<table> that indicates which table will be deleted. (templated)
· bigquery_conn_id (str) – reference to a specific BigQuery hook
· delegate_to (str) – The account to impersonate, if any. For this to work, the service account making the request must have domain-wide delegation enabled.
· ignore_if_missing (bool) – if True, then return success even if the requested table does not exist.
BigQueryToBigQueryOperator
class airflow.contrib.operators.bigquery_to_bigquery.BigQueryToBigQueryOperator(source_project_dataset_tables, destination_project_dataset_table, write_disposition='WRITE_EMPTY', create_disposition='CREATE_IF_NEEDED', bigquery_conn_id='bigquery_default', delegate_to=None, labels=None, *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Copies data from one BigQuery table to another.
See also
For more details about these parameters: https://cloud.google.com/bigquery/docs/reference/v2/jobs#configuration.copy
Parameters
· source_project_dataset_tables (list|string) – One or more dotted (project:|project.)<dataset>.<table> BigQuery tables to use as the source data. If <project> is not included, project will be the project defined in the connection json. Use a list if there are multiple source tables. (templated)
· destination_project_dataset_table (str) – The destination BigQuery table. Format is: (project:|project.)<dataset>.<table> (templated)
· write_disposition (str) – The write disposition if the table already exists.
· create_disposition (str) – The create disposition if the table doesn't exist.
· bigquery_conn_id (str) – reference to a specific BigQuery hook
· delegate_to (str) – The account to impersonate, if any. For this to work, the service account making the request must have domain-wide delegation enabled.
· labels (dict) – a dictionary containing labels for the job/query, passed to BigQuery
BigQueryToCloudStorageOperator
class airflow.contrib.operators.bigquery_to_gcs.BigQueryToCloudStorageOperator(source_project_dataset_table, destination_cloud_storage_uris, compression='NONE', export_format='CSV', field_delimiter=',', print_header=True, bigquery_conn_id='bigquery_default', delegate_to=None, labels=None, *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Transfers a BigQuery table to a Google Cloud Storage bucket.
See also
For more details about these parameters: https://cloud.google.com/bigquery/docs/reference/v2/jobs
Parameters
· source_project_dataset_table (str) – The dotted (<project>.|<project>:)<dataset>.<table> BigQuery table to use as the source data. If <project> is not included, project will be the project defined in the connection json. (templated)
· destination_cloud_storage_uris (list) – The destination Google Cloud Storage URI (e.g. gs://some-bucket/some-file.txt). (templated) Follows the convention defined here: https://cloud.google.com/bigquery/exporting-data-from-bigquery#exportingmultiple
· compression (str) – Type of compression to use.
· export_format (str) – File format to export.
· field_delimiter (str) – The delimiter to use when extracting to a CSV.
· print_header (bool) – Whether to print a header for a CSV file extract.
· bigquery_conn_id (str) – reference to a specific BigQuery hook
· delegate_to (str) – The account to impersonate, if any. For this to work, the service account making the request must have domain-wide delegation enabled.
· labels (dict) – a dictionary containing labels for the job/query, passed to BigQuery
BigQueryHook
class airflow.contrib.hooks.bigquery_hook.BigQueryHook(bigquery_conn_id='bigquery_default', delegate_to=None, use_legacy_sql=True)[source]
Bases: airflow.contrib.hooks.gcp_api_base_hook.GoogleCloudBaseHook, airflow.hooks.dbapi_hook.DbApiHook, airflow.utils.log.logging_mixin.LoggingMixin
Interact with BigQuery. This hook uses the Google Cloud Platform connection.
get_conn()[source]
Returns a BigQuery PEP 249 connection object.
get_pandas_df(sql, parameters=None, dialect=None)[source]
Returns a Pandas DataFrame for the results produced by a BigQuery query. The DbApiHook method must be overridden because Pandas doesn't support PEP 249 connections, except for SQLite. See:
https://github.com/pydata/pandas/blob/master/pandas/io/sql.py#L447 https://github.com/pydata/pandas/issues/6900
Parameters
· sql (str) – The BigQuery SQL to execute.
· parameters (mapping or iterable) – The parameters to render the SQL query with (not used, leave to override superclass method)
· dialect (str in {'legacy', 'standard'}) – Dialect of BigQuery SQL – legacy SQL or standard SQL; defaults to use self.use_legacy_sql if not specified
get_service()[source]
Returns a BigQuery service object.
insert_rows(table, rows, target_fields=None, commit_every=1000)[source]
Insertion is currently unsupported. Theoretically, you could use BigQuery's streaming API to insert rows into a table, but this hasn't been implemented.
table_exists(project_id, dataset_id, table_id)[source]
Checks for the existence of a table in Google BigQuery.
Parameters
· project_id (str) – The Google Cloud project in which to look for the table. The connection supplied to the hook must provide access to the specified project.
· dataset_id (str) – The name of the dataset in which to look for the table.
· table_id (str) – The name of the table to check the existence of.
Compute Engine
Compute Engine Operators
· GceInstanceStartOperator: start an existing Google Compute Engine instance.
· GceInstanceStopOperator: stop an existing Google Compute Engine instance.
· GceSetMachineTypeOperator: change the machine type for a stopped instance.
GceInstanceStartOperator
class airflow.contrib.operators.gcp_compute_operator.GceInstanceStartOperator(project_id, zone, resource_id, gcp_conn_id='google_cloud_default', api_version='v1', *args, **kwargs)[source]
Bases: airflow.contrib.operators.gcp_compute_operator.GceBaseOperator
Start an instance in Google Compute Engine.
Parameters
· project_id (str) – Google Cloud Platform project where the Compute Engine instance exists.
· zone (str) – Google Cloud Platform zone where the instance exists.
· resource_id (str) – Name of the Compute Engine instance resource.
· gcp_conn_id (str) – The connection ID used to connect to Google Cloud Platform.
· api_version (str) – API version used (e.g. v1).
GceInstanceStopOperator
class airflow.contrib.operators.gcp_compute_operator.GceInstanceStopOperator(project_id, zone, resource_id, gcp_conn_id='google_cloud_default', api_version='v1', *args, **kwargs)[source]
Bases: airflow.contrib.operators.gcp_compute_operator.GceBaseOperator
Stop an instance in Google Compute Engine.
Parameters
· project_id (str) – Google Cloud Platform project where the Compute Engine instance exists.
· zone (str) – Google Cloud Platform zone where the instance exists.
· resource_id (str) – Name of the Compute Engine instance resource.
· gcp_conn_id (str) – The connection ID used to connect to Google Cloud Platform.
· api_version (str) – API version used (e.g. v1).
GceSetMachineTypeOperator
class airflow.contrib.operators.gcp_compute_operator.GceSetMachineTypeOperator(project_id, zone, resource_id, body, gcp_conn_id='google_cloud_default', api_version='v1', validate_body=True, *args, **kwargs)[source]
Bases: airflow.contrib.operators.gcp_compute_operator.GceBaseOperator
Changes the machine type for a stopped instance to the machine type specified in the request.
Parameters
· project_id (str) – Google Cloud Platform project where the Compute Engine instance exists.
· zone (str) – Google Cloud Platform zone where the instance exists.
· resource_id (str) – Name of the Compute Engine instance resource.
· body (dict) – Body required by the Compute Engine setMachineType API, as described in https://cloud.google.com/compute/docs/reference/rest/v1/instances/setMachineType#request-body
· gcp_conn_id (str) – The connection ID used to connect to Google Cloud Platform.
· api_version (str) – API version used (e.g. v1).
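Per the linked setMachineType request body, the body argument typically carries a single machineType URL relative to the instance's zone. A minimal sketch (the zone and machine type below are hypothetical):

```python
# Partial URL of the target machine type, relative to the instance's zone,
# as described in the setMachineType request body reference.
SET_MACHINE_TYPE_BODY = {
    "machineType": "zones/europe-west1-b/machineTypes/n1-standard-2"
}
print(SET_MACHINE_TYPE_BODY["machineType"])
```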
Cloud Functions
Cloud Functions Operators
· GcfFunctionDeployOperator: deploy a Google Cloud Function to Google Cloud Platform.
· GcfFunctionDeleteOperator: delete a Google Cloud Function in Google Cloud Platform.
GcfFunctionDeployOperator
class airflow.contrib.operators.gcp_function_operator.GcfFunctionDeployOperator(project_id, location, body, gcp_conn_id='google_cloud_default', api_version='v1', zip_path=None, validate_body=True, *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Creates a function in Google Cloud Functions.
Parameters
· project_id (str) – Google Cloud Platform Project ID where the function should be created.
· location (str) – Google Cloud Platform region where the function should be created.
· body (dict or google.cloud.functions.v1.CloudFunction) – Body of the Cloud Functions definition. The body must be a Cloud Functions dictionary as described in: https://cloud.google.com/functions/docs/reference/rest/v1/projects.locations.functions Different API versions require different variants of the Cloud Functions dictionary.
· gcp_conn_id (str) – The connection ID to use to connect to Google Cloud Platform.
· api_version (str) – API version used (for example v1 or v1beta1).
· zip_path (str) – Path to a zip file containing the source code of the function. If the path is set, the sourceUploadUrl should not be specified in the body, or it should be empty. Then the zip file will be uploaded using the upload URL generated via generateUploadUrl from the Cloud Functions API.
· validate_body (bool) – If set to False, body validation is not performed.
GcfFunctionDeleteOperator
class airflow.contrib.operators.gcp_function_operator.GcfFunctionDeleteOperator(name, gcp_conn_id='google_cloud_default', api_version='v1', *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Deletes the specified function from Google Cloud Functions.
Parameters
· name (str) – A fully-qualified function name, matching the pattern: ^projects/[^/]+/locations/[^/]+/functions/[^/]+$
· gcp_conn_id (str) – The connection ID to use to connect to Google Cloud Platform.
· api_version (str) – API version used (for example v1 or v1beta1).
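The name pattern above is an ordinary regular expression, so a candidate name can be checked locally before the task ever runs. A small sketch (the project/region/function names are hypothetical):

```python
import re

# The fully-qualified function name pattern from the parameter description.
FUNCTION_NAME_PATTERN = r"^projects/[^/]+/locations/[^/]+/functions/[^/]+$"

def is_valid_function_name(name):
    return re.match(FUNCTION_NAME_PATTERN, name) is not None

print(is_valid_function_name(
    "projects/my-gcp-project/locations/europe-west1/functions/my-function"))
```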
Cloud Functions Hook
class airflow.contrib.hooks.gcp_function_hook.GcfHook(api_version, gcp_conn_id='google_cloud_default', delegate_to=None)[source]
Bases: airflow.contrib.hooks.gcp_api_base_hook.GoogleCloudBaseHook
Hook for the Google Cloud Functions APIs.
create_new_function(full_location, body)[source]
Creates a new function in Cloud Functions in the location specified in the body.
Parameters
· full_location (str) – full location including the project, in the form projects/<PROJECT>/locations/<LOCATION>
· body (dict) – body required by the Cloud Functions insert API
Returns
response returned by the operation
Return type
dict
delete_function(name)[source]
Deletes the specified Cloud Function.
Parameters
name (str) – name of the function
Returns
response returned by the operation
Return type
dict
get_conn()[source]
Retrieves the connection to Cloud Functions.
Returns
Google Cloud Functions services object
Return type
dict
get_function(name)[source]
Returns the Cloud Function with the given name.
Parameters
name (str) – name of the function
Returns
a CloudFunction object representing the function
Return type
dict
list_functions(full_location)[source]
Lists all Cloud Functions created in the location.
Parameters
full_location (str) – full location including the project, in the form projects/<PROJECT>/locations/<LOCATION>
Returns
array of CloudFunction objects representing functions in the location
Return type
[dict]
update_function(name, body, update_mask)[source]
Updates Cloud Functions according to the specified update mask.
Parameters
· name (str) – name of the function
· body (str) – body required by the Cloud Functions patch API
· update_mask ([str]) – update mask array of fields that should be patched
Returns
response returned by the operation
Return type
dict
upload_function_zip(parent, zip_path)[source]
Uploads a zip file with sources.
Parameters
· parent (str) – Google Cloud Platform project id and region where the zip file should be uploaded, in the form projects/<PROJECT>/locations/<LOCATION>
· zip_path (str) – path of the valid zip file to upload
Returns
Upload URL that was returned by the generateUploadUrl method
Cloud DataFlow
DataFlow Operators
· DataFlowJavaOperator: launching Cloud Dataflow jobs written in Java.
· DataflowTemplateOperator: launching a templated Cloud Dataflow batch job.
· DataFlowPythonOperator: launching Cloud Dataflow jobs written in Python.
DataFlowJavaOperator
class airflow.contrib.operators.dataflow_operator.DataFlowJavaOperator(jar, job_name='{{task.task_id}}', dataflow_default_options=None, options=None, gcp_conn_id='google_cloud_default', delegate_to=None, poll_sleep=10, job_class=None, *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Start a Java Cloud Dataflow batch job. The parameters of the operation will be passed to the job.
See also
For more detail on job submission have a look at the reference: https://cloud.google.com/dataflow/pipelines/specifying-exec-params
Parameters
· jar (str) – The reference to a self-executing Dataflow jar. (templated)
· job_name (str) – The 'jobName' to use when executing the Dataflow job (templated). This ends up being set in the pipeline options, so any entry with key 'jobName' in options will be overwritten.
· dataflow_default_options (dict) – Map of default job options.
· options (dict) – Map of job-specific options.
· gcp_conn_id (str) – The connection ID to use connecting to Google Cloud Platform.
· delegate_to (str) – The account to impersonate, if any. For this to work, the service account making the request must have domain-wide delegation enabled.
· poll_sleep (int) – The time in seconds to sleep between polling Google Cloud Platform for the dataflow job status, while the job is in the JOB_STATE_RUNNING state.
· job_class (str) – The name of the dataflow job class to be executed; it is often not the main class configured in the dataflow jar file.
jar, options, and job_name are templated, so you can use variables in them.
Note that both dataflow_default_options and options will be merged to specify pipeline execution parameters, and dataflow_default_options is expected to save high-level options, for instance project and zone information, which apply to all dataflow operators in the DAG.
It's a good practice to define dataflow_* parameters in the default_args of the DAG, like the project, zone and staging location.
default_args = {
    'dataflow_default_options': {
        'project': 'my-gcp-project',
        'zone': 'europe-west1-d',
        'stagingLocation': 'gs://my-staging-bucket/staging/'
    }
}
You need to pass the path to your dataflow as a file reference with the jar parameter; the jar needs to be a self-executing jar (see documentation here: https://beam.apache.org/documentation/runners/dataflow/#self-executing-jar). Use options to pass on options to your job.
t1 = DataFlowJavaOperator(
    task_id='dataflow_example',
    jar='{{var.value.gcp_dataflow_base}}pipeline/build/libs/pipeline-example-1.0.jar',
    options={
        'autoscalingAlgorithm': 'BASIC',
        'maxNumWorkers': '50',
        'start': '{{ds}}',
        'partitionType': 'DAY',
        'labels': {'foo': 'bar'}
    },
    gcp_conn_id='gcp-airflow-service-account',
    dag=my_dag)
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2016, 8, 1),
    'email': ['alex@vanboxel.be'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=30),
    'dataflow_default_options': {
        'project': 'my-gcp-project',
        'zone': 'us-central1-f',
        'stagingLocation': 'gs://bucket/tmp/dataflow/staging/'
    }
}

dag = DAG('test-dag', default_args=default_args)

task = DataFlowJavaOperator(
    gcp_conn_id='gcp_default',
    task_id='normalize-cal',
    jar='{{var.value.gcp_dataflow_base}}pipeline-ingress-cal-normalize-1.0.jar',
    options={
        'autoscalingAlgorithm': 'BASIC',
        'maxNumWorkers': '50',
        'start': '{{ds}}',
        'partitionType': 'DAY'
    },
    dag=dag)
DataflowTemplateOperator
class airflow.contrib.operators.dataflow_operator.DataflowTemplateOperator(template, job_name='{{task.task_id}}', dataflow_default_options=None, parameters=None, gcp_conn_id='google_cloud_default', delegate_to=None, poll_sleep=10, *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Start a Templated Cloud Dataflow batch job. The parameters of the operation will be passed to the job.
Parameters
· template (str) – The reference to the Dataflow template.
· job_name – The 'jobName' to use when executing the Dataflow template. (templated)
· dataflow_default_options (dict) – Map of default job environment options.
· parameters (dict) – Map of job-specific parameters for the template.
· gcp_conn_id (str) – The connection ID to use connecting to Google Cloud Platform.
· delegate_to (str) – The account to impersonate, if any. For this to work, the service account making the request must have domain-wide delegation enabled.
· poll_sleep (int) – The time in seconds to sleep between polling Google Cloud Platform for the dataflow job status, while the job is in the JOB_STATE_RUNNING state.
It's a good practice to define dataflow_* parameters in the default_args of the DAG, like the project, zone and staging location.
See also
https://cloud.google.com/dataflow/docs/reference/rest/v1b3/LaunchTemplateParameters https://cloud.google.com/dataflow/docs/reference/rest/v1b3/RuntimeEnvironment
default_args = {
    'dataflow_default_options': {
        'project': 'my-gcp-project',
        'zone': 'europe-west1-d',
        'tempLocation': 'gs://my-staging-bucket/staging/'
    }
}
You need to pass the path to your dataflow template as a file reference with the template parameter. Use parameters to pass on parameters to your job. Use environment to pass on runtime environment variables to your job.
t1 = DataflowTemplateOperator(
    task_id='dataflow_example',
    template='{{var.value.gcp_dataflow_base}}',
    parameters={
        'inputFile': "gs://bucket/input/my_input.txt",
        'outputFile': "gs://bucket/output/my_output.txt"
    },
    gcp_conn_id='gcp-airflow-service-account',
    dag=my_dag)
template, dataflow_default_options, parameters, and job_name are templated, so you can use variables in them.
Note that dataflow_default_options is expected to save high-level options, for instance project information, which apply to all dataflow operators in the DAG.
See also
https://cloud.google.com/dataflow/docs/reference/rest/v1b3/LaunchTemplateParameters https://cloud.google.com/dataflow/docs/reference/rest/v1b3/RuntimeEnvironment For more detail on job template execution have a look at the reference: https://cloud.google.com/dataflow/docs/templates/executing-templates
DataFlowPythonOperator
class airflow.contrib.operators.dataflow_operator.DataFlowPythonOperator(py_file, job_name='{{task.task_id}}', py_options=None, dataflow_default_options=None, options=None, gcp_conn_id='google_cloud_default', delegate_to=None, poll_sleep=10, *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Launching Cloud Dataflow jobs written in Python. Note that both dataflow_default_options and options will be merged to specify pipeline execution parameters, and dataflow_default_options is expected to save high-level options, for instance project and zone information, which apply to all dataflow operators in the DAG.
See also
For more detail on job submission have a look at the reference: https://cloud.google.com/dataflow/pipelines/specifying-exec-params
Parameters
· py_file (str) – Reference to the python dataflow pipeline file.py, e.g. /some/local/file/path/to/your/python/pipeline/file.py
· job_name (str) – The 'job_name' to use when executing the Dataflow job (templated). This ends up being set in the pipeline options, so any entry with key 'jobName' or 'job_name' in options will be overwritten.
· py_options – Additional python options.
· dataflow_default_options (dict) – Map of default job options.
· options (dict) – Map of job-specific options.
· gcp_conn_id (str) – The connection ID to use connecting to Google Cloud Platform.
· delegate_to (str) – The account to impersonate, if any. For this to work, the service account making the request must have domain-wide delegation enabled.
· poll_sleep (int) – The time in seconds to sleep between polling Google Cloud Platform for the dataflow job status, while the job is in the JOB_STATE_RUNNING state.
execute(context)[source]
Execute the python dataflow job.
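As noted above, options is merged on top of dataflow_default_options. The effective behaviour is roughly a dict copy-and-update, with the job-specific options winning on conflicts (a sketch, not the operator's exact code; all option values below are hypothetical):

```python
# DAG-wide defaults, shared by every Dataflow task in the DAG.
dataflow_default_options = {
    "project": "my-gcp-project",
    "zone": "us-central1-f",
}
# Job-specific options for one task; "zone" overrides the default.
options = {
    "staging_location": "gs://my-bucket/staging/",
    "zone": "europe-west1-d",
}

merged = dict(dataflow_default_options)
merged.update(options)  # job-specific options take precedence
print(merged["zone"])  # europe-west1-d: the job-specific zone wins
```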
DataFlowHook
class airflow.contrib.hooks.gcp_dataflow_hook.DataFlowHook(gcp_conn_id='google_cloud_default', delegate_to=None, poll_sleep=10)[source]
Bases: airflow.contrib.hooks.gcp_api_base_hook.GoogleCloudBaseHook
get_conn()[source]
Returns a Google Cloud Dataflow service object.
Cloud DataProc
DataProc Operators
· DataprocClusterCreateOperator: Create a new cluster on Google Cloud Dataproc.
· DataprocClusterDeleteOperator: Delete a cluster on Google Cloud Dataproc.
· DataprocClusterScaleOperator: Scale up or down a cluster on Google Cloud Dataproc.
· DataProcPigOperator: Start a Pig query Job on a Cloud DataProc cluster.
· DataProcHiveOperator: Start a Hive query Job on a Cloud DataProc cluster.
· DataProcSparkSqlOperator: Start a Spark SQL query Job on a Cloud DataProc cluster.
· DataProcSparkOperator: Start a Spark Job on a Cloud DataProc cluster.
· DataProcHadoopOperator: Start a Hadoop Job on a Cloud DataProc cluster.
· DataProcPySparkOperator: Start a PySpark Job on a Cloud DataProc cluster.
· DataprocWorkflowTemplateInstantiateOperator: Instantiate a WorkflowTemplate on Google Cloud Dataproc.
· DataprocWorkflowTemplateInstantiateInlineOperator: Instantiate a WorkflowTemplate inline on Google Cloud Dataproc.
DataprocClusterCreateOperator
class airflow.contrib.operators.dataproc_operator.DataprocClusterCreateOperator(cluster_name, project_id, num_workers, zone, network_uri=None, subnetwork_uri=None, internal_ip_only=None, tags=None, storage_bucket=None, init_actions_uris=None, init_action_timeout='10m', metadata=None, custom_image=None, image_version=None, properties=None, master_machine_type='n1-standard-4', master_disk_type='pd-standard', master_disk_size=500, worker_machine_type='n1-standard-4', worker_disk_type='pd-standard', worker_disk_size=500, num_preemptible_workers=0, labels=None, region='global', gcp_conn_id='google_cloud_default', delegate_to=None, service_account=None, service_account_scopes=None, idle_delete_ttl=None, auto_delete_time=None, auto_delete_ttl=None, *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Create a new cluster on Google Cloud Dataproc. The operator will wait until the creation is successful or an error occurs in the creation process.
The parameters allow you to configure the cluster. Please refer to
https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.clusters
for a detailed explanation of the different parameters. Most of the configuration parameters detailed in the link are available as a parameter to this operator.
Parameters
· cluster_name (str) – The name of the DataProc cluster to create. (templated)
· project_id (str) – The ID of the Google Cloud project in which to create the cluster. (templated)
· num_workers (int) – The # of workers to spin up. If set to zero, will spin up the cluster in single-node mode.
· storage_bucket (str) – The storage bucket to use; setting to None lets dataproc generate a custom one for you.
· init_actions_uris (list[string]) – List of GCS URIs containing dataproc initialization scripts.
· init_action_timeout (str) – Amount of time executable scripts in init_actions_uris have to complete.
· metadata (dict) – dict of key-value Google Compute Engine metadata entries to add to all instances.
· image_version (str) – the version of the software inside the Dataproc cluster.
· custom_image – custom Dataproc image; for more info see https://cloud.google.com/dataproc/docs/guides/dataproc-images
· properties (dict) – dict of properties to set on config files (e.g. spark-defaults.conf); see https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.clusters#SoftwareConfig
· master_machine_type (str) – Compute Engine machine type to use for the master node.
· master_disk_type (str) – Type of the boot disk for the master node (default is pd-standard). Valid values: pd-ssd (Persistent Disk Solid State Drive) or pd-standard (Persistent Disk Hard Disk Drive).
· master_disk_size (int) – Disk size for the master node.
· worker_machine_type (str) – Compute Engine machine type to use for the worker nodes.
· worker_disk_type (str) – Type of the boot disk for the worker node (default is pd-standard). Valid values: pd-ssd (Persistent Disk Solid State Drive) or pd-standard (Persistent Disk Hard Disk Drive).
· worker_disk_size (int) – Disk size for the worker nodes.
· num_preemptible_workers (int) – The # of preemptible worker nodes to spin up.
· labels (dict) – dict of labels to add to the cluster.
· zone (str) – The zone where the cluster will be located. (templated)
· network_uri (str) – The network uri to be used for machine communication; cannot be specified with subnetwork_uri.
· subnetwork_uri (str) – The subnetwork uri to be used for machine communication; cannot be specified with network_uri.
· internal_ip_only (bool) – If true, all instances in the cluster will only have internal IP addresses. This can only be enabled for subnetwork-enabled networks.
· tags (list[string]) – The GCE tags to add to all instances.
· region (str) – leave as 'global'; might become relevant in the future. (templated)
· gcp_conn_id (str) – The connection ID to use connecting to Google Cloud Platform.
· delegate_to (str) – The account to impersonate, if any. For this to work, the service account making the request must have domain-wide delegation enabled.
· service_account (str) – The service account of the dataproc instances.
· service_account_scopes (list[string]) – The URIs of service account scopes to be included.
· idle_delete_ttl (int) – The longest duration that the cluster would keep alive while staying idle. Passing this threshold will cause the cluster to be auto-deleted. A duration in seconds.
· auto_delete_time (datetime.datetime) – The time when the cluster will be auto-deleted.
· auto_delete_ttl (int) – The life duration of the cluster; the cluster will be auto-deleted at the end of this duration. A duration in seconds. (If auto_delete_time is set, this parameter will be ignored.)
Type
custom_image str
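The operator assembles these parameters into a cluster spec shaped like the Dataproc v1 clusters resource linked above. The sketch below is illustrative only, not the operator's actual internals; the field names follow the REST reference, and the single-node property `dataproc:dataproc.allow.zero.workers` is an assumption based on Dataproc's documented cluster properties:

```python
# Sketch: building a minimal Dataproc cluster config dict in the shape of
# the REST API's clusters resource. Values here are placeholders.
def build_cluster_config(cluster_name, project_id, num_workers, zone,
                         master_machine_type='n1-standard-4',
                         worker_machine_type='n1-standard-4'):
    config = {
        'projectId': project_id,
        'clusterName': cluster_name,
        'config': {
            'gceClusterConfig': {'zoneUri': zone},
            'masterConfig': {'numInstances': 1,
                             'machineTypeUri': master_machine_type},
        },
    }
    if num_workers == 0:
        # num_workers=0 means single-node mode, per the parameter docs above.
        config['config']['softwareConfig'] = {
            'properties': {'dataproc:dataproc.allow.zero.workers': 'true'}}
    else:
        config['config']['workerConfig'] = {
            'numInstances': num_workers,
            'machineTypeUri': worker_machine_type}
    return config

cfg = build_cluster_config('cluster-1', 'my-project', 2, 'us-central1-a')
```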
DataprocClusterScaleOperator
class airflow.contrib.operators.dataproc_operator.DataprocClusterScaleOperator(cluster_name, project_id, region='global', gcp_conn_id='google_cloud_default', delegate_to=None, num_workers=2, num_preemptible_workers=0, graceful_decommission_timeout=None, *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Scale, up or down, a cluster on Google Cloud Dataproc. The operator will wait until the cluster is re-scaled.
Example:
t1 = DataprocClusterScaleOperator(
        task_id='dataproc_scale',
        project_id='my-project',
        cluster_name='cluster-1',
        num_workers=10,
        num_preemptible_workers=10,
        graceful_decommission_timeout='1h',
        dag=dag)
See also
For more detail on about scaling clusters have a look at the reference: https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/scaling-clusters
Parameters
· cluster_name (str) – The name of the cluster to scale. (templated)
· project_id (str) – The ID of the google cloud project in which the cluster runs. (templated)
· region (str) – The region for the dataproc cluster. (templated)
· gcp_conn_id (str) – The connection ID to use connecting to Google Cloud Platform.
· num_workers (int) – The new number of workers.
· num_preemptible_workers (int) – The new number of preemptible workers.
· graceful_decommission_timeout (str) – Timeout for graceful YARN decommissioning. Maximum value is 1d.
· delegate_to (str) – The account to impersonate, if any. For this to work, the service account making the request must have domain-wide delegation enabled.
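The graceful_decommission_timeout accepts a duration string such as '1h', capped at 1d. A sketch of converting such a string into the seconds-suffixed form the API expects (the suffixes handled here and the 1-day cap follow the parameter description above; the helper itself is illustrative, not the operator's code):

```python
# Sketch: convert a duration string like '1h' into '<seconds>s',
# rejecting values over the documented 1d maximum.
def timeout_to_seconds(timeout):
    units = {'s': 1, 'm': 60, 'h': 3600, 'd': 86400}
    value, unit = int(timeout[:-1]), timeout[-1]
    if unit not in units:
        raise ValueError('unsupported unit: %r' % unit)
    seconds = value * units[unit]
    if seconds > 86400:  # graceful decommission timeout maximum is 1 day
        raise ValueError('timeout exceeds the 1d maximum')
    return '%ds' % seconds
```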
DataprocClusterDeleteOperator
class airflow.contrib.operators.dataproc_operator.DataprocClusterDeleteOperator(cluster_name, project_id, region='global', gcp_conn_id='google_cloud_default', delegate_to=None, *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Delete a cluster on Google Cloud Dataproc. The operator will wait until the cluster is destroyed.
Parameters
· cluster_name (str) – The name of the cluster to delete. (templated)
· project_id (str) – The ID of the google cloud project in which the cluster runs. (templated)
· region (str) – leave as 'global', might become relevant in the future. (templated)
· gcp_conn_id (str) – The connection ID to use connecting to Google Cloud Platform.
· delegate_to (str) – The account to impersonate, if any. For this to work, the service account making the request must have domain-wide delegation enabled.
DataProcPigOperator
class airflow.contrib.operators.dataproc_operator.DataProcPigOperator(query=None, query_uri=None, variables=None, job_name='{{task.task_id}}_{{ds_nodash}}', cluster_name='cluster-1', dataproc_pig_properties=None, dataproc_pig_jars=None, gcp_conn_id='google_cloud_default', delegate_to=None, region='global', job_error_states=['ERROR'], *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Start a Pig query Job on a Cloud DataProc cluster. The parameters of the operation will be passed to the cluster.
It's a good practice to define dataproc_* parameters in the default_args of the dag, like the cluster name and UDFs.
default_args = {
    'cluster_name': 'cluster-1',
    'dataproc_pig_jars': [
        'gs://example/udf/jar/datafu/1.2.0/datafu.jar',
        'gs://example/udf/jar/gpig/1.2/gpig.jar'
    ]
}
You can pass a pig script as string or file reference. Use variables to pass on variables for the pig script to be resolved on the cluster, or use the parameters to be resolved in the script as template parameters.
Example:
t1 = DataProcPigOperator(
        task_id='dataproc_pig',
        query='a_pig_script.pig',
        variables={'out': 'gs://example/output/{{ds}}'},
        dag=dag)
See also
For more detail on about job submission have a look at the reference: https://cloud.google.com/dataproc/reference/rest/v1/projects.regions.jobs
Parameters
· query (str) – The query or reference to the query file (pg or pig extension). (templated)
· query_uri (str) – The uri of a pig script on Cloud Storage.
· variables (dict) – Map of named parameters for the query. (templated)
· job_name (str) – The job name used in the DataProc cluster. This name by default is the task_id appended with the execution data, but can be templated. The name will always be appended with a random number to avoid name clashes. (templated)
· cluster_name (str) – The name of the DataProc cluster. (templated)
· dataproc_pig_properties (dict) – Map for the Pig properties. Ideal to put in default arguments.
· dataproc_pig_jars (list) – URIs to jars provisioned in Cloud Storage (example: for UDFs and libs) and are ideal to put in default arguments.
· gcp_conn_id (str) – The connection ID to use connecting to Google Cloud Platform.
· delegate_to (str) – The account to impersonate, if any. For this to work, the service account making the request must have domain-wide delegation enabled.
· region (str) – The specified region where the dataproc cluster is created.
· job_error_states (list) – Job states that should be considered error states. Any states in this list will result in an error being raised and failure of the task. Eg, if the CANCELLED state should also be considered a task failure, pass in ['ERROR', 'CANCELLED']. Possible values are currently only 'ERROR' and 'CANCELLED', but could change in the future. Defaults to ['ERROR'].
Variables
dataproc_job_id (str) – The actual "jobId" as submitted to the Dataproc API. This is useful for identifying or linking to the job in the Google Cloud Console Dataproc UI, as the actual "jobId" submitted to the Dataproc API is appended with an 8 character random string.
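The docs above say the job name defaults to the task_id plus the execution date and is always suffixed with an 8 character random string to avoid clashes. A sketch of how such a unique jobId could be derived (illustrative only, not the operator's actual code):

```python
import random
import string

# Sketch: derive a unique Dataproc-style jobId from a task id and the
# no-dash execution date, with an 8-character random suffix.
def make_job_id(task_id, ds_nodash):
    suffix = ''.join(random.choice(string.ascii_lowercase + string.digits)
                     for _ in range(8))
    return '{}_{}_{}'.format(task_id, ds_nodash, suffix)

job_id = make_job_id('dataproc_pig', '20210708')
```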
DataProcHiveOperator
class airflow.contrib.operators.dataproc_operator.DataProcHiveOperator(query=None, query_uri=None, variables=None, job_name='{{task.task_id}}_{{ds_nodash}}', cluster_name='cluster-1', dataproc_hive_properties=None, dataproc_hive_jars=None, gcp_conn_id='google_cloud_default', delegate_to=None, region='global', job_error_states=['ERROR'], *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Start a Hive query Job on a Cloud DataProc cluster.
Parameters
· query (str) – The query or reference to the query file (q extension).
· query_uri (str) – The uri of a hive script on Cloud Storage.
· variables (dict) – Map of named parameters for the query.
· job_name (str) – The job name used in the DataProc cluster. This name by default is the task_id appended with the execution data, but can be templated. The name will always be appended with a random number to avoid name clashes.
· cluster_name (str) – The name of the DataProc cluster.
· dataproc_hive_properties (dict) – Map for the Hive properties. Ideal to put in default arguments.
· dataproc_hive_jars (list) – URIs to jars provisioned in Cloud Storage (example: for UDFs and libs) and are ideal to put in default arguments.
· gcp_conn_id (str) – The connection ID to use connecting to Google Cloud Platform.
· delegate_to (str) – The account to impersonate, if any. For this to work, the service account making the request must have domain-wide delegation enabled.
· region (str) – The specified region where the dataproc cluster is created.
· job_error_states (list) – Job states that should be considered error states. Any states in this list will result in an error being raised and failure of the task. Eg, if the CANCELLED state should also be considered a task failure, pass in ['ERROR', 'CANCELLED']. Possible values are currently only 'ERROR' and 'CANCELLED', but could change in the future. Defaults to ['ERROR'].
Variables
dataproc_job_id (str) – The actual "jobId" as submitted to the Dataproc API. This is useful for identifying or linking to the job in the Google Cloud Console Dataproc UI, as the actual "jobId" submitted to the Dataproc API is appended with an 8 character random string.
DataProcSparkSqlOperator
class airflow.contrib.operators.dataproc_operator.DataProcSparkSqlOperator(query=None, query_uri=None, variables=None, job_name='{{task.task_id}}_{{ds_nodash}}', cluster_name='cluster-1', dataproc_spark_properties=None, dataproc_spark_jars=None, gcp_conn_id='google_cloud_default', delegate_to=None, region='global', job_error_states=['ERROR'], *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Start a Spark SQL query Job on a Cloud DataProc cluster.
Parameters
· query (str) – The query or reference to the query file (q extension). (templated)
· query_uri (str) – The uri of a spark sql script on Cloud Storage.
· variables (dict) – Map of named parameters for the query. (templated)
· job_name (str) – The job name used in the DataProc cluster. This name by default is the task_id appended with the execution data, but can be templated. The name will always be appended with a random number to avoid name clashes. (templated)
· cluster_name (str) – The name of the DataProc cluster. (templated)
· dataproc_spark_properties (dict) – Map for the Spark SQL properties. Ideal to put in default arguments.
· dataproc_spark_jars (list) – URIs to jars provisioned in Cloud Storage (example: for UDFs and libs) and are ideal to put in default arguments.
· gcp_conn_id (str) – The connection ID to use connecting to Google Cloud Platform.
· delegate_to (str) – The account to impersonate, if any. For this to work, the service account making the request must have domain-wide delegation enabled.
· region (str) – The specified region where the dataproc cluster is created.
· job_error_states (list) – Job states that should be considered error states. Any states in this list will result in an error being raised and failure of the task. Eg, if the CANCELLED state should also be considered a task failure, pass in ['ERROR', 'CANCELLED']. Possible values are currently only 'ERROR' and 'CANCELLED', but could change in the future. Defaults to ['ERROR'].
Variables
dataproc_job_id (str) – The actual "jobId" as submitted to the Dataproc API. This is useful for identifying or linking to the job in the Google Cloud Console Dataproc UI, as the actual "jobId" submitted to the Dataproc API is appended with an 8 character random string.
DataProcSparkOperator
class airflow.contrib.operators.dataproc_operator.DataProcSparkOperator(main_jar=None, main_class=None, arguments=None, archives=None, files=None, job_name='{{task.task_id}}_{{ds_nodash}}', cluster_name='cluster-1', dataproc_spark_properties=None, dataproc_spark_jars=None, gcp_conn_id='google_cloud_default', delegate_to=None, region='global', job_error_states=['ERROR'], *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Start a Spark Job on a Cloud DataProc cluster.
Parameters
· main_jar (str) – URI of the job jar provisioned on Cloud Storage. (use this or the main_class, not both together)
· main_class (str) – Name of the job class. (use this or the main_jar, not both together)
· arguments (list) – Arguments for the job. (templated)
· archives (list) – List of archived files that will be unpacked in the work directory. Should be stored in Cloud Storage.
· files (list) – List of files to be copied to the working directory.
· job_name (str) – The job name used in the DataProc cluster. This name by default is the task_id appended with the execution data, but can be templated. The name will always be appended with a random number to avoid name clashes. (templated)
· cluster_name (str) – The name of the DataProc cluster. (templated)
· dataproc_spark_properties (dict) – Map for the Spark properties. Ideal to put in default arguments.
· dataproc_spark_jars (list) – URIs to jars provisioned in Cloud Storage (example: for UDFs and libs) and are ideal to put in default arguments.
· gcp_conn_id (str) – The connection ID to use connecting to Google Cloud Platform.
· delegate_to (str) – The account to impersonate, if any. For this to work, the service account making the request must have domain-wide delegation enabled.
· region (str) – The specified region where the dataproc cluster is created.
· job_error_states (list) – Job states that should be considered error states. Any states in this list will result in an error being raised and failure of the task. Eg, if the CANCELLED state should also be considered a task failure, pass in ['ERROR', 'CANCELLED']. Possible values are currently only 'ERROR' and 'CANCELLED', but could change in the future. Defaults to ['ERROR'].
Variables
dataproc_job_id (str) – The actual "jobId" as submitted to the Dataproc API. This is useful for identifying or linking to the job in the Google Cloud Console Dataproc UI, as the actual "jobId" submitted to the Dataproc API is appended with an 8 character random string.
DataProcHadoopOperator
class airflow.contrib.operators.dataproc_operator.DataProcHadoopOperator(main_jar=None, main_class=None, arguments=None, archives=None, files=None, job_name='{{task.task_id}}_{{ds_nodash}}', cluster_name='cluster-1', dataproc_hadoop_properties=None, dataproc_hadoop_jars=None, gcp_conn_id='google_cloud_default', delegate_to=None, region='global', job_error_states=['ERROR'], *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Start a Hadoop Job on a Cloud DataProc cluster.
Parameters
· main_jar (str) – URI of the job jar provisioned on Cloud Storage. (use this or the main_class, not both together)
· main_class (str) – Name of the job class. (use this or the main_jar, not both together)
· arguments (list) – Arguments for the job. (templated)
· archives (list) – List of archived files that will be unpacked in the work directory. Should be stored in Cloud Storage.
· files (list) – List of files to be copied to the working directory.
· job_name (str) – The job name used in the DataProc cluster. This name by default is the task_id appended with the execution data, but can be templated. The name will always be appended with a random number to avoid name clashes. (templated)
· cluster_name (str) – The name of the DataProc cluster. (templated)
· dataproc_hadoop_properties (dict) – Map for the Hadoop properties. Ideal to put in default arguments.
· dataproc_hadoop_jars (list) – URIs to jars provisioned in Cloud Storage (example: for UDFs and libs) and are ideal to put in default arguments.
· gcp_conn_id (str) – The connection ID to use connecting to Google Cloud Platform.
· delegate_to (str) – The account to impersonate, if any. For this to work, the service account making the request must have domain-wide delegation enabled.
· region (str) – The specified region where the dataproc cluster is created.
· job_error_states (list) – Job states that should be considered error states. Any states in this list will result in an error being raised and failure of the task. Eg, if the CANCELLED state should also be considered a task failure, pass in ['ERROR', 'CANCELLED']. Possible values are currently only 'ERROR' and 'CANCELLED', but could change in the future. Defaults to ['ERROR'].
Variables
dataproc_job_id (str) – The actual "jobId" as submitted to the Dataproc API. This is useful for identifying or linking to the job in the Google Cloud Console Dataproc UI, as the actual "jobId" submitted to the Dataproc API is appended with an 8 character random string.
DataProcPySparkOperator
class airflow.contrib.operators.dataproc_operator.DataProcPySparkOperator(main, arguments=None, archives=None, pyfiles=None, files=None, job_name='{{task.task_id}}_{{ds_nodash}}', cluster_name='cluster-1', dataproc_pyspark_properties=None, dataproc_pyspark_jars=None, gcp_conn_id='google_cloud_default', delegate_to=None, region='global', job_error_states=['ERROR'], *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Start a PySpark Job on a Cloud DataProc cluster.
Parameters
· main (str) – [Required] The Hadoop Compatible Filesystem (HCFS) URI of the main Python file to use as the driver. Must be a .py file.
· arguments (list) – Arguments for the job. (templated)
· archives (list) – List of archived files that will be unpacked in the work directory. Should be stored in Cloud Storage.
· files (list) – List of files to be copied to the working directory.
· pyfiles (list) – List of Python files to pass to the PySpark framework. Supported file types: .py, .egg, and .zip.
· job_name (str) – The job name used in the DataProc cluster. This name by default is the task_id appended with the execution data, but can be templated. The name will always be appended with a random number to avoid name clashes. (templated)
· cluster_name (str) – The name of the DataProc cluster.
· dataproc_pyspark_properties (dict) – Map for the PySpark properties. Ideal to put in default arguments.
· dataproc_pyspark_jars (list) – URIs to jars provisioned in Cloud Storage (example: for UDFs and libs) and are ideal to put in default arguments.
· gcp_conn_id (str) – The connection ID to use connecting to Google Cloud Platform.
· delegate_to (str) – The account to impersonate, if any. For this to work, the service account making the request must have domain-wide delegation enabled.
· region (str) – The specified region where the dataproc cluster is created.
· job_error_states (list) – Job states that should be considered error states. Any states in this list will result in an error being raised and failure of the task. Eg, if the CANCELLED state should also be considered a task failure, pass in ['ERROR', 'CANCELLED']. Possible values are currently only 'ERROR' and 'CANCELLED', but could change in the future. Defaults to ['ERROR'].
Variables
dataproc_job_id (str) – The actual "jobId" as submitted to the Dataproc API. This is useful for identifying or linking to the job in the Google Cloud Console Dataproc UI, as the actual "jobId" submitted to the Dataproc API is appended with an 8 character random string.
DataprocWorkflowTemplateInstantiateOperator
class airflow.contrib.operators.dataproc_operator.DataprocWorkflowTemplateInstantiateOperator(template_id, *args, **kwargs)[source]
Bases: airflow.contrib.operators.dataproc_operator.DataprocWorkflowTemplateBaseOperator
Instantiate a WorkflowTemplate on Google Cloud Dataproc. The operator will wait until the WorkflowTemplate is finished executing.
See also
Please refer to: https://cloud.google.com/dataproc/docs/reference/rest/v1beta2/projects.regions.workflowTemplates/instantiate
Parameters
· template_id (str) – The id of the template. (templated)
· project_id (str) – The ID of the google cloud project in which the template runs.
· region (str) – leave as 'global', might become relevant in the future.
· gcp_conn_id (str) – The connection ID to use connecting to Google Cloud Platform.
· delegate_to (str) – The account to impersonate, if any. For this to work, the service account making the request must have domain-wide delegation enabled.
DataprocWorkflowTemplateInstantiateInlineOperator
class airflow.contrib.operators.dataproc_operator.DataprocWorkflowTemplateInstantiateInlineOperator(template, *args, **kwargs)[source]
Bases: airflow.contrib.operators.dataproc_operator.DataprocWorkflowTemplateBaseOperator
Instantiate a WorkflowTemplate Inline on Google Cloud Dataproc. The operator will wait until the WorkflowTemplate is finished executing.
See also
Please refer to: https://cloud.google.com/dataproc/docs/reference/rest/v1beta2/projects.regions.workflowTemplates/instantiateInline
Parameters
· template (map) – The template contents. (templated)
· project_id (str) – The ID of the google cloud project in which the template runs.
· region (str) – leave as 'global', might become relevant in the future.
· gcp_conn_id (str) – The connection ID to use connecting to Google Cloud Platform.
· delegate_to (str) – The account to impersonate, if any. For this to work, the service account making the request must have domain-wide delegation enabled.
Cloud Datastore
Datastore Operators
· DatastoreExportOperator : Export entities from Google Cloud Datastore to Cloud Storage.
· DatastoreImportOperator : Import entities from Cloud Storage to Google Cloud Datastore.
DatastoreExportOperator
class airflow.contrib.operators.datastore_export_operator.DatastoreExportOperator(bucket, namespace=None, datastore_conn_id='google_cloud_default', cloud_storage_conn_id='google_cloud_default', delegate_to=None, entity_filter=None, labels=None, polling_interval_in_seconds=10, overwrite_existing=False, xcom_push=False, *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Export entities from Google Cloud Datastore to Cloud Storage.
Parameters
· bucket (str) – name of the cloud storage bucket to backup data.
· namespace (str) – optional namespace path in the specified Cloud Storage bucket to backup data. If this namespace does not exist in GCS, it will be created.
· datastore_conn_id (str) – the name of the Datastore connection id to use.
· cloud_storage_conn_id (str) – the name of the cloud storage connection id to force-write backup.
· delegate_to (str) – The account to impersonate, if any. For this to work, the service account making the request must have domain-wide delegation enabled.
· entity_filter (dict) – description of what data from the project is included in the export, refer to https://cloud.google.com/datastore/docs/reference/rest/Shared.Types/EntityFilter
· labels (dict) – client-assigned labels for cloud storage.
· polling_interval_in_seconds (int) – number of seconds to wait before polling for execution status again.
· overwrite_existing (bool) – if the storage bucket + namespace is not empty, it will be emptied prior to exports. This enables overwriting existing backups.
· xcom_push (bool) – push operation name to xcom for reference.
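An illustrative entity_filter, in the shape described by the EntityFilter reference linked above: an empty dict (or None) exports everything, while listing kinds and namespace ids narrows the export. The kind names below are hypothetical:

```python
# Sketch: an entity_filter selecting two kinds in the default namespace.
# Field names follow the Datastore EntityFilter reference.
entity_filter = {
    'kinds': ['Customer', 'Order'],  # hypothetical kind names
    'namespaceIds': [''],            # '' selects the default namespace
}
```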
DatastoreImportOperator
class airflow.contrib.operators.datastore_import_operator.DatastoreImportOperator(bucket, file, namespace=None, entity_filter=None, labels=None, datastore_conn_id='google_cloud_default', delegate_to=None, polling_interval_in_seconds=10, xcom_push=False, *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Import entities from Cloud Storage to Google Cloud Datastore.
Parameters
· bucket (str) – container in Cloud Storage to store data.
· file (str) – path of the backup metadata file in the specified Cloud Storage bucket. It should have the extension .overall_export_metadata.
· namespace (str) – optional namespace of the backup metadata file in the specified Cloud Storage bucket.
· entity_filter (dict) – description of what data from the project is included in the export, refer to https://cloud.google.com/datastore/docs/reference/rest/Shared.Types/EntityFilter
· labels (dict) – client-assigned labels for cloud storage.
· datastore_conn_id (str) – the name of the connection id to use.
· delegate_to (str) – The account to impersonate, if any. For this to work, the service account making the request must have domain-wide delegation enabled.
· polling_interval_in_seconds (int) – number of seconds to wait before polling for execution status again.
· xcom_push (bool) – push operation name to xcom for reference.
DatastoreHook
class airflow.contrib.hooks.datastore_hook.DatastoreHook(datastore_conn_id='google_cloud_datastore_default', delegate_to=None)[source]
Bases: airflow.contrib.hooks.gcp_api_base_hook.GoogleCloudBaseHook
Interact with Google Cloud Datastore. This hook uses the Google Cloud Platform connection.
This object is not thread safe. If you want to make multiple requests simultaneously, you will need to create a hook per thread.
allocate_ids(partialKeys)[source]
Allocate IDs for incomplete keys, see https://cloud.google.com/datastore/docs/reference/rest/v1/projects/allocateIds
Parameters
partialKeys – a list of partial keys
Returns
a list of full keys
begin_transaction()[source]
Get a new transaction handle.
See also
https://cloud.google.com/datastore/docs/reference/rest/v1/projects/beginTransaction
Returns
a transaction handle
commit(body)[source]
Commit a transaction, optionally creating, deleting or modifying some entities.
See also
https://cloud.google.com/datastore/docs/reference/rest/v1/projects/commit
Parameters
body – the body of the commit request
Returns
the response body of the commit request
delete_operation(name)[source]
Deletes the long-running operation.
Parameters
name – the name of the operation resource
export_to_storage_bucket(bucket, namespace=None, entity_filter=None, labels=None)[source]
Export entities from Cloud Datastore to Cloud Storage for backup.
get_conn(version='v1')[source]
Returns a Google Cloud Datastore service object.
get_operation(name)[source]
Gets the latest state of a long-running operation.
Parameters
name – the name of the operation resource
import_from_storage_bucket(bucket, file, namespace=None, entity_filter=None, labels=None)[source]
Import a backup from Cloud Storage to Cloud Datastore.
lookup(keys, read_consistency=None, transaction=None)[source]
Lookup some entities by key.
See also
https://cloud.google.com/datastore/docs/reference/rest/v1/projects/lookup
Parameters
· keys – the keys to lookup
· read_consistency – the read consistency to use. default, strong or eventual. Cannot be used with a transaction.
· transaction – the transaction to use, if any.
Returns
the response body of the lookup request
poll_operation_until_done(name, polling_interval_in_seconds)[source]
Poll backup operation state until it's completed.
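The polling pattern this method describes can be sketched as follows; `get_operation` here is a stand-in for the hook method of the same name, and the loop is illustrative rather than the hook's actual code:

```python
import time

# Sketch: repeatedly fetch a long-running operation and sleep between
# polls until the operation reports done, then return its final state.
def poll_until_done(get_operation, name, polling_interval_in_seconds):
    while True:
        result = get_operation(name)
        if result.get('done'):
            return result
        time.sleep(polling_interval_in_seconds)
```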
rollback(transaction)[source]
Roll back a transaction.
See also
https://cloud.google.com/datastore/docs/reference/rest/v1/projects/rollback
Parameters
transaction – the transaction to roll back
run_query(body)[source]
Run a query for entities.
See also
https://cloud.google.com/datastore/docs/reference/rest/v1/projects/runQuery
Parameters
body – the body of the query request
Returns
the batch of query results
Cloud ML Engine
Cloud ML Engine Operators
· MLEngineBatchPredictionOperator : Start a Cloud ML Engine batch prediction job.
· MLEngineModelOperator : Manages a Cloud ML Engine model.
· MLEngineTrainingOperator : Start a Cloud ML Engine training job.
· MLEngineVersionOperator : Manages a Cloud ML Engine model version.
MLEngineBatchPredictionOperator
class airflow.contrib.operators.mlengine_operator.MLEngineBatchPredictionOperator(project_id, job_id, region, data_format, input_paths, output_path, model_name=None, version_name=None, uri=None, max_worker_count=None, runtime_version=None, gcp_conn_id='google_cloud_default', delegate_to=None, *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Start a Google Cloud ML Engine prediction job.
NOTE: For model origin, users should consider exactly one from the three options below: 1. Populate 'uri' field only, which should be a GCS location that points to a tensorflow savedModel directory. 2. Populate 'model_name' field only, which refers to an existing model, and the default version of the model will be used. 3. Populate both 'model_name' and 'version_name' fields, which refers to a specific version of a specific model.
In options 2 and 3, both model and version name should contain the minimal identifier. For instance, call
MLEngineBatchPredictionOperator(
    ...,
    model_name='my_model',
    version_name='my_version',
    ...)
if the desired model version is projects/my_project/models/my_model/versions/my_version.
See https://cloud.google.com/ml-engine/reference/rest/v1/projects.jobs for further documentation on the parameters.
Parameters
· project_id (str) – The Google Cloud project name where the prediction job is submitted. (templated)
· job_id (str) – A unique id for the prediction job on Google Cloud ML Engine. (templated)
· data_format (str) – The format of the input data. It will default to 'DATA_FORMAT_UNSPECIFIED' if is not provided or is not one of ["TEXT", "TF_RECORD", "TF_RECORD_GZIP"].
· input_paths (list of string) – A list of GCS paths of input data for batch prediction. Accepting wildcard operator *, but only at the end. (templated)
· output_path (str) – The GCS path where the prediction results are written to. (templated)
· region (str) – The Google Compute Engine region to run the prediction job in. (templated)
· model_name (str) – The Google Cloud ML Engine model to use for prediction. If version_name is not provided, the default version of this model will be used. Should not be None if version_name is provided. Should be None if uri is provided. (templated)
· version_name (str) – The Google Cloud ML Engine model version to use for prediction. Should be None if uri is provided. (templated)
· uri (str) – The GCS path of the saved model to use for prediction. Should be None if model_name is provided. It should be a GCS path pointing to a tensorflow SavedModel. (templated)
· max_worker_count (int) – The maximum number of workers to be used for parallel processing. Defaults to 10 if not specified.
· runtime_version (str) – The Google Cloud ML Engine runtime version to use for batch prediction.
· gcp_conn_id (str) – The connection ID used for connection to Google Cloud Platform.
· delegate_to (str) – The account to impersonate, if any. For this to work, the service account making the request must have domain-wide delegation enabled.
Raises
ValueError: if a unique model/version origin cannot be determined.
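The model-origin rule stated in the note above (exactly one of uri, model_name, or model_name plus version_name) can be sketched as follows; this is an illustration of the validation, not the operator's actual code:

```python
# Sketch: resolve the three mutually exclusive model-origin options into a
# single resource path, raising ValueError when the origin is ambiguous.
def resolve_model_origin(project_id, uri=None, model_name=None,
                         version_name=None):
    if uri and (model_name or version_name):
        raise ValueError('use uri OR model_name/version_name, not both')
    if uri:
        return uri  # option 1: GCS SavedModel directory
    if model_name and version_name:
        # option 3: a specific version of a specific model
        return 'projects/{}/models/{}/versions/{}'.format(
            project_id, model_name, version_name)
    if model_name:
        # option 2: default version of an existing model
        return 'projects/{}/models/{}'.format(project_id, model_name)
    raise ValueError('a unique model/version origin cannot be determined')
```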
MLEngineModelOperator
class airflow.contrib.operators.mlengine_operator.MLEngineModelOperator(project_id, model, operation='create', gcp_conn_id='google_cloud_default', delegate_to=None, *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Operator for managing a Google Cloud ML Engine model.
Parameters
· project_id (str) – The Google Cloud project name to which MLEngine model belongs. (templated)
· model (dict) –
A dictionary containing the information about the model. If the operation is create, then the model parameter should contain all the information about this model such as name.
If the operation is get, the model parameter should contain the name of the model.
· operation (str) –
The operation to perform. Available operations are:
o create: Creates a new model as provided by the model parameter.
o get: Gets a particular model where the name is specified in model.
· gcp_conn_id (str) – The connection ID to use when fetching connection info.
· delegate_to (str) – The account to impersonate, if any. For this to work, the service account making the request must have domain-wide delegation enabled.
MLEngineTrainingOperator
class airflow.contrib.operators.mlengine_operator.MLEngineTrainingOperator(project_id, job_id, package_uris, training_python_module, training_args, region, scale_tier=None, runtime_version=None, python_version=None, job_dir=None, gcp_conn_id='google_cloud_default', delegate_to=None, mode='PRODUCTION', *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Operator for launching a MLEngine training job.
Parameters
· project_id (str) – The Google Cloud project name within which MLEngine training job should run. (templated)
· job_id (str) – A unique templated id for the submitted Google MLEngine training job. (templated)
· package_uris (str) – A list of package locations for MLEngine training job, which should include the main training program + any additional dependencies. (templated)
· training_python_module (str) – The Python module name to run within MLEngine training job after installing 'package_uris' packages. (templated)
· training_args (str) – A list of templated command line arguments to pass to the MLEngine training program. (templated)
· region (str) – The Google Compute Engine region to run the MLEngine training job in. (templated)
· scale_tier (str) – Resource tier for MLEngine training job. (templated)
· runtime_version (str) – The Google Cloud ML runtime version to use for training. (templated)
· python_version (str) – The version of Python used in training. (templated)
· job_dir (str) – A Google Cloud Storage path in which to store training outputs and other data needed for training. (templated)
· gcp_conn_id (str) – The connection ID to use when fetching connection info.
· delegate_to (str) – The account to impersonate, if any. For this to work, the service account making the request must have domain-wide delegation enabled.
· mode (str) – Can be one of 'DRY_RUN'/'CLOUD'. In 'DRY_RUN' mode, no real training job will be launched, but the MLEngine training job request will be printed out. In 'CLOUD' mode, a real MLEngine training job creation request will be issued.
MLEngineVersionOperator
class airflow.contrib.operators.mlengine_operator.MLEngineVersionOperator(project_id, model_name, version_name=None, version=None, operation='create', gcp_conn_id='google_cloud_default', delegate_to=None, *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Operator for managing a Google Cloud ML Engine version.
Parameters
· project_id (str) – The Google Cloud project name to which the MLEngine model belongs.
· model_name (str) – The name of the Google Cloud ML Engine model that the version belongs to. (templated)
· version_name (str) – A name to use for the version being operated upon. If not None and the version argument is None or does not have a value for the name key, then this will be populated in the payload for the name key. (templated)
· version (dict) – A dictionary containing the information about the version. If the operation is create, version should contain all the information about this version, such as name and deploymentUrl. If the operation is get or delete, the version parameter should contain the name of the version. If it is None, the only operation possible would be list. (templated)
· operation (str) –
The operation to perform. Available operations are:
o create: Creates a new version in the model specified by model_name, in which case the version parameter should contain all the information to create that version (e.g. name, deploymentUrl).
o get: Gets full information of a particular version in the model specified by model_name. The name of the version should be specified in the version parameter.
o list: Lists all available versions of the model specified by model_name.
o delete: Deletes the version specified in the version parameter from the model specified by model_name. The name of the version should be specified in the version parameter.
· gcp_conn_id (str) – The connection ID to use when fetching connection info.
· delegate_to (str) – The account to impersonate, if any. For this to work, the service account making the request must have domain-wide delegation enabled.
Cloud ML Engine Hook
MLEngineHook
class airflow.contrib.hooks.gcp_mlengine_hook.MLEngineHook(gcp_conn_id='google_cloud_default', delegate_to=None)[source]
Bases: airflow.contrib.hooks.gcp_api_base_hook.GoogleCloudBaseHook
create_job(project_id, job, use_existing_job_fn=None)[source]
Launches a MLEngine job and waits for it to reach a terminal state.
Parameters
· project_id (str) – The Google Cloud project id within which the MLEngine job will be launched.
· job (dict) –
MLEngine Job object that should be provided to the MLEngine API, such as:
{
    'jobId': 'my_job_id',
    'trainingInput': {
        'scaleTier': 'STANDARD_1',
        ...
    }
}
· use_existing_job_fn (function) – In case that a MLEngine job with the same job_id already exists, this method (if provided) will decide whether we should use this existing job, continue waiting for it to finish, and return the job object. It should accept a MLEngine job object and return a boolean value indicating whether it is OK to reuse the existing job. If 'use_existing_job_fn' is not provided, we by default reuse the existing MLEngine job.
Returns
The MLEngine job object if the job successfully reached a terminal state (which might be FAILED or CANCELLED state).
Return type
dict
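As a sketch of what a use_existing_job_fn callback could look like, the helper below builds one that reuses an existing job only when its trainingInput matches the request about to be submitted. This is plain Python; the reuse policy and the helper name are invented for illustration, not Airflow API:

```python
def make_reuse_fn(expected_training_input):
    """Build a use_existing_job_fn callback (hypothetical policy):
    reuse the existing MLEngine job only when its trainingInput
    matches the job we were about to submit."""
    def use_existing_job_fn(existing_job):
        # `existing_job` is the MLEngine Job dict returned by the API.
        return existing_job.get('trainingInput') == expected_training_input
    return use_existing_job_fn
```

The resulting callable would be passed as create_job's use_existing_job_fn argument.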
create_model(project_id, model)[source]
Create a Model. Blocks until finished.
create_version(project_id, model_name, version_spec)[source]
Creates the Version on Google Cloud ML Engine.
Returns the operation if the version was created successfully, and raises an error otherwise.
delete_version(project_id, model_name, version_name)[source]
Deletes the given version of a model. Blocks until finished.
get_conn()[source]
Returns a Google MLEngine service object.
get_model(project_id, model_name)[source]
Gets a Model. Blocks until finished.
list_versions(project_id, model_name)[source]
Lists all available versions of a model. Blocks until finished.
set_default_version(project_id, model_name, version_name)[source]
Sets a version to be the default. Blocks until finished.
Cloud Storage
Storage Operators
· FileToGoogleCloudStorageOperator : Uploads a file to Google Cloud Storage.
· GoogleCloudStorageCreateBucketOperator : Creates a new cloud storage bucket.
· GoogleCloudStorageListOperator : List all objects from the bucket with the given string prefix and delimiter in name.
· GoogleCloudStorageDownloadOperator : Downloads a file from Google Cloud Storage.
· GoogleCloudStorageToBigQueryOperator : Loads files from Google Cloud Storage into BigQuery.
· GoogleCloudStorageToGoogleCloudStorageOperator : Copies objects from a bucket to another, with renaming if requested.
FileToGoogleCloudStorageOperator
class airflow.contrib.operators.file_to_gcs.FileToGoogleCloudStorageOperator(src, dst, bucket, google_cloud_storage_conn_id='google_cloud_default', mime_type='application/octet-stream', delegate_to=None, gzip=False, *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Uploads a file to Google Cloud Storage. Optionally can compress the file for upload.
Parameters
· src (str) – Path to the local file. (templated)
· dst (str) – Destination path within the specified bucket. (templated)
· bucket (str) – The bucket to upload to. (templated)
· google_cloud_storage_conn_id (str) – The Airflow connection ID to upload with.
· mime_type (str) – The mime-type string.
· delegate_to (str) – The account to impersonate, if any.
· gzip (bool) – Allows for the file to be compressed and uploaded as gzip.
execute(context)[source]
Uploads the file to Google Cloud Storage.
GoogleCloudStorageCreateBucketOperator
class airflow.contrib.operators.gcs_operator.GoogleCloudStorageCreateBucketOperator(bucket_name, storage_class='MULTI_REGIONAL', location='US', project_id=None, labels=None, google_cloud_storage_conn_id='google_cloud_default', delegate_to=None, *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Creates a new bucket. Google Cloud Storage uses a flat namespace, so you can’t create a bucket with a name that is already in use.
See also
For more information, see Bucket Naming Guidelines: https://cloud.google.com/storage/docs/bucketnaming.html#requirements
Parameters
· bucket_name (str) – The name of the bucket. (templated)
· storage_class (str) –
This defines how objects in the bucket are stored and determines the SLA and the cost of storage (templated). Values include:
o MULTI_REGIONAL
o REGIONAL
o STANDARD
o NEARLINE
o COLDLINE
If this value is not specified when the bucket is created, it will default to STANDARD.
· location (str) –
The location of the bucket (templated). Object data for objects in the bucket resides in physical storage within this region. Defaults to US.
See also
https://developers.google.com/storage/docs/bucket-locations
· project_id (str) – The ID of the GCP Project. (templated)
· labels (dict) – User-provided labels, in key/value pairs.
· google_cloud_storage_conn_id (str) – The connection ID to use when connecting to Google Cloud Storage.
· delegate_to (str) – The account to impersonate, if any. For this to work, the service account making the request must have domain-wide delegation enabled.
Example
The following Operator would create a new bucket test-bucket with MULTI_REGIONAL storage class in EU region:
CreateBucket = GoogleCloudStorageCreateBucketOperator(
    task_id='CreateNewBucket',
    bucket_name='test-bucket',
    storage_class='MULTI_REGIONAL',
    location='EU',
    labels={'env': 'dev', 'team': 'airflow'},
    google_cloud_storage_conn_id='airflow-service-account'
)
GoogleCloudStorageDownloadOperator
class airflow.contrib.operators.gcs_download_operator.GoogleCloudStorageDownloadOperator(bucket, object, filename=None, store_to_xcom_key=None, google_cloud_storage_conn_id='google_cloud_default', delegate_to=None, *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Downloads a file from Google Cloud Storage.
Parameters
· bucket (str) – The Google Cloud Storage bucket where the object is. (templated)
· object (str) – The name of the object to download in the Google Cloud Storage bucket. (templated)
· filename (str) – The file path, on the local file system (where the operator is being executed), that the file should be downloaded to. (templated) If no filename passed, the downloaded data will not be stored on the local file system.
· store_to_xcom_key (str) – If this param is set, the operator will push the contents of the downloaded file to XCom with the key set in this parameter. If not set, the downloaded data will not be pushed to XCom. (templated)
· google_cloud_storage_conn_id (str) – The connection ID to use when connecting to Google Cloud Storage.
· delegate_to (str) – The account to impersonate, if any. For this to work, the service account making the request must have domain-wide delegation enabled.
GoogleCloudStorageListOperator
class airflow.contrib.operators.gcs_list_operator.GoogleCloudStorageListOperator(bucket, prefix=None, delimiter=None, google_cloud_storage_conn_id='google_cloud_default', delegate_to=None, *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
List all objects from the bucket with the given string prefix and delimiter in name.
This operator returns a python list with the name of objects, which can be used by xcom in the downstream task.
Parameters
· bucket (str) – The Google Cloud Storage bucket to find the objects. (templated)
· prefix (str) – Prefix string which filters objects whose name begin with this prefix. (templated)
· delimiter (str) – The delimiter by which you want to filter the objects. (templated) For e.g. to list the CSV files from in a directory in GCS you would use delimiter='.csv'.
· google_cloud_storage_conn_id (str) – The connection ID to use when connecting to Google Cloud Storage.
· delegate_to (str) – The account to impersonate, if any. For this to work, the service account making the request must have domain-wide delegation enabled.
Example
The following Operator would list all the Avro files from sales/sales-2017 folder in data bucket:
GCS_Files = GoogleCloudStorageListOperator(
    task_id='GCS_Files',
    bucket='data',
    prefix='sales/sales-2017/',
    delimiter='.avro',
    google_cloud_storage_conn_id=google_cloud_conn_id
)
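The documented prefix/delimiter filtering can be approximated locally. The sketch below (plain Python, not the GCS API or the operator itself) keeps names that start with the prefix and end with the delimiter, matching the '.avro'/'.csv' examples in the docstring:

```python
def filter_objects(names, prefix=None, delimiter=None):
    # Local approximation of the documented filtering: `prefix` narrows
    # by leading path, `delimiter` (e.g. '.avro') by trailing suffix.
    selected = []
    for name in names:
        if prefix is not None and not name.startswith(prefix):
            continue
        if delimiter is not None and not name.endswith(delimiter):
            continue
        selected.append(name)
    return selected
```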
GoogleCloudStorageToBigQueryOperator
class airflow.contrib.operators.gcs_to_bq.GoogleCloudStorageToBigQueryOperator(bucket, source_objects, destination_project_dataset_table, schema_fields=None, schema_object=None, source_format='CSV', compression='NONE', create_disposition='CREATE_IF_NEEDED', skip_leading_rows=0, write_disposition='WRITE_EMPTY', field_delimiter=',', max_bad_records=0, quote_character=None, ignore_unknown_values=False, allow_quoted_newlines=False, allow_jagged_rows=False, max_id_key=None, bigquery_conn_id='bigquery_default', google_cloud_storage_conn_id='google_cloud_default', delegate_to=None, schema_update_options=(), src_fmt_configs=None, external_table=False, time_partitioning=None, cluster_fields=None, autodetect=False, *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Loads files from Google Cloud Storage into BigQuery.
The schema to be used for the BigQuery table may be specified in one of two ways. You may either directly pass the schema fields in, or you may point the operator to a Google Cloud Storage object name. The object in Google Cloud Storage must be a JSON file with the schema fields in it.
Parameters
· bucket (str) – The bucket to load from. (templated)
· source_objects (list of str) – List of Google Cloud Storage URIs to load from. (templated) If source_format is 'DATASTORE_BACKUP', the list must only contain a single URI.
· destination_project_dataset_table (str) – The dotted (<project>.)<dataset>.<table> BigQuery table to load data into. If <project> is not included, project will be the project defined in the connection json. (templated)
· schema_fields (list) – If set, the schema field list as defined here: https://cloud.google.com/bigquery/docs/reference/v2/jobs#configuration.load Should not be set when source_format is 'DATASTORE_BACKUP'.
· schema_object (str) – If set, a GCS object path pointing to a .json file that contains the schema for the table. (templated)
· source_format (str) – File format to export.
· compression (str) – [Optional] The compression type of the data source. Possible values include GZIP and NONE. The default value is NONE. This setting is ignored for Google Cloud Bigtable, Google Cloud Datastore backups and Avro formats.
· create_disposition (str) – The create disposition if the table doesn’t exist.
· skip_leading_rows (int) – Number of rows to skip when loading from a CSV.
· write_disposition (str) – The write disposition if the table already exists.
· field_delimiter (str) – The delimiter to use when loading from a CSV.
· max_bad_records (int) – The maximum number of bad records that BigQuery can ignore when running the job.
· quote_character (str) – The value that is used to quote data sections in a CSV file.
· ignore_unknown_values (bool) – [Optional] Indicates if BigQuery should allow extra values that are not represented in the table schema. If true, the extra values are ignored. If false, records with extra columns are treated as bad records, and if there are too many bad records, an invalid error is returned in the job result.
· allow_quoted_newlines (bool) – Whether to allow quoted newlines (true) or not (false).
· allow_jagged_rows (bool) – Accept rows that are missing trailing optional columns. The missing values are treated as nulls. If false, records with missing trailing columns are treated as bad records, and if there are too many bad records, an invalid error is returned in the job result. Only applicable to CSV; ignored for other formats.
· max_id_key (str) – If set, the name of a column in the BigQuery table that’s to be loaded. This will be used to select the MAX value from BigQuery after the load occurs. The results will be returned by the execute() command, which in turn gets stored in XCom for future operators to use. This can be helpful with incremental loads – during future executions, you can pick up from the max ID.
· bigquery_conn_id (str) – Reference to a specific BigQuery hook.
· google_cloud_storage_conn_id (str) – Reference to a specific Google Cloud Storage hook.
· delegate_to (str) – The account to impersonate, if any. For this to work, the service account making the request must have domain-wide delegation enabled.
· schema_update_options (list) – Allows the schema of the destination table to be updated as a side effect of the load job.
· src_fmt_configs (dict) – configure optional fields specific to the source format
· external_table (bool) – Flag to specify if the destination table should be a BigQuery external table. Default Value is False.
· time_partitioning (dict) – configure optional time partitioning fields, i.e. partition by field, type and expiration as per API specifications. Note that 'field' is not available in concurrency with dataset.table$partition.
· cluster_fields (list of str) – Request that the result of this load be stored sorted by one or more columns. This is only available in conjunction with time_partitioning. The order of columns given determines the sort order. Not applicable for external tables.
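The dotted table reference can be split into its parts with a small parser. This is a hypothetical sketch (plain Python, not Airflow code); it assumes the project part is optional and also accepts ':' as the legacy BigQuery project separator, with a fallback to the connection's default project:

```python
def split_table_reference(ref, default_project):
    """Hypothetical parser for a <project>.<dataset>.<table> reference
    (':' also accepted as the legacy project separator); the project
    falls back to the supplied default when omitted."""
    if ':' in ref:
        project, rest = ref.split(':', 1)
    else:
        parts = ref.split('.')
        if len(parts) == 3:
            project, rest = parts[0], parts[1] + '.' + parts[2]
        else:
            project, rest = default_project, ref
    dataset, table = rest.split('.')
    return project, dataset, table
```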
GoogleCloudStorageToGoogleCloudStorageOperator
class airflow.contrib.operators.gcs_to_gcs.GoogleCloudStorageToGoogleCloudStorageOperator(source_bucket, source_object, destination_bucket=None, destination_object=None, move_object=False, google_cloud_storage_conn_id='google_cloud_default', delegate_to=None, last_modified_time=None, *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Copies objects from a bucket to another, with renaming if requested.
Parameters
· source_bucket (str) – The source Google Cloud Storage bucket where the object is. (templated)
· source_object (str) – The source name of the object to copy in the Google Cloud Storage bucket. (templated) You can use only one wildcard for objects (filenames) within your bucket. The wildcard can appear inside the object name or at the end of the object name. Appending a wildcard to the bucket name is unsupported.
· destination_bucket (str) – The destination Google Cloud Storage bucket where the object should be. (templated)
· destination_object (str) – The destination name of the object in the destination Google Cloud Storage bucket. (templated) If a wildcard is supplied in the source_object argument, this is the prefix that will be prepended to the final destination objects’ paths. Note that the source path’s part before the wildcard will be removed; if it needs to be retained, it should be appended to destination_object. For example, with prefix foo/* and destination_object blah/, the file foo/baz will be copied to blah/baz; to retain the prefix, write the destination_object as e.g. blah/foo, in which case the copied file will be named blah/foo/baz.
· move_object (bool) – When move_object is True, the object is moved instead of copied to the new location. This is the equivalent of a mv command as opposed to a cp command.
· google_cloud_storage_conn_id (str) – The connection ID to use when connecting to Google Cloud Storage.
· delegate_to (str) – The account to impersonate, if any. For this to work, the service account making the request must have domain-wide delegation enabled.
· last_modified_time (datetime) – When specified, the object(s) will be copied/moved only if they were modified after last_modified_time. If tzinfo has not been set, UTC will be assumed.
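The wildcard renaming rule described for destination_object can be expressed as a few lines of string logic. A minimal sketch (plain Python, written for illustration; not the operator's actual implementation):

```python
def destination_name(source_object_pattern, matched_name, destination_object):
    # Hypothetical re-implementation of the documented renaming rule:
    # the part of the source path before the wildcard is dropped and
    # destination_object is prepended instead.
    prefix = source_object_pattern.split('*', 1)[0]
    return (destination_object or '') + matched_name[len(prefix):]
```

So with pattern foo/* and destination_object blah/, the matched file foo/baz maps to blah/baz; writing destination_object as blah/foo/ retains the original folder name.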
Examples
The following Operator would copy a single file named sales/sales-2017/january.avro in the data bucket to the file named copied_sales/2017/january-backup.avro in the data_backup bucket:
copy_single_file = GoogleCloudStorageToGoogleCloudStorageOperator(
    task_id='copy_single_file',
    source_bucket='data',
    source_object='sales/sales-2017/january.avro',
    destination_bucket='data_backup',
    destination_object='copied_sales/2017/january-backup.avro',
    google_cloud_storage_conn_id=google_cloud_conn_id
)
The following Operator would copy all the Avro files from sales/sales-2017 folder (i.e. with names starting with that prefix) in data bucket to the copied_sales/2017 folder in the data_backup bucket:
copy_files = GoogleCloudStorageToGoogleCloudStorageOperator(
    task_id='copy_files',
    source_bucket='data',
    source_object='sales/sales-2017/*.avro',
    destination_bucket='data_backup',
    destination_object='copied_sales/2017/',
    google_cloud_storage_conn_id=google_cloud_conn_id
)
The following Operator would move all the Avro files from sales/sales-2017 folder (i.e. with names starting with that prefix) in data bucket to the same folder in the data_backup bucket, deleting the original files in the process:
move_files = GoogleCloudStorageToGoogleCloudStorageOperator(
    task_id='move_files',
    source_bucket='data',
    source_object='sales/sales-2017/*.avro',
    destination_bucket='data_backup',
    move_object=True,
    google_cloud_storage_conn_id=google_cloud_conn_id
)
GoogleCloudStorageHook
class airflow.contrib.hooks.gcs_hook.GoogleCloudStorageHook(google_cloud_storage_conn_id='google_cloud_default', delegate_to=None)[source]
Bases: airflow.contrib.hooks.gcp_api_base_hook.GoogleCloudBaseHook
Interact with Google Cloud Storage. This hook uses the Google Cloud Platform connection.
copy(source_bucket, source_object, destination_bucket=None, destination_object=None)[source]
Copies an object from a bucket to another, with renaming if requested.
destination_bucket or destination_object can be omitted, in which case the source bucket/object is used, but not both.
Parameters
· source_bucket (str) – The bucket of the object to copy from.
· source_object (str) – The object to copy.
· destination_bucket (str) – The destination of the object to copied to. Can be omitted; then the same bucket is used.
· destination_object (str) – The (renamed) path of the object if given. Can be omitted; then the same name is used.
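The defaulting rule for copy's destination arguments can be sketched in a few lines. This is an illustrative helper, not the hook itself (the function name is invented):

```python
def resolve_copy_destination(source_bucket, source_object,
                             destination_bucket=None, destination_object=None):
    # Sketch of the documented defaulting: either destination part may be
    # omitted, in which case the source value is used, but omitting both
    # would be a same-place copy and is rejected.
    if destination_bucket is None and destination_object is None:
        raise ValueError('either destination_bucket or destination_object must be given')
    return (destination_bucket or source_bucket,
            destination_object or source_object)
```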
create_bucket(bucket_name, storage_class='MULTI_REGIONAL', location='US', project_id=None, labels=None)[source]
Creates a new bucket. Google Cloud Storage uses a flat namespace, so you can’t create a bucket with a name that is already in use.
See also
For more information, see Bucket Naming Guidelines: https://cloud.google.com/storage/docs/bucketnaming.html#requirements
Parameters
· bucket_name (str) – The name of the bucket.
· storage_class (str) –
This defines how objects in the bucket are stored and determines the SLA and the cost of storage. Values include:
o MULTI_REGIONAL
o REGIONAL
o STANDARD
o NEARLINE
o COLDLINE
If this value is not specified when the bucket is created, it will default to STANDARD.
· location (str) –
The location of the bucket. Object data for objects in the bucket resides in physical storage within this region. Defaults to US.
See also
https://developers.google.com/storage/docs/bucket-locations
· project_id (str) – The ID of the GCP Project.
· labels (dict) – User-provided labels, in key/value pairs.
Returns
If successful, it returns the id of the bucket.
delete(bucket, object, generation=None)[source]
Delete an object if versioning is not enabled for the bucket, or if the generation parameter is used.
Parameters
· bucket (str) – name of the bucket where the object resides
· object (str) – name of the object to delete
· generation (str) – if present, permanently delete the object of this generation
Returns
True if succeeded
download(bucket, object, filename=None)[source]
Get a file from Google Cloud Storage.
Parameters
· bucket (str) – The bucket to fetch from.
· object (str) – The object to fetch.
· filename (str) – If set, a local file path where the file should be written to.
exists(bucket, object)[source]
Checks for the existence of a file in Google Cloud Storage.
Parameters
· bucket (str) – The Google Cloud Storage bucket where the object is.
· object (str) – The name of the object to check in the Google Cloud Storage bucket.
get_conn()[source]
Returns a Google Cloud Storage service object.
get_crc32c(bucket, object)[source]
Gets the CRC32c checksum of an object in Google Cloud Storage.
Parameters
· bucket (str) – The Google Cloud Storage bucket where the object is.
· object (str) – The name of the object to check in the Google Cloud Storage bucket.
get_md5hash(bucket, object)[source]
Gets the MD5 hash of an object in Google Cloud Storage.
Parameters
· bucket (str) – The Google Cloud Storage bucket where the object is.
· object (str) – The name of the object to check in the Google Cloud Storage bucket.
get_size(bucket, object)[source]
Gets the size of a file in Google Cloud Storage.
Parameters
· bucket (str) – The Google Cloud Storage bucket where the object is.
· object (str) – The name of the object to check in the Google Cloud Storage bucket.
is_updated_after(bucket, object, ts)[source]
Checks if an object was updated in Google Cloud Storage after a given timestamp.
Parameters
· bucket (str) – The Google Cloud Storage bucket where the object is.
· object (str) – The name of the object to check in the Google Cloud Storage bucket.
· ts (datetime) – The timestamp to check against.
list(bucket, versions=None, maxResults=None, prefix=None, delimiter=None)[source]
List all objects from the bucket with the given string prefix in name.
Parameters
· bucket (str) – bucket name
· versions (bool) – if true, list all versions of the objects
· maxResults (int) – max count of items to return in a single page of responses
· prefix (str) – prefix string which filters objects whose name begin with this prefix
· delimiter (str) – filters objects based on the delimiter (for e.g. '.csv')
Returns
a stream of object names matching the filtering criteria
rewrite(source_bucket, source_object, destination_bucket, destination_object=None)[source]
Has the same functionality as copy, except that it will work on files over 5 TB, as well as when copying between locations and/or storage classes.
destination_object can be omitted, in which case source_object is used.
Parameters
· source_bucket (str) – The bucket of the object to copy from.
· source_object (str) – The object to copy.
· destination_bucket (str) – The destination of the object to copied to.
· destination_object (str) – The (renamed) path of the object if given. Can be omitted; then the same name is used.
upload(bucket, object, filename, mime_type='application/octet-stream', gzip=False)[source]
Uploads a local file to Google Cloud Storage.
Parameters
· bucket (str) – The bucket to upload to.
· object (str) – The object name to set when uploading the local file.
· filename (str) – The local file path to the file to be uploaded.
· mime_type (str) – The MIME type to set when uploading the file.
· gzip (bool) – Option to compress the file for upload.
Google Kubernetes Engine
Google Kubernetes Engine Cluster Operators
· GKEClusterCreateOperator : Creates a Kubernetes Cluster in Google Cloud Platform.
· GKEClusterDeleteOperator : Deletes a Kubernetes Cluster in Google Cloud Platform.
GKEClusterCreateOperator
GKEClusterDeleteOperator
GKEPodOperator
Google Kubernetes Engine Hook
class airflow.contrib.hooks.gcp_container_hook.GKEClusterHook(project_id, location)[source]
Bases: airflow.hooks.base_hook.BaseHook
create_cluster(cluster, retry, timeout)[source]
Creates a cluster, consisting of the specified number and type of Google Compute Engine instances.
Parameters
· cluster (dict or google.cloud.container_v1.types.Cluster) – A Cluster protobuf or dict. If dict is provided, it must be of the same form as the protobuf message google.cloud.container_v1.types.Cluster.
· retry (google.api_core.retry.Retry) – A retry object (google.api_core.retry.Retry) used to retry requests. If None is specified, requests will not be retried.
· timeout (float) – The amount of time, in seconds, to wait for the request to complete. Note that if retry is specified, the timeout applies to each individual attempt.
Returns
The full url to the new, or existing, cluster.
Raises
ParseError: On JSON parsing problems when trying to convert dict. AirflowException: cluster is not dict type nor Cluster proto type.
delete_cluster(name, retry, timeout)[source]
Deletes the cluster, including the Kubernetes endpoint and all worker nodes. Firewalls and routes that were configured during cluster creation are also deleted. Other Google Compute Engine resources that might be in use by the cluster (e.g. load balancer resources) will not be deleted if they weren’t present at the initial create time.
Parameters
· name (str) – The name of the cluster to delete.
· retry (google.api_core.retry.Retry) – Retry object used to determine when/if to retry requests. If None is specified, requests will not be retried.
· timeout (float) – The amount of time, in seconds, to wait for the request to complete. Note that if retry is specified, the timeout applies to each individual attempt.
Returns
The full url to the delete operation if successful, else None.
get_cluster(name, retry, timeout)[source]
Gets details of specified cluster.
Parameters
· name (str) – The name of the cluster to retrieve.
· retry (google.api_core.retry.Retry) – A retry object used to retry requests. If None is specified, requests will not be retried.
· timeout (float) – The amount of time, in seconds, to wait for the request to complete. Note that if retry is specified, the timeout applies to each individual attempt.
Returns
A google.cloud.container_v1.types.Cluster instance.
get_operation(operation_name)[source]
Fetches the operation from Google Cloud.
Parameters
operation_name (str) – Name of operation to fetch.
Returns
The new, updated operation from Google Cloud.
wait_for_operation(operation)[source]
Given an operation, continuously fetches the status from Google Cloud until either completion or an error occurring.
Parameters
operation (A google.cloud.container_V1.gapic.enums.Operator) – The Operation to wait for.
Returns
A new, updated operation fetched from Google Cloud.
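The wait_for_operation behavior follows a familiar polling pattern: re-fetch the operation until its status is terminal. A hypothetical, dependency-free sketch of that pattern (the fetch callable stands in for get_operation; this is not the hook's actual code):

```python
import time

def wait_for_operation(fetch, operation_name, poll_interval=0.0, max_polls=60):
    # Hypothetical sketch of the polling pattern behind
    # GKEClusterHook.wait_for_operation: re-fetch the operation until its
    # status reaches a terminal value, or give up after max_polls tries.
    for _ in range(max_polls):
        op = fetch(operation_name)  # stand-in for get_operation()
        if op['status'] == 'DONE':
            return op
        time.sleep(poll_interval)
    raise RuntimeError('operation %s did not finish in time' % operation_name)
```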
Qubole
Apache Airflow has a native operator and hooks to talk to Qubole, which lets you submit your big data jobs directly to Qubole from Apache Airflow.
QuboleOperator
QubolePartitionSensor
QuboleFileSensor
Lineage
Note
Lineage support is very experimental and subject to change.
Airflow can help track origins of data, what happens to it, and where it moves over time. This can aid in having audit trails and data governance, but also in debugging of data flows.
Airflow tracks data by means of inlets and outlets of the tasks. Let’s work from an example and see how it works.
from airflow.operators.bash_operator import BashOperator
from airflow.operators.dummy_operator import DummyOperator
from airflow.lineage.datasets import File
from airflow.models import DAG
from datetime import timedelta

FILE_CATEGORIES = ["CAT1", "CAT2", "CAT3"]

args = {
    'owner': 'airflow',
    'start_date': airflow.utils.dates.days_ago(2)
}

dag = DAG(
    dag_id='example_lineage', default_args=args,
    schedule_interval='0 0 * * *',
    dagrun_timeout=timedelta(minutes=60))

f_final = File("/tmp/final")
run_this_last = DummyOperator(task_id='run_this_last', dag=dag,
    inlets={"auto": True},
    outlets={"datasets": [f_final]})

f_in = File("/tmp/whole_directory/")
outlets = []
for file in FILE_CATEGORIES:
    f_out = File("/tmp/{}/{{{{ execution_date }}}}".format(file))
    outlets.append(f_out)
run_this = BashOperator(
    task_id='run_me_first', bash_command='echo 1', dag=dag,
    inlets={"datasets": [f_in]},
    outlets={"datasets": outlets}
)
run_this.set_downstream(run_this_last)
Tasks take the parameters inlets and outlets. Inlets can be manually defined by a list of dataset {"datasets": [dataset1, dataset2]}, or can be configured to look for outlets from upstream tasks {"task_ids": [task_id1, task_id2]}, or can be configured to pick up outlets from direct upstream tasks {"auto": True}, or a combination of them. Outlets are defined as a list of dataset {"datasets": [dataset1, dataset2]}. Any fields for the dataset are templated with the context when the task is being executed.
Note
Operators can add inlets and outlets automatically if the operator supports it.
In the example DAG, task run_me_first is a BashOperator that takes 3 inlets: CAT1, CAT2, CAT3, that are generated from a list. Note that execution_date is a templated field and will be rendered when the task is running.
Note
Behind the scenes Airflow prepares the lineage metadata as part of the pre_execute method of a task. When the task has finished execution, post_execute is called and lineage metadata is pushed into XCOM. Thus, if you are creating your own operators that override this method, make sure to decorate your method with prepare_lineage and apply_lineage respectively.
Apache Atlas
Airflow can send its lineage metadata to Apache Atlas. You need to enable the atlas backend and configure it properly, e.g. in your airflow.cfg:
[lineage]
backend = airflow.lineage.backend.atlas

[atlas]
username = my_username
password = my_password
host = host
port = 21000
Please make sure to have the atlasclient package installed.
FAQ
Why isn’t my task getting scheduled?
There are very many reasons why your task might not be getting scheduled. Here are some of the common causes:
· Does your script “compile”, can the Airflow engine parse it and find your DAG object? To test this, you can run airflow list_dags and confirm that your DAG shows up in the list. You can also run airflow list_tasks foo_dag_id --tree and confirm that your task shows up in the list as expected. If you use the CeleryExecutor, you may want to confirm that this works both where the scheduler runs as well as where the worker runs.
· Does the file containing your DAG contain the string “airflow” and “DAG” somewhere in the contents? When searching the DAG directory, Airflow ignores files not containing “airflow” and “DAG” in order to prevent the DagBag parsing from importing all python files collocated with user’s DAGs.
· Is your start_date set properly? The Airflow scheduler triggers the task soon after the start_date + schedule_interval is passed.
· Is your schedule_interval set properly? The default schedule_interval is one day (datetime.timedelta(1)). You must specify a different schedule_interval directly to the DAG object you instantiate, not as a default_param, as task instances do not override their parent DAG’s schedule_interval.
· Is your start_date beyond where you can see it in the UI? If you set your start_date to some time, say 3 months ago, you won’t be able to see it in the main view in the UI, but you should be able to see it in the Menu -> Browse -> Task Instances.
· Are the dependencies for the task met? The task instances directly upstream from the task need to be in a success state. Also, if you have set depends_on_past=True, the previous task instance needs to have succeeded (except if it is the first run for that task). Also, if wait_for_downstream=True, make sure you understand what it means. You can view how these properties are set from the Task Instance Details page for your task.
· Are the DagRuns you need created and active? A DagRun represents a specific execution of an entire DAG and has a state (running, success, failed, …). The scheduler creates new DagRuns as it moves forward, but never goes back in time to create new ones. The scheduler only evaluates running DagRuns to see what task instances it can trigger. Note that clearing task instances (from the UI or CLI) does set the state of a DagRun back to running. You can bulk view the list of DagRuns and alter states by clicking on the schedule tag for a DAG.
· Is the concurrency parameter of your DAG reached? concurrency defines how many running task instances a DAG is allowed to have, beyond which point things get queued.
· Is the max_active_runs parameter of your DAG reached? max_active_runs defines how many running concurrent instances of a DAG there are allowed to be.
You may also want to read the Scheduler section of the docs and make sure you fully understand how it proceeds.
How do I trigger tasks based on another task’s failure
Check out the Trigger Rule section in the Concepts section of the documentation
Why are connection passwords still not encrypted in the metadata db after I installed airflow[crypto]
Check out the Securing Connections section in the Howto Guides section of the documentation
What’s the deal with start_date
start_date is partly legacy from the pre-DagRun era, but it is still relevant in many ways. When creating a new DAG, you probably want to set a global start_date for your tasks using default_args. The first DagRun to be created will be based on the min(start_date) for all your tasks. From that point on, the scheduler creates new DagRuns based on your schedule_interval, and the corresponding task instances run as your dependencies are met. When introducing new tasks to your DAG, you need to pay special attention to start_date, and may want to reactivate inactive DagRuns to get the new task onboarded properly.
We recommend against using dynamic values as start_date, especially datetime.now(), as it can be quite confusing. The task is triggered once the period closes, and in theory an @hourly DAG would never get to an hour after now, as now() moves along.
Previously we also recommended using rounded start_date in relation to your schedule_interval. This meant an @hourly job would be at 00:00 minutes:seconds, a @daily job at midnight, a @monthly job on the first of the month. This is no longer required. Airflow will now auto align the start_date and the schedule_interval, by using the start_date as the moment to start looking.
You can use any sensor or a TimeDeltaSensor to delay the execution of tasks within the schedule interval. While schedule_interval does allow specifying a datetime.timedelta object, we recommend using the macros or cron expressions instead, as it enforces this idea of rounded schedules.
When using depends_on_past=True it’s important to pay special attention to start_date, as the past dependency is not enforced only on the specific schedule of the start_date specified for the task. It’s also important to watch DagRun activity status in time when introducing new depends_on_past=True, unless you are planning on running a backfill for the new task(s).
Also important to note is that the task’s start_date, in the context of a backfill CLI command, gets overridden by the backfill’s command start_date. This allows for a backfill on tasks that have depends_on_past=True to actually start; if that weren’t the case, the backfill just wouldn’t start.
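The period-closing rule above can be checked with plain datetime arithmetic. This is an illustrative sketch with hypothetical dates, not Airflow scheduler code: the run for a period is triggered only once the period has closed.

```python
from datetime import datetime, timedelta

# Hypothetical daily DAG starting 2021-01-01 (illustrative values).
start_date = datetime(2021, 1, 1)
schedule_interval = timedelta(days=1)

# execution_date of run n is start_date + n * schedule_interval;
# that run is triggered once its period closes, i.e. one interval later.
execution_dates = [start_date + n * schedule_interval for n in range(3)]
trigger_times = [ed + schedule_interval for ed in execution_dates]

print(execution_dates[0])  # 2021-01-01 00:00:00
print(trigger_times[0])    # 2021-01-02 00:00:00
```

This is also why a dynamic start_date such as datetime.now() is confusing: the first period never closes, because its end keeps moving forward with now().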
How can I create DAGs dynamically
Airflow looks in your DAGS_FOLDER for modules that contain DAG objects in their global namespace, and adds the objects it finds in the DagBag. Knowing this, all we need is a way to dynamically assign variables in the global namespace, which is easily done in python using the globals() function from the standard library, which behaves like a simple dictionary.
for i in range(10):
    dag_id = 'foo_{}'.format(i)
    globals()[dag_id] = DAG(dag_id)
    # or better, call a function that returns a DAG object
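A minimal runnable sketch of this pattern. FakeDAG is a stand-in for airflow.models.DAG so the snippet runs without Airflow installed; in a real DAGS_FOLDER module you would import and use DAG instead.

```python
# Stand-in for airflow.models.DAG (assumption: real code uses DAG).
class FakeDAG:
    def __init__(self, dag_id):
        self.dag_id = dag_id

def create_dag(dag_id):
    # In a real project, build the DAG and attach its tasks here.
    return FakeDAG(dag_id)

for i in range(10):
    dag_id = 'foo_{}'.format(i)
    # Module-level globals() acts like a dict, so assigning here makes
    # each DAG object discoverable by the DagBag parser.
    globals()[dag_id] = create_dag(dag_id)

print(foo_0.dag_id)  # foo_0
```

Using a factory function (create_dag here) rather than inlining the DAG body keeps each generated DAG's tasks from accidentally sharing state across loop iterations.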
What are all the airflow run commands in my process list
There are many layers of airflow run commands, meaning it can call itself.
· Basic airflow run: fires up an executor, and tells it to run an airflow run --local command. If using Celery, this means it puts a command in the queue for it to run remotely on the worker. If using LocalExecutor, that translates into running it in a subprocess pool.
· Local airflow run --local: starts an airflow run --raw command (described below) as a subprocess, and is in charge of emitting heartbeats, listening for external kill signals, and ensuring some cleanup takes place if the subprocess fails.
· Raw airflow run --raw runs the actual operator’s execute method and performs the actual work.
How can my airflow dag run faster
There are three variables we could control to improve airflow dag performance:
· parallelism: This variable controls the number of task instances that the airflow worker can run simultaneously. Users can increase the parallelism variable in the airflow.cfg.
· concurrency: The Airflow scheduler will run no more than concurrency task instances for your DAG at any given time. Concurrency is defined in your Airflow DAG. If you do not set the concurrency on your DAG, the scheduler will use the default value from the dag_concurrency entry in your airflow.cfg.
· max_active_runs: the Airflow scheduler will run no more than max_active_runs DagRuns of your DAG at a given time. If you do not set the max_active_runs in your DAG, the scheduler will use the default value from the max_active_runs_per_dag entry in your airflow.cfg.
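The three knobs above map to airflow.cfg entries; a sketch of the relevant fragment, with illustrative values only (tune them to your workers):

```ini
[core]
; upper bound on task instances running across the whole installation
parallelism = 32
; default per-DAG task concurrency when a DAG does not set its own
dag_concurrency = 16
; default cap on concurrent DagRuns per DAG
max_active_runs_per_dag = 16
```

concurrency and max_active_runs can also be set per DAG as constructor arguments, which overrides these defaults for that DAG only.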
How can we reduce the airflow UI page load time
If your dag takes a long time to load, you could reduce the value of the default_dag_run_display_number configuration in airflow.cfg to a smaller value. This configuration controls the number of dag runs to show in the UI, with a default value of 25.
How to fix Exception Global variable explicit_defaults_for_timestamp needs to be on (1)
This means explicit_defaults_for_timestamp is disabled in your mysql server and you need to enable it by:
1 Set explicit_defaults_for_timestamp = 1 under the mysqld section in your my.cnf file.
2 Restart the MySQL server.
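The corresponding my.cnf fragment looks like this (section name and variable are standard MySQL; file location varies by distribution):

```ini
[mysqld]
explicit_defaults_for_timestamp = 1
```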
How to reduce airflow dag scheduling latency in production
· max_threads: Scheduler will spawn multiple threads in parallel to schedule dags. This is controlled by max_threads, with a default value of 2. Users should increase this value to a larger one (e.g. the number of CPUs where the scheduler runs - 1) in production.
· scheduler_heartbeat_sec: Users should consider increasing the scheduler_heartbeat_sec config to a higher value (e.g. 60 secs), which controls how frequently the airflow scheduler gets the heartbeat and updates the job’s entry in the database.
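Both settings live in the [scheduler] section of airflow.cfg; a sketch with illustrative values (assuming a 4-CPU scheduler host):

```ini
[scheduler]
; e.g. number of CPUs on the scheduler host - 1
max_threads = 3
; less frequent heartbeats reduce scheduler/database load
scheduler_heartbeat_sec = 60
```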
API Reference
Operator
Operators allow for the generation of certain types of tasks that become nodes in the DAG when instantiated. All operators derive from BaseOperator and inherit many attributes and methods that way. Refer to the BaseOperator documentation for more details.
There are 3 main types of operators:
· Operators that perform an action, or tell another system to perform an action
· Transfer operators move data from one system to another
· Sensors are a certain type of operator that will keep running until a certain criterion is met. Examples include a specific file landing in HDFS or S3, a partition appearing in Hive, or a specific time of the day. Sensors are derived from BaseSensorOperator and run a poke method at a specified poke_interval until it returns True.
BaseOperator
All operators are derived from BaseOperator and acquire much functionality through inheritance. Since this is the core of the engine, it’s worth taking the time to understand the parameters of BaseOperator to understand the primitive features that can be leveraged in your DAGs.
class airflow.models.BaseOperator(task_id, owner='Airflow', email=None, email_on_retry=True, email_on_failure=True, retries=0, retry_delay=datetime.timedelta(0, 300), retry_exponential_backoff=False, max_retry_delay=None, start_date=None, end_date=None, schedule_interval=None, depends_on_past=False, wait_for_downstream=False, dag=None, params=None, default_args=None, adhoc=False, priority_weight=1, weight_rule=u'downstream', queue='default', pool=None, sla=None, execution_timeout=None, on_failure_callback=None, on_success_callback=None, on_retry_callback=None, trigger_rule=u'all_success', resources=None, run_as_user=None, task_concurrency=None, executor_config=None, inlets=None, outlets=None, *args, **kwargs)[source]
Bases: airflow.utils.log.logging_mixin.LoggingMixin
Abstract base class for all operators. Since operators create objects that become nodes in the dag, BaseOperator contains many recursive methods for dag crawling behavior. To derive this class, you are expected to override the constructor as well as the 'execute' method.
Operators derived from this class should perform or trigger certain tasks synchronously (wait for completion). Example of operators could be an operator that runs a Pig job (PigOperator), a sensor operator that waits for a partition to land in Hive (HiveSensorOperator), or one that moves data from Hive to MySQL (Hive2MySqlOperator). Instances of these operators (tasks) target specific operations, running specific scripts, functions or data transfers.
This class is abstract and shouldn’t be instantiated. Instantiating a class derived from this one results in the creation of a task object, which ultimately becomes a node in DAG objects. Task dependencies should be set by using the set_upstream and/or set_downstream methods.
Parameters
· task_id (str) – a unique meaningful id for the task
· owner (str) – the owner of the task using the unix username is recommended
· retries (int) – the number of retries that should be performed before failing the task
· retry_delay (timedelta) – delay between retries
· retry_exponential_backoff (bool) – allow progressively longer waits between retries by using an exponential backoff algorithm on retry delay (delay will be converted into seconds)
· max_retry_delay (timedelta) – maximum delay interval between retries
· start_date (datetime) – The start_date for the task determines the execution_date for the first task instance. The best practice is to have the start_date rounded to your DAG’s schedule_interval. Daily jobs have their start_date some day at 00:00:00, hourly jobs have their start_date at 00:00 of a specific hour. Note that Airflow simply looks at the latest execution_date and adds the schedule_interval to determine the next execution_date. It is also very important to note that different tasks’ dependencies need to line up in time. If task A depends on task B and their start_date are offset in a way that their execution_date don’t line up, A’s dependencies will never be met. If you are looking to delay a task, for example running a daily task at 2AM, look into the TimeSensor and TimeDeltaSensor. We advise against using dynamic start_date and recommend using fixed ones. Read the FAQ entry about start_date for more information.
· end_date (datetime) – if specified the scheduler won’t go beyond this date
· depends_on_past (bool) – when set to true task instances will run sequentially while relying on the previous task’s schedule to succeed The task instance for the start_date is allowed to run
· wait_for_downstream (bool) – when set to true an instance of task X will wait for tasks immediately downstream of the previous instance of task X to finish successfully before it runs This is useful if the different instances of a task X alter the same asset and this asset is used by tasks downstream of task X Note that depends_on_past is forced to True wherever wait_for_downstream is used
· queue (str) – which queue to target when running this job Not all executors implement queue management the CeleryExecutor does support targeting specific queues
· dag (DAG) – a reference to the dag the task is attached to (if any)
· priority_weight (int) – priority weight of this task against other task This allows the executor to trigger higher priority tasks before others when things get backed up Set priority_weight as a higher number for more important tasks
· weight_rule (str) – weighting method used for the effective total priority weight of the task. Options are: { downstream | upstream | absolute }, default is downstream. When set to downstream, the effective weight of the task is the aggregate sum of all downstream descendants. As a result, upstream tasks will have higher weight and will be scheduled more aggressively when using positive weight values. This is useful when you have multiple dag run instances and desire to have all upstream tasks complete for all runs before each dag can continue processing downstream tasks. When set to upstream, the effective weight is the aggregate sum of all upstream ancestors. This is the opposite, where downstream tasks have higher weight and will be scheduled more aggressively when using positive weight values. This is useful when you have multiple dag run instances and prefer to have each dag complete before starting upstream tasks of other dags. When set to absolute, the effective weight is the exact priority_weight specified without additional weighting. You may want to do this when you know exactly what priority weight each task should have. Additionally, when set to absolute, there is a bonus effect of significantly speeding up the task creation process for very large DAGs. Options can be set as string or using the constants defined in the static class airflow.utils.WeightRule.
· pool (str) – the slot pool this task should run in slot pools are a way to limit concurrency for certain tasks
· sla (datetime.timedelta) – time by which the job is expected to succeed. Note that this represents the timedelta after the period is closed. For example, if you set an SLA of 1 hour, the scheduler would send an email soon after 1:00AM on 2016-01-02 if the 2016-01-01 instance has not succeeded yet. The scheduler pays special attention for jobs with an SLA and sends alert emails for SLA misses. SLA misses are also recorded in the database for future reference. All tasks that share the same SLA time get bundled in a single email, sent soon after that time. SLA notifications are sent once and only once for each task instance.
· execution_timeout (datetime.timedelta) – max time allowed for the execution of this task instance; if it goes beyond, it will raise and fail
· on_failure_callback (callable) – a function to be called when a task instance of this task fails a context dictionary is passed as a single parameter to this function Context contains references to related objects to the task instance and is documented under the macros section of the API
· on_retry_callback (callable) – much like the on_failure_callback except that it is executed when retries occur
· on_success_callback (callable) – much like the on_failure_callback except that it is executed when the task succeeds
· trigger_rule (str) – defines the rule by which dependencies are applied for the task to get triggered. Options are: { all_success | all_failed | all_done | one_success | one_failed | dummy }, default is all_success. Options can be set as string or using the constants defined in the static class airflow.utils.TriggerRule.
· resources (dict) – A map of resource parameter names (the argument names of the Resources constructor) to their values
· run_as_user (str) – unix username to impersonate while running the task
· task_concurrency (int) – When set a task will be able to limit the concurrent runs across execution_dates
· executor_config (dict) –
Additional tasklevel configuration parameters that are interpreted by a specific executor Parameters are namespaced by the name of executor
Example: to run this task in a specific docker container through the KubernetesExecutor:
MyOperator(
    executor_config={
        "KubernetesExecutor":
            {"image": "myCustomDockerImage"}
    }
)
clear(**kwargs)[source]
Clears the state of task instances associated with the task following the parameters specified
dag
Returns the Operator’s DAG if set otherwise raises an error
deps
Returns the list of dependencies for the operator These differ from execution context dependencies in that they are specific to tasks and can be extendedoverridden by subclasses
downstream_list
@property list of tasks directly downstream
execute(context)[source]
This is the main method to derive when creating an operator Context is the same dictionary used as when rendering jinja templates
Refer to get_template_context for more context
get_direct_relative_ids(upstream=False)[source]
Get the direct relative ids to the current task, upstream or downstream.
get_direct_relatives(upstream=False)[source]
Get the direct relatives to the current task, upstream or downstream.
get_flat_relative_ids(upstream=False, found_descendants=None)[source]
Get a flat list of relatives’ ids, either upstream or downstream.
get_flat_relatives(upstream=False)[source]
Get a flat list of relatives, either upstream or downstream.
get_task_instances(session, start_date=None, end_date=None)[source]
Get a set of task instances related to this task for a specific date range.
has_dag()[source]
Returns True if the Operator has been assigned to a DAG
on_kill()[source]
Override this method to cleanup subprocesses when a task instance gets killed Any use of the threading subprocess or multiprocessing module within an operator needs to be cleaned up or it will leave ghost processes behind
post_execute(context, *args, **kwargs)[source]
This hook is triggered right after self.execute() is called. It is passed the execution context and any results returned by the operator.
pre_execute(context, *args, **kwargs)[source]
This hook is triggered right before self.execute() is called.
prepare_template()[source]
Hook that is triggered after the templated fields get replaced by their content If you need your operator to alter the content of the file before the template is rendered it should override this method to do so
render_template(attr, content, context)[source]
Renders a template either from a file or directly in a field, and returns the rendered result.
render_template_from_field(attr, content, context, jinja_env)[source]
Renders a template from a field. If the field is a string, it will simply render the string and return the result. If it is a collection or nested set of collections, it will traverse the structure and render all strings in it.
run(start_date=None, end_date=None, ignore_first_depends_on_past=False, ignore_ti_state=False, mark_success=False)[source]
Run a set of task instances for a date range
schedule_interval
The schedule interval of the DAG always wins over individual tasks so that tasks within a DAG always line up The task still needs a schedule_interval as it may not be attached to a DAG
set_downstream(task_or_task_list)[source]
Set a task or a task list to be directly downstream from the current task
set_upstream(task_or_task_list)[source]
Set a task or a task list to be directly upstream from the current task
upstream_list
@property list of tasks directly upstream
xcom_pull(context, task_ids=None, dag_id=None, key=u'return_value', include_prior_dates=None)[source]
See TaskInstance.xcom_pull()
xcom_push(context, key, value, execution_date=None)[source]
See TaskInstance.xcom_push()
BaseSensorOperator
All sensors are derived from BaseSensorOperator. All sensors inherit the timeout and poke_interval on top of the BaseOperator attributes.
class airflow.sensors.base_sensor_operator.BaseSensorOperator(poke_interval=60, timeout=604800, soft_fail=False, mode='poke', *args, **kwargs)[source]
Bases: airflow.models.BaseOperator, airflow.models.SkipMixin
Sensor operators are derived from this class and inherit these attributes
Sensor operators keep executing at a time interval and succeed when a criteria is met and fail if and when they time out
Parameters
· soft_fail (bool) – Set to true to mark the task as SKIPPED on failure
· poke_interval (int) – Time in seconds that the job should wait in between each try
· timeout (int) – Time in seconds before the task times out and fails
· mode (str) – How the sensor operates. Options are: { poke | reschedule }, default is poke. When set to poke the sensor is taking up a worker slot for its whole execution time and sleeps between pokes. Use this mode if the expected runtime of the sensor is short or if a short poke interval is required. When set to reschedule the sensor task frees the worker slot when the criteria is not yet met, and it’s rescheduled at a later time. Use this mode if the time before the criteria is met is expected to be quite long. The poke interval should be more than one minute to prevent too much load on the scheduler.
deps
Adds one additional dependency for all sensor operators that checks if a sensor task instance can be rescheduled
poke(context)[source]
Function that the sensors defined while deriving this class should override
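The poke-mode behavior described above can be modeled with a short loop. This is an illustrative sketch, not Airflow's implementation: run_sensor, criterion_met, and the state dict are all hypothetical names introduced here.

```python
import time

def run_sensor(poke, poke_interval=60, timeout=604800,
               clock=time.monotonic, sleep=time.sleep):
    # Simplified model of poke mode: call poke() every poke_interval
    # seconds until it returns True; fail once timeout is exceeded.
    started = clock()
    while not poke():
        if clock() - started > timeout:
            raise TimeoutError('sensor timed out')
        sleep(poke_interval)
    return True

# Simulated run: the criterion is met on the third poke. poke_interval=0
# and a no-op sleep keep the example instant.
state = {'pokes': 0}
def criterion_met():
    state['pokes'] += 1
    return state['pokes'] >= 3

print(run_sensor(criterion_met, poke_interval=0, sleep=lambda s: None))  # True
print(state['pokes'])  # 3
```

In real Airflow a subclass only implements poke(context); the surrounding loop, timeout handling, and (in reschedule mode) worker-slot release are handled by BaseSensorOperator.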
Core Operators
Operators
class airflow.operators.bash_operator.BashOperator(bash_command, xcom_push=False, env=None, output_encoding='utf-8', *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Execute a Bash script, command or set of commands.
Parameters
· bash_command (str) – The command, set of commands or reference to a bash script (must be '.sh') to be executed (templated)
· xcom_push (bool) – If xcom_push is True, the last line written to stdout will also be pushed to an XCom when the bash command completes
· env (dict) – If env is not None, it must be a mapping that defines the environment variables for the new process; these are used instead of inheriting the current process environment, which is the default behavior (templated)
· output_encoding (str) – Output encoding of the bash command
execute(context)[source]
Execute the bash command in a temporary directory, which will be cleaned afterwards.
class airflow.operators.python_operator.BranchPythonOperator(python_callable, op_args=None, op_kwargs=None, provide_context=False, templates_dict=None, templates_exts=None, *args, **kwargs)[source]
Bases: airflow.operators.python_operator.PythonOperator, airflow.models.SkipMixin
Allows a workflow to "branch" or follow a single path following the execution of this task.
It derives the PythonOperator and expects a Python function that returns the task_id to follow. The task_id returned should point to a task directly downstream from {self}. All other "branches" or directly downstream tasks are marked with a state of skipped so that these paths can’t move forward. The skipped states are propagated downstream to allow for the DAG state to fill up and the DAG run’s state to be inferred.
Note that using tasks with depends_on_past=True downstream from BranchPythonOperator is logically unsound, as skipped status will invariably block tasks that depend on their past successes. Skipped states propagate where all directly upstream tasks are skipped.
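A sketch of the kind of python_callable BranchPythonOperator expects, run outside Airflow for illustration. choose_branch, the hour key, and the task_ids are hypothetical names; the last two lines model (not implement) how non-chosen downstream tasks end up skipped.

```python
# Hypothetical branch callable: it returns the task_id of the single
# directly-downstream task that should run.
def choose_branch(**context):
    # Illustrative condition; a real callable might inspect
    # context['execution_date'] or external state.
    hour = context.get('hour', 0)
    return 'daytime_task' if 6 <= hour < 18 else 'nighttime_task'

downstream = {'daytime_task', 'nighttime_task'}
chosen = choose_branch(hour=10)
skipped = downstream - {chosen}   # the branch not taken gets state=skipped

print(chosen)   # daytime_task
print(skipped)  # {'nighttime_task'}
```

In a DAG, both daytime_task and nighttime_task would be set directly downstream of the BranchPythonOperator, and the operator itself applies the skipped state to the branch not returned.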
class airflow.operators.check_operator.CheckOperator(sql, conn_id=None, *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Performs checks against a db. The CheckOperator expects a sql query that will return a single row. Each value on that first row is evaluated using python bool casting. If any of the values return False, the check is failed and errors out.
Note that Python bool casting evals the following as False:
· False
· 0
· Empty string ("")
· Empty list ([])
· Empty dictionary or set ({})
Given a query like SELECT COUNT(*) FROM foo, it will fail only if the count == 0. You can craft much more complex query that could, for instance, check that the table has the same number of rows as the source table upstream, or that the count of today’s partition is greater than yesterday’s partition, or that a set of metrics are less than 3 standard deviation for the 7 day average.
This operator can be used as a data quality check in your pipeline, and depending on where you put it in your DAG, you have the choice to stop the critical path, preventing from publishing dubious data, or on the side and receive email alerts without stopping the progress of the DAG.
Note that this is an abstract class and get_db_hook needs to be defined, where get_db_hook is a hook that gets a single record from an external source.
Parameters
sql (str) – the sql to be executed (templated)
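The pass/fail rule above reduces to Python truthiness over the first result row. A minimal sketch (check_row is a hypothetical helper, not part of the Airflow API):

```python
# CheckOperator-style rule: every value in the first row returned by the
# query must be truthy under Python bool casting.
def check_row(row):
    return all(bool(v) for v in row)

print(check_row((42,)))         # True: e.g. SELECT COUNT(*) returned 42
print(check_row((0,)))          # False: a zero count fails the check
print(check_row(('', [], {})))  # False: empty string/list/dict all fail
```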
class airflow.operators.docker_operator.DockerOperator(image, api_version=None, command=None, cpus=1.0, docker_url='unix://var/run/docker.sock', environment=None, force_pull=False, mem_limit=None, network_mode=None, tls_ca_cert=None, tls_client_cert=None, tls_client_key=None, tls_hostname=None, tls_ssl_version=None, tmp_dir='/tmp/airflow', user=None, volumes=None, working_dir=None, xcom_push=False, xcom_all=False, docker_conn_id=None, dns=None, dns_search=None, *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Execute a command inside a docker container
A temporary directory is created on the host and mounted into a container to allow storing files that together exceed the default disk size of 10GB in a container The path to the mounted directory can be accessed via the environment variable AIRFLOW_TMP_DIR
If a login to a private registry is required prior to pulling the image a Docker connection needs to be configured in Airflow and the connection ID be provided with the parameter docker_conn_id
Parameters
· image (str) – Docker image from which to create the container If image tag is omitted latest will be used
· api_version (str) – Remote API version Set to auto to automatically detect the server’s version
· command (str or list) – Command to be run in the container (templated)
· cpus (float) – Number of CPUs to assign to the container. This value gets multiplied with 1024. See https://docs.docker.com/engine/reference/run/#cpu-share-constraint
· dns (list of strings) – Docker custom DNS servers
· dns_search (list of strings) – Docker custom DNS search domain
· docker_url (str) – URL of the host running the docker daemon. Default is unix://var/run/docker.sock
· environment (dict) – Environment variables to set in the container (templated)
· force_pull (bool) – Pull the docker image on every run Default is False
· mem_limit (float or str) – Maximum amount of memory the container can use Either a float value which represents the limit in bytes or a string like 128m or 1g
· network_mode (str) – Network mode for the container
· tls_ca_cert (str) – Path to a PEMencoded certificate authority to secure the docker connection
· tls_client_cert (str) – Path to the PEMencoded certificate used to authenticate docker client
· tls_client_key (str) – Path to the PEMencoded key used to authenticate docker client
· tls_hostname (str or bool) – Hostname to match against the docker server certificate or False to disable the check
· tls_ssl_version (str) – Version of SSL to use when communicating with docker daemon
· tmp_dir (str) – Mount point inside the container to a temporary directory created on the host by the operator The path is also made available via the environment variable AIRFLOW_TMP_DIR inside the container
· user (int or str) – Default user inside the docker container
· volumes – List of volumes to mount into the container, e.g. ['/host/path:/container/path', '/host/path2:/container/path2:ro']
· working_dir (str) – Working directory to set on the container (equivalent to the -w switch of the docker client)
· xcom_push (bool) – Whether the stdout will be pushed to the next step using XCom. The default is False.
· xcom_all (bool) – Push all the stdout or just the last line. The default is False (last line).
· docker_conn_id (str) – ID of the Airflow connection to use
class airflow.operators.dummy_operator.DummyOperator(*args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Operator that does literally nothing It can be used to group tasks in a DAG
class airflow.operators.druid_check_operator.DruidCheckOperator(sql, druid_broker_conn_id='druid_broker_default', *args, **kwargs)[source]
Bases: airflow.operators.check_operator.CheckOperator
Performs checks against Druid. The DruidCheckOperator expects a sql query that will return a single row. Each value on that first row is evaluated using python bool casting. If any of the values return False, the check is failed and errors out.
Note that Python bool casting evals the following as False:
· False
· 0
· Empty string ("")
· Empty list ([])
· Empty dictionary or set ({})
Given a query like SELECT COUNT(*) FROM foo, it will fail only if the count == 0. You can craft much more complex query that could, for instance, check that the table has the same number of rows as the source table upstream, or that the count of today’s partition is greater than yesterday’s partition, or that a set of metrics are less than 3 standard deviation for the 7 day average. This operator can be used as a data quality check in your pipeline, and depending on where you put it in your DAG, you have the choice to stop the critical path, preventing from publishing dubious data, or on the side and receive email alerts without stopping the progress of the DAG.
Parameters
· sql (str) – the sql to be executed
· druid_broker_conn_id (str) – reference to the druid broker
get_db_hook()[source]
Return the druid db api hook
get_first(sql)[source]
Executes the druid sql to druid broker and returns the first resulting row
Parameters
sql (str) – the sql statement to be executed (str)
class airflow.operators.email_operator.EmailOperator(to, subject, html_content, files=None, cc=None, bcc=None, mime_subtype='mixed', mime_charset='utf8', *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Sends an email
Parameters
· to (list or string (comma or semicolon delimited)) – list of emails to send the email to (templated)
· subject (str) – subject line for the email (templated)
· html_content (str) – content of the email html markup is allowed (templated)
· files (list) – file names to attach in email
· cc (list or string (comma or semicolon delimited)) – list of recipients to be added in CC field
· bcc (list or string (comma or semicolon delimited)) – list of recipients to be added in BCC field
· mime_subtype (str) – MIME sub content type
· mime_charset (str) – character set parameter added to the Content-Type header
class airflow.operators.generic_transfer.GenericTransfer(sql, destination_table, source_conn_id, destination_conn_id, preoperator=None, *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Moves data from one connection to another, assuming that they both provide the required methods in their respective hooks. The source hook needs to expose a get_records method, and the destination an insert_rows method.
This is meant to be used on smallish datasets that fit in memory
Parameters
· sql (str) – SQL query to execute against the source database (templated)
· destination_table (str) – target table (templated)
· source_conn_id (str) – source connection
· destination_conn_id (str) – destination connection
· preoperator (str or list of str) – sql statement or list of statements to be executed prior to loading the data (templated)
class airflow.operators.hive_to_druid.HiveToDruidTransfer(sql, druid_datasource, ts_dim, metric_spec=None, hive_cli_conn_id='hive_cli_default', druid_ingest_conn_id='druid_ingest_default', metastore_conn_id='metastore_default', hadoop_dependency_coordinates=None, intervals=None, num_shards=-1, target_partition_size=-1, query_granularity='NONE', segment_granularity='DAY', hive_tblproperties=None, job_properties=None, *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Moves data from Hive to Druid.
Parameters
· sql (str) – SQL query to execute against the Druid database (templated)
· druid_datasource (str) – the datasource you want to ingest into in druid
· ts_dim (str) – the timestamp dimension
· metric_spec (list) – the metrics you want to define for your data
· hive_cli_conn_id (str) – the hive connection id
· druid_ingest_conn_id (str) – the druid ingest connection id
· metastore_conn_id (str) – the metastore connection id
· hadoop_dependency_coordinates (list of str) – list of coordinates to squeeze into the ingest json
· intervals (list) – list of time intervals that defines segments this is passed as is to the json object (templated)
· hive_tblproperties (dict) – additional properties for tblproperties in hive for the staging table
· job_properties (dict) – additional properties for job
construct_ingest_query(static_path columns)[source]
Builds an ingest query for an HDFS TSV load
Parameters
· static_path (str) – The path on hdfs where the data is
· columns (list) – List of all the columns that are available
class airflow.operators.hive_to_mysql.HiveToMySqlTransfer(sql, mysql_table, hiveserver2_conn_id='hiveserver2_default', mysql_conn_id='mysql_default', mysql_preoperator=None, mysql_postoperator=None, bulk_load=False, *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Moves data from Hive to MySQL. Note that for now the data is loaded into memory before being pushed to MySQL, so this operator should be used for smallish amounts of data.
Parameters
· sql (str) – SQL query to execute against Hive server (templated)
· mysql_table (str) – target MySQL table use dot notation to target a specific database (templated)
· mysql_conn_id (str) – destination mysql connection
· hiveserver2_conn_id (str) – source hive connection
· mysql_preoperator (str) – sql statement to run against mysql prior to import; typically used to truncate or delete in place of the data coming in, allowing the task to be idempotent (running the task twice won't double load data) (templated)
· mysql_postoperator (str) – sql statement to run against mysql after the import typically used to move data from staging to production and issue cleanup commands (templated)
· bulk_load (bool) – flag to use the bulk_load option. This loads mysql directly from a tab-delimited text file using the LOAD DATA LOCAL INFILE command. This option requires an extra connection parameter for the destination MySQL connection: {"local_infile": true}
class airflow.operators.hive_operator.HiveOperator(hql, hive_cli_conn_id=u'hive_cli_default', schema=u'default', hiveconfs=None, hiveconf_jinja_translate=False, script_begin_tag=None, run_as_owner=False, mapred_queue=None, mapred_queue_priority=None, mapred_job_name=None, *args, **kwargs)[source]
Bases airflowmodelsBaseOperator
Executes hql code or hive script in a specific Hive database
Parameters
· hql (str) – the hql to be executed. Note that you may also use a relative path from the dag file of a (template) hive script. (templated)
· hive_cli_conn_id (str) – reference to the Hive database (templated)
· hiveconfs (dict) – if defined, these key value pairs will be passed to hive as -hiveconf "key"="value"
· hiveconf_jinja_translate (bool) – when True, hiveconf-type templating ${var} gets translated into jinja-type templating {{ var }} and ${hiveconf:var} gets translated into jinja-type templating {{ var }}. Note that you may want to use this along with the DAG(user_defined_macros=myargs) parameter. View the DAG object documentation for more details.
· script_begin_tag (str) – If defined the operator will get rid of the part of the script before the first occurrence of script_begin_tag
· mapred_queue (str) – queue used by the Hadoop CapacityScheduler (templated)
· mapred_queue_priority (str) – priority within CapacityScheduler queue Possible settings include VERY_HIGH HIGH NORMAL LOW VERY_LOW
· mapred_job_name (str) – This name will appear in the jobtracker This can make monitoring easier
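The hiveconf_jinja_translate rewrite described above can be sketched as a plain regex substitution (a simplified illustration, not Airflow's actual implementation):

```python
import re

def hiveconf_to_jinja(hql):
    """Rewrite ${hiveconf:var} and ${var} into jinja-style {{ var }}.

    Simplified sketch of the hiveconf_jinja_translate behavior; the
    real operator handles more cases.
    """
    return re.sub(r"\$\{(hiveconf:)?([a-zA-Z_][a-zA-Z0-9_]*)\}",
                  r"{{ \2 }}", hql)

print(hiveconf_to_jinja("SELECT * FROM t WHERE ds = '${hiveconf:ds}'"))
# → SELECT * FROM t WHERE ds = '{{ ds }}'
```

After this rewrite the statement can be rendered by the normal jinja templating engine.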
class airflow.operators.hive_stats_operator.HiveStatsCollectionOperator(table, partition, extra_exprs=None, col_blacklist=None, assignment_func=None, metastore_conn_id='metastore_default', presto_conn_id='presto_default', mysql_conn_id='airflow_db', *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Gathers partition statistics using a dynamically generated Presto query and inserts the stats into a MySql table with this format. Stats overwrite themselves if you rerun the same date/partition.
CREATE TABLE hive_stats (
    ds VARCHAR(16),
    table_name VARCHAR(500),
    metric VARCHAR(200),
    value BIGINT
);
Parameters
· table (str) – the source table, in the format database.table_name (templated)
· partition (dict of {col: value}) – the source partition (templated)
· extra_exprs (dict) – dict of expression to run against the table where keys are metric names and values are Presto compatible expressions
· col_blacklist (list) – list of columns to blacklist; consider blacklisting blobs, large json columns, …
· assignment_func (function) – a function that receives a column name and a type and returns a dict of metric names and a Presto expression. If None is returned, the global defaults are applied. If an empty dictionary is returned, no stats are computed for that column.
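An assignment_func might look like the following sketch (the metric names and Presto expressions here are illustrative, not defaults of the operator):

```python
def assignment_func(col, col_type):
    """Map a column name/type to {metric_name: presto_expression}.

    Returning None falls back to the global defaults; returning {}
    skips stats for that column (illustrative sketch).
    """
    if col_type in ("double", "bigint"):
        return {col + "__sum": "SUM({})".format(col),
                col + "__avg": "AVG({})".format(col)}
    if col == "id":
        return {}          # no stats for this column
    return None            # use the global defaults
```

The operator calls this once per column and merges the returned expressions into its generated Presto query.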
class airflow.operators.check_operator.IntervalCheckOperator(table, metrics_thresholds, date_filter_column='ds', days_back=-7, conn_id=None, *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Checks that the values of metrics given as SQL expressions are within a certain tolerance of the ones from days_back before.
Note that this is an abstract class and get_db_hook needs to be defined, where get_db_hook is a hook that gets a single record from an external source.
Parameters
· table (str) – the table name
· days_back (int) – number of days between ds and the ds we want to check against. Defaults to 7 days.
· metrics_thresholds (dict) – a dictionary of ratios indexed by metrics
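The tolerance check can be illustrated as a ratio comparison between today's metric and the one from days_back before (a sketch of the idea, not the operator's exact code):

```python
def within_tolerance(current, reference, max_ratio):
    """True when current/reference (or its inverse) stays below max_ratio.

    Sketch of an IntervalCheckOperator-style ratio test; the ratio is
    taken in whichever direction is larger so growth and shrinkage are
    treated symmetrically.
    """
    if reference == 0 or current == 0:
        return current == reference
    ratio = max(current / reference, reference / current)
    return ratio < max_ratio

# e.g. today's row count vs. 7 days back, allowed to vary by 1.5x
print(within_tolerance(1200, 1000, 1.5))  # → True (ratio 1.2)
```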
class airflow.operators.latest_only_operator.LatestOnlyOperator(task_id, owner='Airflow', email=None, email_on_retry=True, email_on_failure=True, retries=0, retry_delay=datetime.timedelta(0, 300), retry_exponential_backoff=False, max_retry_delay=None, start_date=None, end_date=None, schedule_interval=None, depends_on_past=False, wait_for_downstream=False, dag=None, params=None, default_args=None, adhoc=False, priority_weight=1, weight_rule=u'downstream', queue='default', pool=None, sla=None, execution_timeout=None, on_failure_callback=None, on_success_callback=None, on_retry_callback=None, trigger_rule=u'all_success', resources=None, run_as_user=None, task_concurrency=None, executor_config=None, inlets=None, outlets=None, *args, **kwargs)[source]
Bases: airflow.models.BaseOperator, airflow.models.SkipMixin
Allows a workflow to skip tasks that are not running during the most recent schedule interval
If the task is run outside of the latest schedule interval all directly downstream tasks will be skipped
class airflow.operators.mssql_operator.MsSqlOperator(sql, mssql_conn_id='mssql_default', parameters=None, autocommit=False, database=None, *args, **kwargs)[source]
Bases airflowmodelsBaseOperator
Executes sql code in a specific Microsoft SQL database
Parameters
· sql (str or string pointing to a template file with .sql extension (templated)) – the sql code to be executed
· mssql_conn_id (str) – reference to a specific mssql database
· parameters (mapping or iterable) – (optional) the parameters to render the SQL query with
· autocommit (bool) – if True each command is automatically committed (default value False)
· database (str) – name of database which overwrites the one defined in the connection
class airflow.operators.mssql_to_hive.MsSqlToHiveTransfer(sql, hive_table, create=True, recreate=False, partition=None, delimiter=u'\x01', mssql_conn_id='mssql_default', hive_cli_conn_id='hive_cli_default', tblproperties=None, *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Moves data from Microsoft SQL Server to Hive. The operator runs your query against Microsoft SQL Server and stores the file locally before loading it into a Hive table. If the create or recreate arguments are set to True, CREATE TABLE and DROP TABLE statements are generated. Hive data types are inferred from the cursor's metadata. Note that the table generated in Hive uses STORED AS textfile, which isn't the most efficient serialization format. If a large amount of data is loaded and/or if the table gets queried considerably, you may want to use this operator only to stage the data into a temporary table before loading it into its final destination using a HiveOperator.
Parameters
· sql (str) – SQL query to execute against the Microsoft SQL Server database (templated)
· hive_table (str) – target Hive table use dot notation to target a specific database (templated)
· create (bool) – whether to create the table if it doesn’t exist
· recreate (bool) – whether to drop and recreate the table at every execution
· partition (dict) – target partition as a dict of partition columns and values (templated)
· delimiter (str) – field delimiter in the file
· mssql_conn_id (str) – source Microsoft SQL Server connection
· hive_cli_conn_id (str) – destination hive connection
· tblproperties (dict) – TBLPROPERTIES of the hive table being created
class airflow.operators.mysql_operator.MySqlOperator(sql, mysql_conn_id='mysql_default', parameters=None, autocommit=False, database=None, *args, **kwargs)[source]
Bases airflowmodelsBaseOperator
Executes sql code in a specific MySQL database
Parameters
· sql (Can receive a str representing a sql statement, a list of str (sql statements), or a reference to a template file. Template references are recognized by str ending in '.sql') – the sql code to be executed (templated)
· mysql_conn_id (str) – reference to a specific mysql database
· parameters (mapping or iterable) – (optional) the parameters to render the SQL query with
· autocommit (bool) – if True each command is automatically committed (default value False)
· database (str) – name of database which overwrites the one defined in the connection
class airflow.operators.mysql_to_hive.MySqlToHiveTransfer(sql, hive_table, create=True, recreate=False, partition=None, delimiter=u'\x01', mysql_conn_id='mysql_default', hive_cli_conn_id='hive_cli_default', tblproperties=None, *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Moves data from MySql to Hive. The operator runs your query against MySQL and stores the file locally before loading it into a Hive table. If the create or recreate arguments are set to True, CREATE TABLE and DROP TABLE statements are generated. Hive data types are inferred from the cursor's metadata. Note that the table generated in Hive uses STORED AS textfile, which isn't the most efficient serialization format. If a large amount of data is loaded and/or if the table gets queried considerably, you may want to use this operator only to stage the data into a temporary table before loading it into its final destination using a HiveOperator.
Parameters
· sql (str) – SQL query to execute against the MySQL database (templated)
· hive_table (str) – target Hive table use dot notation to target a specific database (templated)
· create (bool) – whether to create the table if it doesn’t exist
· recreate (bool) – whether to drop and recreate the table at every execution
· partition (dict) – target partition as a dict of partition columns and values (templated)
· delimiter (str) – field delimiter in the file
· mysql_conn_id (str) – source mysql connection
· hive_cli_conn_id (str) – destination hive connection
· tblproperties (dict) – TBLPROPERTIES of the hive table being created
class airflow.operators.pig_operator.PigOperator(pig, pig_cli_conn_id='pig_cli_default', pigparams_jinja_translate=False, *args, **kwargs)[source]
Bases airflowmodelsBaseOperator
Executes pig script
Parameters
· pig (str) – the pig latin script to be executed (templated)
· pig_cli_conn_id (str) – reference to the Pig database
· pigparams_jinja_translate (bool) – when True, pig params-type templating ${var} gets translated into jinja-type templating {{ var }}. Note that you may want to use this along with the DAG(user_defined_macros=myargs) parameter. View the DAG object documentation for more details.
class airflow.operators.postgres_operator.PostgresOperator(sql, postgres_conn_id='postgres_default', autocommit=False, parameters=None, database=None, *args, **kwargs)[source]
Bases airflowmodelsBaseOperator
Executes sql code in a specific Postgres database
Parameters
· sql (Can receive a str representing a sql statement, a list of str (sql statements), or a reference to a template file. Template references are recognized by str ending in '.sql') – the sql code to be executed (templated)
· postgres_conn_id (str) – reference to a specific postgres database
· autocommit (bool) – if True each command is automatically committed (default value False)
· parameters (mapping or iterable) – (optional) the parameters to render the SQL query with
· database (str) – name of database which overwrites the one defined in the connection
class airflow.operators.presto_check_operator.PrestoCheckOperator(sql, presto_conn_id='presto_default', *args, **kwargs)[source]
Bases: airflow.operators.check_operator.CheckOperator
Performs checks against Presto. The PrestoCheckOperator expects a sql query that will return a single row. Each value on that first row is evaluated using python bool casting. If any of the values return False, the check fails and errors out.
Note that Python bool casting evals the following as False
· False
· 0
· Empty string ('')
· Empty list ([])
· Empty dictionary or set ({})
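The bool-casting rules above can be checked directly in plain Python:

```python
# every value the check treats as a failure is falsy under bool()
for value in (False, 0, "", [], {}, set()):
    assert not bool(value)

# so a COUNT(*) result of 0 fails the check,
# while any non-zero count passes
assert bool(0) is False
assert bool(42) is True
```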
Given a query like SELECT COUNT(*) FROM foo, it will fail only if the count == 0. You can craft much more complex queries that could, for instance, check that the table has the same number of rows as the source table upstream, or that the count of today's partition is greater than yesterday's partition, or that a set of metrics are less than 3 standard deviations from the 7 day average.
This operator can be used as a data quality check in your pipeline, and depending on where you put it in your DAG, you have the choice to stop the critical path, preventing it from publishing dubious data, or on the side and receive email alerts without stopping the progress of the DAG.
Parameters
· sql (str) – the sql to be executed
· presto_conn_id (str) – reference to the Presto database
class airflow.operators.presto_check_operator.PrestoIntervalCheckOperator(table, metrics_thresholds, date_filter_column='ds', days_back=-7, presto_conn_id='presto_default', *args, **kwargs)[source]
Bases: airflow.operators.check_operator.IntervalCheckOperator
Checks that the values of metrics given as SQL expressions are within a certain tolerance of the ones from days_back before.
Parameters
· table (str) – the table name
· days_back (int) – number of days between ds and the ds we want to check against. Defaults to 7 days.
· metrics_thresholds (dict) – a dictionary of ratios indexed by metrics
· presto_conn_id (str) – reference to the Presto database
class airflow.operators.presto_to_mysql.PrestoToMySqlTransfer(sql, mysql_table, presto_conn_id='presto_default', mysql_conn_id='mysql_default', mysql_preoperator=None, *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Moves data from Presto to MySQL. Note that for now the data is loaded into memory before being pushed to MySQL, so this operator should be used for smallish amounts of data.
Parameters
· sql (str) – SQL query to execute against Presto (templated)
· mysql_table (str) – target MySQL table use dot notation to target a specific database (templated)
· mysql_conn_id (str) – source mysql connection
· presto_conn_id (str) – source presto connection
· mysql_preoperator (str) – sql statement to run against mysql prior to import; typically used to truncate or delete in place of the data coming in, allowing the task to be idempotent (running the task twice won't double load data) (templated)
class airflow.operators.presto_check_operator.PrestoValueCheckOperator(sql, pass_value, tolerance=None, presto_conn_id='presto_default', *args, **kwargs)[source]
Bases: airflow.operators.check_operator.ValueCheckOperator
Performs a simple value check using sql code
Parameters
· sql (str) – the sql to be executed
· presto_conn_id (str) – reference to the Presto database
class airflow.operators.python_operator.PythonOperator(python_callable, op_args=None, op_kwargs=None, provide_context=False, templates_dict=None, templates_exts=None, *args, **kwargs)[source]
Bases airflowmodelsBaseOperator
Executes a Python callable
Parameters
· python_callable (python callable) – A reference to an object that is callable
· op_kwargs (dict) – a dictionary of keyword arguments that will get unpacked in your function
· op_args (list) – a list of positional arguments that will get unpacked when calling your callable
· provide_context (bool) – if set to true, Airflow will pass a set of keyword arguments that can be used in your function. This set of kwargs corresponds exactly to what you can use in your jinja templates. For this to work, you need to define **kwargs in your function header.
· templates_dict (dict of str) – a dictionary where the values are templates that will get templated by the Airflow engine sometime between __init__ and execute takes place and are made available in your callable’s context after the template has been applied (templated)
· templates_exts (list(str)) – a list of file extensions to resolve while processing templated fields for examples ['sql' 'hql']
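A python_callable for use with provide_context=True might look like the following minimal sketch (the task name and DAG wiring are illustrative and omitted):

```python
def print_execution_date(**kwargs):
    """With provide_context=True, Airflow passes template-context
    keys (ds, execution_date, ...) as keyword arguments."""
    ds = kwargs.get("ds")  # e.g. '2021-07-08'
    return "processing partition {}".format(ds)

# hypothetical wiring inside a DAG file:
# PythonOperator(task_id='print_date',
#                python_callable=print_execution_date,
#                provide_context=True, dag=dag)
print(print_execution_date(ds="2021-07-08"))
# → processing partition 2021-07-08
```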
class airflow.operators.python_operator.PythonVirtualenvOperator(python_callable, requirements=None, python_version=None, use_dill=False, system_site_packages=True, op_args=None, op_kwargs=None, string_args=None, templates_dict=None, templates_exts=None, *args, **kwargs)[source]
Bases: airflow.operators.python_operator.PythonOperator
Allows one to run a function in a virtualenv that is created and destroyed automatically (with certain caveats).
The function must be defined using def and not be part of a class. All imports must happen inside the function and no variables outside of the scope may be referenced. A global scope variable named virtualenv_string_args will be available (populated by string_args). In addition, one can pass stuff through op_args and op_kwargs, and one can use a return value. Note that if your virtualenv runs in a different Python major version than Airflow, you cannot use return values, op_args, or op_kwargs. You can use string_args though.
Parameters
· python_callable (function) – A python function with no references to outside variables, defined with def, which will be run in a virtualenv
· requirements (list(str)) – A list of requirements as specified in a pip install command
· python_version (str) – The Python version to run the virtualenv with. Note that both '2' and '2.7' are acceptable forms.
· use_dill (bool) – Whether to use dill to serialize the args and result (pickle is default) This allow more complex types but requires you to include dill in your requirements
· system_site_packages (bool) – Whether to include system_site_packages in your virtualenv See virtualenv documentation for more information
· op_args – A list of positional arguments to pass to python_callable
· op_kwargs (dict) – A dict of keyword arguments to pass to python_callable
· string_args (list(str)) – Strings that are present in the global var virtualenv_string_args available to python_callable at runtime as a list(str) Note that args are split by newline
· templates_dict (dict of str) – a dictionary where the values are templates that will get templated by the Airflow engine sometime between __init__ and execute takes place and are made available in your callable’s context after the template has been applied
· templates_exts (list(str)) – a list of file extensions to resolve while processing templated fields for examples ['sql' 'hql']
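A callable suitable for PythonVirtualenvOperator keeps its imports inside the function body and references no outer variables, e.g. (function name and wiring are illustrative):

```python
def summarize(numbers):
    # all imports must live inside the function; no outer-scope
    # names may be referenced when run inside the virtualenv
    import statistics
    return {"mean": statistics.mean(numbers),
            "stdev": statistics.pstdev(numbers)}

# hypothetical wiring inside a DAG file:
# PythonVirtualenvOperator(task_id='summarize',
#                          python_callable=summarize,
#                          op_args=[[1, 2, 3, 4]],
#                          requirements=[], dag=dag)
```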
class airflow.operators.s3_file_transform_operator.S3FileTransformOperator(source_s3_key, dest_s3_key, transform_script=None, select_expression=None, source_aws_conn_id='aws_default', source_verify=None, dest_aws_conn_id='aws_default', dest_verify=None, replace=False, *args, **kwargs)[source]
Bases airflowmodelsBaseOperator
Copies data from a source S3 location to a temporary location on the local filesystem Runs a transformation on this file as specified by the transformation script and uploads the output to a destination S3 location
The locations of the source and the destination files in the local filesystem are provided as the first and second arguments to the transformation script. The transformation script is expected to read the data from source, transform it, and write the output to the local destination file. The operator then takes over control and uploads the local destination file to S3.
S3 Select is also available to filter the source contents Users can omit the transformation script if S3 Select expression is specified
Parameters
· source_s3_key (str) – The key to be retrieved from S3 (templated)
· source_aws_conn_id (str) – source s3 connection
· source_verify (bool or str) –
Whether or not to verify SSL certificates for S3 connection. By default SSL certificates are verified. You can provide the following values:
o False: do not validate SSL certificates. SSL will still be used (unless use_ssl is False), but SSL certificates will not be verified.
o path/to/cert/bundle.pem: A filename of the CA cert bundle to use. You can specify this argument if you want to use a different CA cert bundle than the one used by botocore.
This is also applicable to dest_verify.
· dest_s3_key (str) – The key to be written to S3 (templated)
· dest_aws_conn_id (str) – destination s3 connection
· replace (bool) – Replace dest S3 key if it already exists
· transform_script (str) – location of the executable transformation script
· select_expression (str) – S3 Select expression
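A transform script receives the local source and destination paths as its first two command-line arguments. A minimal, hypothetical uppercase transform might look like this:

```python
#!/usr/bin/env python
import sys

def transform(source_path, dest_path):
    """Read the downloaded S3 object, transform it, and write the
    result where the operator will pick it up for upload."""
    with open(source_path) as src, open(dest_path, "w") as dst:
        for line in src:
            dst.write(line.upper())

if __name__ == "__main__" and len(sys.argv) >= 3:
    transform(sys.argv[1], sys.argv[2])
```

The operator invokes the script as `transform_script source dest`, then uploads the destination file to dest_s3_key.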
class airflow.operators.s3_to_hive_operator.S3ToHiveTransfer(s3_key, field_dict, hive_table, delimiter=',', create=True, recreate=False, partition=None, headers=False, check_headers=False, wildcard_match=False, aws_conn_id='aws_default', verify=None, hive_cli_conn_id='hive_cli_default', input_compressed=False, tblproperties=None, select_expression=None, *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Moves data from S3 to Hive. The operator downloads a file from S3 and stores the file locally before loading it into a Hive table. If the create or recreate arguments are set to True, CREATE TABLE and DROP TABLE statements are generated. Hive data types are inferred from the cursor's metadata.
Note that the table generated in Hive uses STORED AS textfile, which isn't the most efficient serialization format. If a large amount of data is loaded and/or if the table gets queried considerably, you may want to use this operator only to stage the data into a temporary table before loading it into its final destination using a HiveOperator.
Parameters
· s3_key (str) – The key to be retrieved from S3 (templated)
· field_dict (dict) – A dictionary of the fields name in the file as keys and their Hive types as values
· hive_table (str) – target Hive table use dot notation to target a specific database (templated)
· create (bool) – whether to create the table if it doesn’t exist
· recreate (bool) – whether to drop and recreate the table at every execution
· partition (dict) – target partition as a dict of partition columns and values (templated)
· headers (bool) – whether the file contains column names on the first line
· check_headers (bool) – whether the column names on the first line should be checked against the keys of field_dict
· wildcard_match (bool) – whether the s3_key should be interpreted as a Unix wildcard pattern
· delimiter (str) – field delimiter in the file
· aws_conn_id (str) – source s3 connection
· hive_cli_conn_id (str) – destination hive connection
· input_compressed (bool) – Boolean to determine if file decompression is required to process headers
· tblproperties (dict) – TBLPROPERTIES of the hive table being created
· select_expression (str) – S3 Select expression
Parameter verify –
Whether or not to verify SSL certificates for S3 connection. By default SSL certificates are verified. You can provide the following values:
· False: do not validate SSL certificates. SSL will still be used (unless use_ssl is False), but SSL certificates will not be verified.
· path/to/cert/bundle.pem: A filename of the CA cert bundle to use. You can specify this argument if you want to use a different CA cert bundle than the one used by botocore.
class airflow.operators.s3_to_redshift_operator.S3ToRedshiftTransfer(schema, table, s3_bucket, s3_key, redshift_conn_id='redshift_default', aws_conn_id='aws_default', verify=None, copy_options=(), autocommit=False, parameters=None, *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Executes a COPY command to load files from s3 to Redshift
Parameters
· schema (str) – reference to a specific schema in redshift database
· table (str) – reference to a specific table in redshift database
· s3_bucket (str) – reference to a specific S3 bucket
· s3_key (str) – reference to a specific S3 key
· redshift_conn_id (str) – reference to a specific redshift database
· aws_conn_id (str) – reference to a specific S3 connection
· copy_options (list) – reference to a list of COPY options
Parameter verify –
Whether or not to verify SSL certificates for S3 connection. By default SSL certificates are verified. You can provide the following values:
· False: do not validate SSL certificates. SSL will still be used (unless use_ssl is False), but SSL certificates will not be verified.
· path/to/cert/bundle.pem: A filename of the CA cert bundle to use. You can specify this argument if you want to use a different CA cert bundle than the one used by botocore.
class airflow.operators.python_operator.ShortCircuitOperator(python_callable, op_args=None, op_kwargs=None, provide_context=False, templates_dict=None, templates_exts=None, *args, **kwargs)[source]
Bases: airflow.operators.python_operator.PythonOperator, airflow.models.SkipMixin
Allows a workflow to continue only if a condition is met. Otherwise, the workflow short-circuits and downstream tasks are skipped.
The ShortCircuitOperator is derived from the PythonOperator. It evaluates a condition and short-circuits the workflow if the condition is False. Any downstream tasks are marked with a state of 'skipped'. If the condition is True, downstream tasks proceed as normal.
The condition is determined by the result of python_callable
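The condition callable simply returns a truthy or falsy value. For example, a hypothetical weekday gate (the function name and wiring are illustrative):

```python
import datetime

def is_weekday(execution_date=None, **kwargs):
    """Return True Monday-Friday; downstream tasks are skipped
    whenever this returns a falsy value."""
    day = (execution_date or datetime.date.today()).weekday()
    return day < 5  # Monday=0 ... Friday=4

# hypothetical wiring inside a DAG file:
# ShortCircuitOperator(task_id='weekdays_only',
#                      python_callable=is_weekday,
#                      provide_context=True, dag=dag)
```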
class airflow.operators.http_operator.SimpleHttpOperator(endpoint, method='POST', data=None, headers=None, response_check=None, extra_options=None, xcom_push=False, http_conn_id='http_default', *args, **kwargs)[source]
Bases airflowmodelsBaseOperator
Calls an endpoint on an HTTP system to execute an action
Parameters
· http_conn_id (str) – The connection to run the operator against
· endpoint (str) – The relative part of the full url (templated)
· method (str) – The HTTP method to use default POST
· data (For POST/PUT, depends on the content-type parameter; for GET, a dictionary of key/value string pairs) – The data to pass. POST-data in POST/PUT and params in the URL for a GET request. (templated)
· headers (a dictionary of string keyvalue pairs) – The HTTP headers to be added to the GET request
· response_check (A lambda or defined function) – A check against the requests' response object. Returns True for 'pass' and False otherwise.
· extra_options (A dictionary of options where key is string and value depends on the option that's being modified) – Extra options for the requests’ library see the requests’ documentation (options to modify timeout ssl etc)
class airflow.operators.slack_operator.SlackAPIOperator(slack_conn_id=None, token=None, method=None, api_params=None, *args, **kwargs)[source]
Bases airflowmodelsBaseOperator
Base Slack Operator The SlackAPIPostOperator is derived from this operator In the future additional Slack API Operators will be derived from this class as well
Parameters
· slack_conn_id (str) – Slack connection ID, whose password is the Slack API token
· token (str) – Slack API token (https://api.slack.com/web)
· method (str) – The Slack API Method to Call (https://api.slack.com/methods)
· api_params (dict) – API Method call parameters (https://api.slack.com/methods)
construct_api_call_params()[source]
Used by the execute function Allows templating on the source fields of the api_call_params dict before construction
Override in child classes. Each SlackAPIOperator child class is responsible for having a construct_api_call_params function which sets self.api_call_params with a dict of API call parameters (https://api.slack.com/methods)
execute(**kwargs)[source]
SlackAPIOperator calls will not fail even if the call is not successful. It should not prevent a DAG from completing in success.
class airflow.operators.slack_operator.SlackAPIPostOperator(channel='#general', username='Airflow', text='No message has been set.\nHere is a cat video instead\nhttps://www.youtube.com/watch?v=J---aiyznGQ', icon_url='https://raw.githubusercontent.com/apache/incubator-airflow/master/airflow/www/static/pin_100.jpg', attachments=None, *args, **kwargs)[source]
Bases airflowoperatorsslack_operatorSlackAPIOperator
Posts messages to a slack channel
Parameters
· channel (str) – channel in which to post message on slack; name (#general) or ID (C12318391) (templated)
· username (str) – Username that airflow will be posting to Slack as (templated)
· text (str) – message to send to slack (templated)
· icon_url (str) – url to icon used for this message
· attachments (array of hashes) – extra formatting details (templated); see https://api.slack.com/docs/attachments
construct_api_call_params()[source]
Used by the execute function Allows templating on the source fields of the api_call_params dict before construction
Override in child classes. Each SlackAPIOperator child class is responsible for having a construct_api_call_params function which sets self.api_call_params with a dict of API call parameters (https://api.slack.com/methods)
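A child class's construct_api_call_params typically just assembles the method parameters into a dict. A simplified, free-standing sketch of what a post-message override would build (the helper name is hypothetical; the real override sets self.api_call_params):

```python
import json

def construct_post_params(channel, username, text, attachments=None):
    """Assemble chat.postMessage-style parameters the way a
    construct_api_call_params override would (illustrative)."""
    params = {"channel": channel, "username": username, "text": text}
    if attachments is not None:
        # the Slack API expects attachments serialized as JSON
        params["attachments"] = json.dumps(attachments)
    return params
```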
class airflow.operators.sqlite_operator.SqliteOperator(sql, sqlite_conn_id='sqlite_default', parameters=None, *args, **kwargs)[source]
Bases airflowmodelsBaseOperator
Executes sql code in a specific Sqlite database
Parameters
· sql (str or string pointing to a template file. File must have a '.sql' extension.) – the sql code to be executed (templated)
· sqlite_conn_id (str) – reference to a specific sqlite database
· parameters (mapping or iterable) – (optional) the parameters to render the SQL query with
class airflow.operators.subdag_operator.SubDagOperator(**kwargs)[source]
Bases airflowmodelsBaseOperator
This runs a sub dag. By convention, a sub dag's dag_id should be prefixed by its parent and a dot, as in 'parent.child'.
Parameters
· subdag (airflow.DAG) – the DAG object to run as a subdag of the current DAG
· dag (airflow.DAG) – the parent DAG for the subdag
· executor (airflow.executors) – the executor for this subdag. Defaults to SequentialExecutor. Please find AIRFLOW-74 for more details.
class airflow.operators.dagrun_operator.TriggerDagRunOperator(trigger_dag_id, python_callable=None, execution_date=None, *args, **kwargs)[source]
Bases airflowmodelsBaseOperator
Triggers a DAG run for a specified dag_id
Parameters
· trigger_dag_id (str) – the dag_id to trigger
· python_callable (python callable) – a reference to a python function that will be called while passing it the context object and a placeholder object obj for your callable to fill and return if you want a DagRun created. This obj object contains a run_id and payload attribute that you can modify in your function. The run_id should be a unique identifier for that DAG run, and the payload has to be a picklable object that will be made available to your tasks while executing that DAG run. Your function header should look like def foo(context, dag_run_obj)
· execution_date (datetime.datetime) – Execution date for the dag
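Such a python_callable fills the placeholder object and returns it when a DagRun should be created, or returns None to skip. A sketch (the param name should_trigger and the payload are illustrative):

```python
def conditionally_trigger(context, dag_run_obj):
    """Return dag_run_obj to create the DagRun, or None to skip."""
    if context.get("params", {}).get("should_trigger"):
        dag_run_obj.run_id = "triggered_run"
        dag_run_obj.payload = {"message": "hello"}
        return dag_run_obj
    return None
```

The operator calls this with the task context and a fresh placeholder; whatever payload you attach is made available to the tasks of the triggered DAG run.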
class airflow.operators.check_operator.ValueCheckOperator(sql, pass_value, tolerance=None, conn_id=None, *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Performs a simple value check using sql code
Note that this is an abstract class and get_db_hook needs to be defined, where get_db_hook is a hook that gets a single record from an external source.
Parameters
sql (str) – the sql to be executed (templated)
class airflow.operators.redshift_to_s3_operator.RedshiftToS3Transfer(schema, table, s3_bucket, s3_key, redshift_conn_id='redshift_default', aws_conn_id='aws_default', verify=None, unload_options=(), autocommit=False, include_header=False, *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Executes an UNLOAD command to s3 as a CSV with headers
Parameters
· schema (str) – reference to a specific schema in redshift database
· table (str) – reference to a specific table in redshift database
· s3_bucket (str) – reference to a specific S3 bucket
· s3_key (str) – reference to a specific S3 key
· redshift_conn_id (str) – reference to a specific redshift database
· aws_conn_id (str) – reference to a specific S3 connection
· unload_options (list) – reference to a list of UNLOAD options
Parameter verify –
Whether or not to verify SSL certificates for S3 connection. By default SSL certificates are verified. You can provide the following values:
· False: do not validate SSL certificates. SSL will still be used (unless use_ssl is False), but SSL certificates will not be verified.
· path/to/cert/bundle.pem: A filename of the CA cert bundle to use. You can specify this argument if you want to use a different CA cert bundle than the one used by botocore.
Sensors
class airflow.sensors.external_task_sensor.ExternalTaskSensor(external_dag_id, external_task_id, allowed_states=None, execution_delta=None, execution_date_fn=None, *args, **kwargs)[source]
Bases: airflow.sensors.base_sensor_operator.BaseSensorOperator
Waits for a task to complete in a different DAG.
Parameters
· external_dag_id (str) – The dag_id that contains the task you want to wait for
· external_task_id (str) – The task_id of the task you want to wait for
· allowed_states (list) – list of allowed states, default is ['success']
· execution_delta (datetime.timedelta) – time difference with the previous execution to look at; the default is the same execution_date as the current task. For yesterday, use [positive] datetime.timedelta(days=1). Either execution_delta or execution_date_fn can be passed to ExternalTaskSensor, but not both.
· execution_date_fn (callable) – function that receives the current execution date and returns the desired execution dates to query. Either execution_delta or execution_date_fn can be passed to ExternalTaskSensor, but not both.
poke(**kwargs)[source]
Function that the sensors defined while deriving this class should override.
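To make the execution_delta semantics concrete, the sketch below (plain Python, no Airflow required; the dates are illustrative) shows which upstream execution_date the sensor would look at:

```python
from datetime import datetime, timedelta

# The execution_date of the current (waiting) task.
current_execution_date = datetime(2016, 1, 2)

# execution_delta=timedelta(days=1) makes the sensor look at the
# upstream DAG run from one day earlier ("yesterday").
execution_delta = timedelta(days=1)
target_execution_date = current_execution_date - execution_delta

print(target_execution_date)  # 2016-01-01 00:00:00
```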
class airflow.sensors.hdfs_sensor.HdfsSensor(filepath, hdfs_conn_id='hdfs_default', ignored_ext=None, ignore_copying=True, file_size=None, hook, *args, **kwargs)[source]
Bases: airflow.sensors.base_sensor_operator.BaseSensorOperator
Waits for a file or folder to land in HDFS.
static filter_for_filesize(result, size=None)[source]
Will test the filepath result and test if its size is at least self.filesize.
Parameters
· result – a list of dicts returned by Snakebite ls
· size – the file size in MB a file should be at least to trigger True
Returns
(bool) depending on the matching criteria
static filter_for_ignored_ext(result, ignored_ext, ignore_copying)[source]
Will filter, if instructed to do so, the result to remove matching criteria.
Parameters
· result – (list) of dicts returned by Snakebite ls
· ignored_ext – (list) of ignored extensions
· ignore_copying – (bool) shall we ignore
Returns
(list) of dicts which were not removed
poke(context)[source]
Function that the sensors defined while deriving this class should override.
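The size filter above can be pictured as a plain list filter over the dicts returned by Snakebite ls. The sketch below is a standalone approximation of that logic (the "length" key and the byte counts are assumptions for illustration, not the sensor's actual field names):

```python
def filter_for_filesize(result, size=None):
    """Keep only entries whose size is at least `size` MB (sketch)."""
    if size is None:
        return result
    min_bytes = size * 1024 ** 2  # MB -> bytes
    return [entry for entry in result if entry.get("length", 0) >= min_bytes]

files = [{"path": "/data/a", "length": 5 * 1024 ** 2},
         {"path": "/data/b", "length": 512}]
big = filter_for_filesize(files, size=1)
print([f["path"] for f in big])  # ['/data/a']
```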
class airflow.sensors.hive_partition_sensor.HivePartitionSensor(table, partition="ds='{{ ds }}'", metastore_conn_id='metastore_default', schema='default', poke_interval=180, *args, **kwargs)[source]
Bases: airflow.sensors.base_sensor_operator.BaseSensorOperator
Waits for a partition to show up in Hive.
Note: Because partition supports general logical operators, it can be inefficient. Consider using NamedHivePartitionSensor instead if you don't need the full flexibility of HivePartitionSensor.
Parameters
· table (str) – The name of the table to wait for, supports the dot notation (my_database.my_table)
· partition (str) – The partition clause to wait for. This is passed as is to the metastore Thrift client get_partitions_by_filter method, and apparently supports SQL like notation as in ds='2015-01-01' AND type='value' and comparison operators as in ds>=2015-01-01
· metastore_conn_id (str) – reference to the metastore thrift service connection id
poke(context)[source]
Function that the sensors defined while deriving this class should override.
class airflow.sensors.http_sensor.HttpSensor(endpoint, http_conn_id='http_default', method='GET', request_params=None, headers=None, response_check=None, extra_options=None, *args, **kwargs)[source]
Bases: airflow.sensors.base_sensor_operator.BaseSensorOperator
Executes a HTTP GET statement and returns False on failure caused by 404 Not Found or by response_check returning False.
HTTP error codes other than 404 (like 403) or a Connection Refused error will fail the sensor itself directly (no more poking).
Parameters
· http_conn_id (str) – The connection to run the sensor against
· method (str) – The HTTP request method to use
· endpoint (str) – The relative part of the full url
· request_params (a dictionary of string key/value pairs) – The parameters to be added to the GET url
· headers (a dictionary of string key/value pairs) – The HTTP headers to be added to the GET request
· response_check (a lambda or defined function) – A check against the requests' response object. Returns True for 'pass' and False otherwise
· extra_options (a dictionary of options, where key is string and value depends on the option that's being modified) – Extra options for the requests library, see the requests documentation (options to modify timeout, ssl, etc.)
poke(context)[source]
Function that the sensors defined while deriving this class should override.
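A response_check is just a callable applied to the HTTP response. The minimal illustration below uses a stand-in object instead of a live HTTP call, so it runs anywhere; the check itself is a hypothetical example, not part of the sensor:

```python
from types import SimpleNamespace

# A response_check receives the requests Response and returns True/False.
response_check = lambda response: "ok" in response.text

# A simple namespace stands in for a real requests.Response here.
fake_response = SimpleNamespace(text='{"status": "ok"}')
print(response_check(fake_response))  # True
```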
class airflow.sensors.metastore_partition_sensor.MetastorePartitionSensor(table, partition_name, schema='default', mysql_conn_id='metastore_mysql', *args, **kwargs)[source]
Bases: airflow.sensors.sql_sensor.SqlSensor
An alternative to the HivePartitionSensor that talks directly to the MySQL db. This was created as a result of observing sub-optimal queries generated by the Metastore thrift service when hitting subpartitioned tables. The Thrift service's queries were written in a way that would not leverage the indexes.
Parameters
· schema (str) – the schema
· table (str) – the table
· partition_name (str) – the partition name, as defined in the PARTITIONS table of the Metastore. Order of the fields does matter. Examples: ds=2016-01-01 or ds=2016-01-01/sub=foo for a sub partitioned table
· mysql_conn_id (str) – a reference to the MySQL conn_id for the metastore
poke(context)[source]
Function that the sensors defined while deriving this class should override.
class airflow.sensors.named_hive_partition_sensor.NamedHivePartitionSensor(partition_names, metastore_conn_id='metastore_default', poke_interval=180, hook=None, *args, **kwargs)[source]
Bases: airflow.sensors.base_sensor_operator.BaseSensorOperator
Waits for a set of partitions to show up in Hive.
Parameters
· partition_names (list of strings) – List of fully qualified names of the partitions to wait for. A fully qualified name is of the form schema.table/pk1=pv1/pk2=pv2, for example, default.users/ds=2016-01-01. This is passed as is to the metastore Thrift client get_partitions_by_name method. Note that you cannot use logical or comparison operators as in HivePartitionSensor.
· metastore_conn_id (str) – reference to the metastore thrift service connection id
poke(context)[source]
Function that the sensors defined while deriving this class should override.
class airflow.sensors.s3_key_sensor.S3KeySensor(bucket_key, bucket_name=None, wildcard_match=False, aws_conn_id='aws_default', verify=None, *args, **kwargs)[source]
Bases: airflow.sensors.base_sensor_operator.BaseSensorOperator
Waits for a key (a file-like instance on S3) to be present in a S3 bucket. S3 being a key/value store, it does not support folders. The path is just a key/a resource.
Parameters
· bucket_key (str) – The key being waited on. Supports full s3:// style url or relative path from root level. When it's specified as a full s3:// url, please leave bucket_name as None
· bucket_name (str) – Name of the S3 bucket. Only needed when bucket_key is not provided as a full s3:// url
· wildcard_match (bool) – whether the bucket_key should be interpreted as a Unix wildcard pattern
· aws_conn_id (str) – a reference to the s3 connection
· verify (bool or str) – Whether or not to verify SSL certificates for S3 connection. By default SSL certificates are verified. You can provide the following values:
· False: do not validate SSL certificates. SSL will still be used (unless use_ssl is False), but SSL certificates will not be verified.
· path/to/cert/bundle.pem: a filename of the CA cert bundle to use. You can specify this argument if you want to use a different CA cert bundle than the one used by botocore.
poke(context)[source]
Function that the sensors defined while deriving this class should override.
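When wildcard_match=True, the bucket_key is treated as a Unix wildcard pattern. Python's standard fnmatch module implements the same glob semantics, so the matching can be sketched without S3 (the keys and pattern below are made up for illustration):

```python
from fnmatch import fnmatch

pattern = "logs/2021-*/part-*.csv"
keys = ["logs/2021-07/part-0000.csv", "logs/2020-12/part-0000.csv"]

# Keep only the keys matching the Unix wildcard pattern.
matches = [k for k in keys if fnmatch(k, pattern)]
print(matches)  # ['logs/2021-07/part-0000.csv']
```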
class airflow.sensors.s3_prefix_sensor.S3PrefixSensor(bucket_name, prefix, delimiter='/', aws_conn_id='aws_default', verify=None, *args, **kwargs)[source]
Bases: airflow.sensors.base_sensor_operator.BaseSensorOperator
Waits for a prefix to exist. A prefix is the first part of a key, thus enabling checking of constructs similar to glob airfl* or SQL LIKE 'airfl%'. There is the possibility to precise a delimiter to indicate the hierarchy of keys, meaning that the match will stop at that delimiter. Current code accepts sane delimiters, i.e. characters that are NOT special characters in the Python regex engine.
Parameters
· bucket_name (str) – Name of the S3 bucket
· prefix (str) – The prefix being waited on. Relative path from bucket root level.
· delimiter (str) – The delimiter intended to show hierarchy. Defaults to '/'.
· aws_conn_id (str) – a reference to the s3 connection
· verify (bool or str) – Whether or not to verify SSL certificates for S3 connection. By default SSL certificates are verified. You can provide the following values:
· False: do not validate SSL certificates. SSL will still be used (unless use_ssl is False), but SSL certificates will not be verified.
· path/to/cert/bundle.pem: a filename of the CA cert bundle to use. You can specify this argument if you want to use a different CA cert bundle than the one used by botocore.
poke(context)[source]
Function that the sensors defined while deriving this class should override.
class airflow.sensors.sql_sensor.SqlSensor(conn_id, sql, *args, **kwargs)[source]
Bases: airflow.sensors.base_sensor_operator.BaseSensorOperator
Runs a sql statement until a criteria is met. It will keep trying while sql returns no row, or if the first cell is in (0, '0', '').
Parameters
· conn_id (str) – The connection to run the sensor against
· sql – The sql to run. To pass, it needs to return at least one cell that contains a non-zero / empty string value.
poke(context)[source]
Function that the sensors defined while deriving this class should override.
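The success criteria above can be expressed as a small predicate over the query result. This is a standalone sketch of the logic, not the sensor's actual code:

```python
def criteria_met(rows):
    """True once the query returns a row whose first cell is not
    one of 0, '0', '' (sketch of SqlSensor's poke criteria)."""
    if not rows:          # no rows yet: keep poking
        return False
    first_cell = rows[0][0]
    return first_cell not in (0, "0", "")

print(criteria_met([]))        # False - no rows
print(criteria_met([("0",)]))  # False - first cell is '0'
print(criteria_met([(42,)]))   # True  - criteria met
```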
class airflow.sensors.time_sensor.TimeSensor(target_time, *args, **kwargs)[source]
Bases: airflow.sensors.base_sensor_operator.BaseSensorOperator
Waits until the specified time of the day.
Parameters
target_time (datetime.time) – time after which the job succeeds
poke(context)[source]
Function that the sensors defined while deriving this class should override.
class airflow.sensors.time_delta_sensor.TimeDeltaSensor(delta, *args, **kwargs)[source]
Bases: airflow.sensors.base_sensor_operator.BaseSensorOperator
Waits for a timedelta after the task's execution_date + schedule_interval. In Airflow, the daily task stamped with execution_date 2016-01-01 can only start running on 2016-01-02. The timedelta here represents the time after the execution period has closed.
Parameters
delta (datetime.timedelta) – time length to wait after execution_date before succeeding
poke(context)[source]
Function that the sensors defined while deriving this class should override.
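The wait target can be computed directly as execution_date + schedule_interval + delta. A standalone illustration with a daily schedule (the concrete values are made up):

```python
from datetime import datetime, timedelta

execution_date = datetime(2016, 1, 1)   # start of the execution period
schedule_interval = timedelta(days=1)   # daily DAG
delta = timedelta(hours=2)              # extra wait after the period closes

# The sensor succeeds only after this moment:
target = execution_date + schedule_interval + delta
print(target)  # 2016-01-02 02:00:00
```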
class airflow.sensors.web_hdfs_sensor.WebHdfsSensor(filepath, webhdfs_conn_id='webhdfs_default', *args, **kwargs)[source]
Bases: airflow.sensors.base_sensor_operator.BaseSensorOperator
Waits for a file or folder to land in HDFS.
poke(context)[source]
Function that the sensors defined while deriving this class should override.
Community-contributed Operators
Operators
class airflow.contrib.operators.awsbatch_operator.AWSBatchOperator(job_name, job_definition, job_queue, overrides, max_retries=4200, aws_conn_id=None, region_name=None, **kwargs)[source]
Bases: airflow.models.BaseOperator
Execute a job on AWS Batch Service.
Parameters
· job_name (str) – the name for the job that will run on AWS Batch (templated)
· job_definition (str) – the job definition name on AWS Batch
· job_queue (str) – the queue name on AWS Batch
· overrides (dict) – the same parameter that boto3 will receive on containerOverrides (templated): http://boto3.readthedocs.io/en/latest/reference/services/batch.html#submit_job
· max_retries (int) – exponential backoff retries while waiter is not merged, 4200 = 48 hours
· aws_conn_id (str) – connection id of AWS credentials / region name. If None, the credential boto3 strategy will be used (http://boto3.readthedocs.io/en/latest/guide/configuration.html)
· region_name (str) – region name to use in AWS Hook. Overrides the region_name in connection (if provided)
class airflow.contrib.operators.bigquery_check_operator.BigQueryCheckOperator(sql, bigquery_conn_id='bigquery_default', use_legacy_sql=True, *args, **kwargs)[source]
Bases: airflow.operators.check_operator.CheckOperator
Performs checks against BigQuery. The BigQueryCheckOperator expects a sql query that will return a single row. Each value on that first row is evaluated using python bool casting. If any of the values return False, the check is failed and errors out.
Note that Python bool casting evals the following as False:
· False
· 0
· Empty string ("")
· Empty list ([])
· Empty dictionary or set ({})
Given a query like SELECT COUNT(*) FROM foo, it will fail only if the count == 0. You can craft much more complex queries that could, for instance, check that the table has the same number of rows as the source table upstream, or that the count of today's partition is greater than yesterday's partition, or that a set of metrics are less than 3 standard deviations from the 7 day average.
This operator can be used as a data quality check in your pipeline, and depending on where you put it in your DAG, you have the choice to stop the critical path, preventing from publishing dubious data, or on the side and receive email alerts without stopping the progress of the DAG.
Parameters
· sql (str) – the sql to be executed
· bigquery_conn_id (str) – reference to the BigQuery database
· use_legacy_sql (bool) – Whether to use legacy SQL (true) or standard SQL (false)
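The bool-casting rule above is plain Python truthiness: the check fails if any cell in the first returned row is falsy. A standalone sketch of that evaluation (not the operator's actual code):

```python
def row_passes(first_row):
    """Sketch of the check: every value in the first row must be truthy."""
    return all(bool(value) for value in first_row)

print(row_passes([1, "x", [1]]))  # True  - check passes
print(row_passes([1, 0, "x"]))    # False - a zero count fails the check
print(row_passes([""]))           # False - empty string is falsy
```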
class airflow.contrib.operators.bigquery_check_operator.BigQueryValueCheckOperator(sql, pass_value, tolerance=None, bigquery_conn_id='bigquery_default', use_legacy_sql=True, *args, **kwargs)[source]
Bases: airflow.operators.check_operator.ValueCheckOperator
Performs a simple value check using sql code.
Parameters
· sql (str) – the sql to be executed
· use_legacy_sql (bool) – Whether to use legacy SQL (true) or standard SQL (false)
class airflow.contrib.operators.bigquery_check_operator.BigQueryIntervalCheckOperator(table, metrics_thresholds, date_filter_column='ds', days_back=-7, bigquery_conn_id='bigquery_default', use_legacy_sql=True, *args, **kwargs)[source]
Bases: airflow.operators.check_operator.IntervalCheckOperator
Checks that the values of metrics given as SQL expressions are within a certain tolerance of the ones from days_back before.
This method constructs a query like so:
SELECT {metrics_threshold_dict_key} FROM {table}
WHERE {date_filter_column}=<date>
Parameters
· table (str) – the table name
· days_back (int) – number of days between ds and the ds we want to check against. Defaults to 7 days
· metrics_thresholds (dict) – a dictionary of ratios indexed by metrics, for example 'COUNT(*)': 1.5 would require a 50 percent or less difference between the current day and the prior days_back
· use_legacy_sql (bool) – Whether to use legacy SQL (true) or standard SQL (false)
class airflow.contrib.operators.bigquery_get_data.BigQueryGetDataOperator(dataset_id, table_id, max_results='100', selected_fields=None, bigquery_conn_id='bigquery_default', delegate_to=None, *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Fetches the data from a BigQuery table (alternatively fetch data for selected columns) and returns data in a python list. The number of elements in the returned list will be equal to the number of rows fetched. Each element in the list will again be a list, where each element would represent the column values for that row.
Example Result: [['Tony', '10'], ['Mike', '20'], ['Steve', '15']]
Note:
If you pass fields to selected_fields which are in a different order than the order of columns already in the BQ table, the data will still be in the order of the BQ table. For example if the BQ table has 3 columns as [A,B,C] and you pass 'B,A' in the selected_fields, the data would still be of the form 'A,B'.
Example:
get_data = BigQueryGetDataOperator(
    task_id='get_data_from_bq',
    dataset_id='test_dataset',
    table_id='Transaction_partitions',
    max_results='100',
    selected_fields='DATE',
    bigquery_conn_id='airflow-service-account'
)
Parameters
· dataset_id (str) – The dataset ID of the requested table (templated)
· table_id (str) – The table ID of the requested table (templated)
· max_results (str) – The maximum number of records (rows) to be fetched from the table (templated)
· selected_fields (str) – List of fields to return (comma-separated). If unspecified, all fields are returned.
· bigquery_conn_id (str) – reference to a specific BigQuery hook
· delegate_to (str) – The account to impersonate, if any. For this to work, the service account making the request must have domain-wide delegation enabled.
class airflow.contrib.operators.bigquery_operator.BigQueryCreateEmptyTableOperator(dataset_id, table_id, project_id=None, schema_fields=None, gcs_schema_object=None, time_partitioning=None, bigquery_conn_id='bigquery_default', google_cloud_storage_conn_id='google_cloud_default', delegate_to=None, labels=None, *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Creates a new, empty table in the specified BigQuery dataset, optionally with schema.
The schema to be used for the BigQuery table may be specified in one of two ways. You may either directly pass the schema fields in, or you may point the operator to a Google cloud storage object name. The object in Google cloud storage must be a JSON file with the schema fields in it. You can also create a table without schema.
Parameters
· project_id (str) – The project to create the table into (templated)
· dataset_id (str) – The dataset to create the table into (templated)
· table_id (str) – The name of the table to be created (templated)
· schema_fields (list) – If set, the schema field list as defined here: https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs#configuration.load.schema
Example:
schema_fields=[{"name": "emp_name", "type": "STRING", "mode": "REQUIRED"},
               {"name": "salary", "type": "INTEGER", "mode": "NULLABLE"}]
· gcs_schema_object (str) – Full path to the JSON file containing schema (templated). For example: gs://test-bucket/dir1/dir2/employee_schema.json
· time_partitioning (dict) – configure optional time partitioning fields i.e. partition by field, type and expiration as per API specifications.
See also
https://cloud.google.com/bigquery/docs/reference/rest/v2/tables#timePartitioning
· bigquery_conn_id (str) – Reference to a specific BigQuery hook.
· google_cloud_storage_conn_id (str) – Reference to a specific Google cloud storage hook.
· delegate_to (str) – The account to impersonate, if any. For this to work, the service account making the request must have domain-wide delegation enabled.
· labels (dict) – a dictionary containing labels for the table, passed to BigQuery
Example (with schema JSON in GCS):
CreateTable = BigQueryCreateEmptyTableOperator(
    task_id='BigQueryCreateEmptyTableOperator_task',
    dataset_id='ODS',
    table_id='Employees',
    project_id='internal-gcp-project',
    gcs_schema_object='gs://schema-bucket/employee_schema.json',
    bigquery_conn_id='airflow-service-account',
    google_cloud_storage_conn_id='airflow-service-account'
)
Corresponding Schema file (employee_schema.json):
[
  {
    "mode": "NULLABLE",
    "name": "emp_name",
    "type": "STRING"
  },
  {
    "mode": "REQUIRED",
    "name": "salary",
    "type": "INTEGER"
  }
]
Example (with schema in the DAG):
CreateTable = BigQueryCreateEmptyTableOperator(
    task_id='BigQueryCreateEmptyTableOperator_task',
    dataset_id='ODS',
    table_id='Employees',
    project_id='internal-gcp-project',
    schema_fields=[{"name": "emp_name", "type": "STRING", "mode": "REQUIRED"},
                   {"name": "salary", "type": "INTEGER", "mode": "NULLABLE"}],
    bigquery_conn_id='airflow-service-account',
    google_cloud_storage_conn_id='airflow-service-account'
)
class airflow.contrib.operators.bigquery_operator.BigQueryCreateExternalTableOperator(bucket, source_objects, destination_project_dataset_table, schema_fields=None, schema_object=None, source_format='CSV', compression='NONE', skip_leading_rows=0, field_delimiter=',', max_bad_records=0, quote_character=None, allow_quoted_newlines=False, allow_jagged_rows=False, bigquery_conn_id='bigquery_default', google_cloud_storage_conn_id='google_cloud_default', delegate_to=None, src_fmt_configs={}, labels=None, *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Creates a new external table in the dataset with the data in Google Cloud Storage.
The schema to be used for the BigQuery table may be specified in one of two ways. You may either directly pass the schema fields in, or you may point the operator to a Google cloud storage object name. The object in Google cloud storage must be a JSON file with the schema fields in it.
Parameters
· bucket (str) – The bucket to point the external table to (templated)
· source_objects (list) – List of Google cloud storage URIs to point table to (templated). If source_format is 'DATASTORE_BACKUP', the list must only contain a single URI.
· destination_project_dataset_table (str) – The dotted (<project>.)<dataset>.<table> BigQuery table to load data into (templated). If <project> is not included, project will be the project defined in the connection json.
· schema_fields (list) – If set, the schema field list as defined here: https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs#configuration.load.schema
Example:
schema_fields=[{"name": "emp_name", "type": "STRING", "mode": "REQUIRED"},
               {"name": "salary", "type": "INTEGER", "mode": "NULLABLE"}]
Should not be set when source_format is 'DATASTORE_BACKUP'.
· schema_object (str) – If set, a GCS object path pointing to a .json file that contains the schema for the table (templated)
· source_format (str) – File format of the data
· compression (str) – [Optional] The compression type of the data source. Possible values include GZIP and NONE. The default value is NONE. This setting is ignored for Google Cloud Bigtable, Google Cloud Datastore backups and Avro formats.
· skip_leading_rows (int) – Number of rows to skip when loading from a CSV
· field_delimiter (str) – The delimiter to use for the CSV
· max_bad_records (int) – The maximum number of bad records that BigQuery can ignore when running the job
· quote_character (str) – The value that is used to quote data sections in a CSV file
· allow_quoted_newlines (bool) – Whether to allow quoted newlines (true) or not (false)
· allow_jagged_rows (bool) – Accept rows that are missing trailing optional columns. The missing values are treated as nulls. If false, records with missing trailing columns are treated as bad records, and if there are too many bad records, an invalid error is returned in the job result. Only applicable to CSV, ignored for other formats.
· bigquery_conn_id (str) – Reference to a specific BigQuery hook.
· google_cloud_storage_conn_id (str) – Reference to a specific Google cloud storage hook.
· delegate_to (str) – The account to impersonate, if any. For this to work, the service account making the request must have domain-wide delegation enabled.
· src_fmt_configs (dict) – configure optional fields specific to the source format
· labels (dict) – a dictionary containing labels for the table, passed to BigQuery
class airflow.contrib.operators.bigquery_operator.BigQueryDeleteDatasetOperator(dataset_id, project_id=None, bigquery_conn_id='bigquery_default', delegate_to=None, *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
This operator deletes an existing dataset from your Project in BigQuery: https://cloud.google.com/bigquery/docs/reference/rest/v2/datasets/delete
Parameters
· project_id (str) – The project id of the dataset
· dataset_id (str) – The dataset to be deleted
Example:
delete_temp_data = BigQueryDeleteDatasetOperator(
    dataset_id='temp-dataset',
    project_id='temp-project',
    bigquery_conn_id='_my_gcp_conn_',
    task_id='Deletetemp',
    dag=dag
)
class airflow.contrib.operators.bigquery_operator.BigQueryCreateEmptyDatasetOperator(dataset_id, project_id=None, dataset_reference=None, bigquery_conn_id='bigquery_default', delegate_to=None, *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
This operator is used to create a new dataset for your Project in BigQuery: https://cloud.google.com/bigquery/docs/reference/rest/v2/datasets#resource
Parameters
· project_id (str) – The name of the project where we want to create the dataset. Need not be provided if projectId is in dataset_reference.
· dataset_id (str) – The id of the dataset. Need not be provided if datasetId is in dataset_reference.
· dataset_reference – Dataset reference that could be provided with request body. More info: https://cloud.google.com/bigquery/docs/reference/rest/v2/datasets#resource
class airflow.contrib.operators.bigquery_operator.BigQueryOperator(sql=None, destination_dataset_table=False, write_disposition='WRITE_EMPTY', allow_large_results=False, flatten_results=None, bigquery_conn_id='bigquery_default', delegate_to=None, udf_config=False, use_legacy_sql=True, maximum_billing_tier=None, maximum_bytes_billed=None, create_disposition='CREATE_IF_NEEDED', schema_update_options=(), query_params=None, labels=None, priority='INTERACTIVE', time_partitioning=None, api_resource_configs=None, cluster_fields=None, *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Executes BigQuery SQL queries in a specific BigQuery database.
Parameters
· sql (Can receive a str representing a sql statement, a list of str (sql statements), or reference to a template file. Template references are recognized by str ending in '.sql') – the sql code to be executed (templated)
· destination_dataset_table (str) – A dotted (<project>.|<project>:)<dataset>.<table> that, if set, will store the results of the query (templated)
· write_disposition (str) – Specifies the action that occurs if the destination table already exists (default: 'WRITE_EMPTY')
· create_disposition (str) – Specifies whether the job is allowed to create new tables (default: 'CREATE_IF_NEEDED')
· allow_large_results (bool) – Whether to allow large results
· flatten_results (bool) – If true and query uses legacy SQL dialect, flattens all nested and repeated fields in the query results. allow_large_results must be true if this is set to false. For standard SQL queries, this flag is ignored and results are never flattened.
· bigquery_conn_id (str) – reference to a specific BigQuery hook
· delegate_to (str) – The account to impersonate, if any. For this to work, the service account making the request must have domain-wide delegation enabled.
· udf_config (list) – The User Defined Function configuration for the query. See https://cloud.google.com/bigquery/user-defined-functions for details.
· use_legacy_sql (bool) – Whether to use legacy SQL (true) or standard SQL (false)
· maximum_billing_tier (int) – Positive integer that serves as a multiplier of the basic price. Defaults to None, in which case it uses the value set in the project.
· maximum_bytes_billed (float) – Limits the bytes billed for this job. Queries that will have bytes billed beyond this limit will fail (without incurring a charge). If unspecified, this will be set to your project default.
· api_resource_configs (dict) – a dictionary that contains params 'configuration' applied for Google BigQuery Jobs API: https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs, for example {'query': {'useQueryCache': False}}. You could use it if you need to provide some params that are not supported by BigQueryOperator, like args.
· schema_update_options (tuple) – Allows the schema of the destination table to be updated as a side effect of the load job.
· query_params (dict) – a dictionary containing query parameter types and values, passed to BigQuery
· labels (dict) – a dictionary containing labels for the job/query, passed to BigQuery
· priority (str) – Specifies a priority for the query. Possible values include INTERACTIVE and BATCH. The default value is INTERACTIVE.
· time_partitioning (dict) – configure optional time partitioning fields i.e. partition by field, type and expiration as per API specifications.
· cluster_fields (list of str) – Request that the result of this query be stored sorted by one or more columns. This is only available in conjunction with time_partitioning. The order of columns given determines the sort order.
class airflow.contrib.operators.bigquery_table_delete_operator.BigQueryTableDeleteOperator(deletion_dataset_table, bigquery_conn_id='bigquery_default', delegate_to=None, ignore_if_missing=False, *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Deletes BigQuery tables.
Parameters
· deletion_dataset_table (str) – A dotted (<project>.|<project>:)<dataset>.<table> that indicates which table will be deleted (templated)
· bigquery_conn_id (str) – reference to a specific BigQuery hook
· delegate_to (str) – The account to impersonate, if any. For this to work, the service account making the request must have domain-wide delegation enabled.
· ignore_if_missing (bool) – if True, then return success even if the requested table does not exist
class airflow.contrib.operators.bigquery_to_bigquery.BigQueryToBigQueryOperator(source_project_dataset_tables, destination_project_dataset_table, write_disposition='WRITE_EMPTY', create_disposition='CREATE_IF_NEEDED', bigquery_conn_id='bigquery_default', delegate_to=None, labels=None, *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Copies data from one BigQuery table to another.
See also
For more details about these parameters: https://cloud.google.com/bigquery/docs/reference/v2/jobs#configuration.copy
Parameters
· source_project_dataset_tables (list|string) – One or more dotted (<project>.|<project>:)<dataset>.<table> BigQuery tables to use as the source data. If <project> is not included, project will be the project defined in the connection json. Use a list if there are multiple source tables. (templated)
· destination_project_dataset_table (str) – The destination BigQuery table. Format is: (<project>.|<project>:)<dataset>.<table> (templated)
· write_disposition (str) – The write disposition if the table already exists.
· create_disposition (str) – The create disposition if the table doesn't exist.
· bigquery_conn_id (str) – reference to a specific BigQuery hook
· delegate_to (str) – The account to impersonate, if any. For this to work, the service account making the request must have domain-wide delegation enabled.
· labels (dict) – a dictionary containing labels for the job/query, passed to BigQuery
class airflow.contrib.operators.bigquery_to_gcs.BigQueryToCloudStorageOperator(source_project_dataset_table, destination_cloud_storage_uris, compression='NONE', export_format='CSV', field_delimiter=',', print_header=True, bigquery_conn_id='bigquery_default', delegate_to=None, labels=None, *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Transfers a BigQuery table to a Google Cloud Storage bucket.
See also
For more details about these parameters: https://cloud.google.com/bigquery/docs/reference/v2/jobs
Parameters
· source_project_dataset_table (str) – The dotted (<project>.|<project>:)<dataset>.<table> BigQuery table to use as the source data. If <project> is not included, project will be the project defined in the connection json. (templated)
· destination_cloud_storage_uris (list) – The destination Google Cloud Storage URI (e.g. gs://some-bucket/some-file.txt). (templated) Follows convention defined here: https://cloud.google.com/bigquery/exporting-data-from-bigquery#exportingmultiple
· compression (str) – Type of compression to use.
· export_format (str) – File format to export.
· field_delimiter (str) – The delimiter to use when extracting to a CSV.
· print_header (bool) – Whether to print a header for a CSV file extract.
· bigquery_conn_id (str) – reference to a specific BigQuery hook
· delegate_to (str) – The account to impersonate, if any. For this to work, the service account making the request must have domain-wide delegation enabled.
· labels (dict) – a dictionary containing labels for the job/query, passed to BigQuery
class airflowcontriboperatorscassandra_to_gcsCassandraToGoogleCloudStorageOperator(cql bucket filename schema_filenameNone approx_max_file_size_bytes1900000000 cassandra_conn_idu'cassandra_default' google_cloud_storage_conn_idu'google_cloud_default' delegate_toNone *args **kwargs)[source]
Bases airflowmodelsBaseOperator
Copy data from Cassandra to Google cloud storage in JSON format
Note Arrays of arrays are not supported
classmethod convert_map_type(name, value)[source]
Converts a map to a repeated RECORD that contains two fields, ‘key’ and ‘value’; each will be converted to its corresponding data type in BQ.
classmethod convert_tuple_type(name, value)[source]
Converts a tuple to a RECORD that contains n fields, each of which will be converted to its corresponding data type in BQ and named ‘field_<index>’, where index is determined by the order of the tuple elements defined in Cassandra.
classmethod convert_user_type(name, value)[source]
Converts a user type to a RECORD that contains n fields, where n is the number of attributes. Each element in the user type class will be converted to its corresponding data type in BQ.
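The map conversion described above can be sketched in plain Python. This illustrates the documented shape only, not the real classmethod (which additionally converts each key and value to its corresponding BQ type):

```python
# Sketch of the documented convert_map_type semantics: a Cassandra map becomes
# a repeated RECORD whose elements carry 'key' and 'value' fields.
# (Illustration only; type conversion for BQ is omitted.)
def map_to_repeated_record(value):
    return [{'key': k, 'value': v} for k, v in value.items()]

rows = map_to_repeated_record({'a': 1, 'b': 2})
```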
class airflow.contrib.operators.databricks_operator.DatabricksSubmitRunOperator(json=None, spark_jar_task=None, notebook_task=None, new_cluster=None, existing_cluster_id=None, libraries=None, run_name=None, timeout_seconds=None, databricks_conn_id='databricks_default', polling_period_seconds=30, databricks_retry_limit=3, databricks_retry_delay=1, do_xcom_push=False, **kwargs)[source]
Bases: airflow.models.BaseOperator
Submits a Spark job run to Databricks using the api/2.0/jobs/runs/submit API endpoint.
There are two ways to instantiate this operator
In the first way, you can take the JSON payload that you typically use to call the api/2.0/jobs/runs/submit endpoint and pass it directly to our DatabricksSubmitRunOperator through the json parameter. For example:
json = {
    'new_cluster': {
        'spark_version': '2.1.0-db3-scala2.11',
        'num_workers': 2
    },
    'notebook_task': {
        'notebook_path': '/Users/airflow@example.com/PrepareData'
    }
}
notebook_run = DatabricksSubmitRunOperator(task_id='notebook_run', json=json)
Another way to accomplish the same thing is to use the named parameters of the DatabricksSubmitRunOperator directly. Note that there is exactly one named parameter for each top level parameter in the runs/submit endpoint. In this method your code would look like this:
new_cluster = {
    'spark_version': '2.1.0-db3-scala2.11',
    'num_workers': 2
}
notebook_task = {
    'notebook_path': '/Users/airflow@example.com/PrepareData'
}
notebook_run = DatabricksSubmitRunOperator(
    task_id='notebook_run',
    new_cluster=new_cluster,
    notebook_task=notebook_task)
In the case where both the json parameter AND the named parameters are provided they will be merged together If there are conflicts during the merge the named parameters will take precedence and override the top level json keys
Currently the named parameters that DatabricksSubmitRunOperator supports are
· spark_jar_task
· notebook_task
· new_cluster
· existing_cluster_id
· libraries
· run_name
· timeout_seconds
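The documented merge behaviour (named parameters are merged into the json payload and take precedence on conflicts) can be sketched as follows; `merge_runs_submit_payload` is a hypothetical helper, not part of the operator:

```python
# Sketch of the documented merge: named parameters override conflicting
# top-level keys of the json payload. (Illustration only.)
def merge_runs_submit_payload(json_payload, **named_params):
    merged = dict(json_payload or {})
    for key, value in named_params.items():
        if value is not None:
            merged[key] = value  # named parameter wins over the json key
    return merged

payload = merge_runs_submit_payload(
    {'run_name': 'from_json', 'timeout_seconds': 600},
    run_name='from_named_param')
```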
Parameters
· json (dict) –
A JSON object containing API parameters which will be passed directly to the api/2.0/jobs/runs/submit endpoint. The other named parameters (i.e. spark_jar_task, notebook_task) to this operator will be merged with this json dictionary if they are provided. If there are conflicts during the merge, the named parameters will take precedence and override the top level json keys. (templated)
See also
For more information about templating see Jinja Templating. https://docs.databricks.com/api/latest/jobs.html#runs-submit
· spark_jar_task (dict) –
The main class and parameters for the JAR task. Note that the actual JAR is specified in the libraries. EITHER spark_jar_task OR notebook_task should be specified. This field will be templated.
See also
https://docs.databricks.com/api/latest/jobs.html#jobssparkjartask
· notebook_task (dict) –
The notebook path and parameters for the notebook task. EITHER spark_jar_task OR notebook_task should be specified. This field will be templated.
See also
https://docs.databricks.com/api/latest/jobs.html#jobsnotebooktask
· new_cluster (dict) –
Specs for a new cluster on which this task will be run. EITHER new_cluster OR existing_cluster_id should be specified. This field will be templated.
See also
https://docs.databricks.com/api/latest/jobs.html#jobsclusterspecnewcluster
· existing_cluster_id (str) – ID for an existing cluster on which to run this task. EITHER new_cluster OR existing_cluster_id should be specified. This field will be templated.
· libraries (list of dicts) –
Libraries which this run will use. This field will be templated.
See also
https://docs.databricks.com/api/latest/libraries.html#managedlibrarieslibrary
· run_name (str) – The run name used for this task. By default this will be set to the Airflow task_id. This task_id is a required parameter of the superclass BaseOperator. This field will be templated.
· timeout_seconds (int32) – The timeout for this run. By default a value of 0 is used, which means to have no timeout. This field will be templated.
· databricks_conn_id (str) – The name of the Airflow connection to use. By default and in the common case this will be databricks_default. To use token based authentication, provide the key token in the extra field for the connection.
· polling_period_seconds (int) – Controls the rate at which we poll for the result of this run. By default the operator will poll every 30 seconds.
· databricks_retry_limit (int) – Amount of times to retry if the Databricks backend is unreachable. Its value must be greater than or equal to 1.
· databricks_retry_delay (float) – Number of seconds to wait between retries (it might be a floating point number).
· do_xcom_push (bool) – Whether we should push run_id and run_page_url to xcom.
class airflow.contrib.operators.dataflow_operator.DataFlowJavaOperator(jar, job_name='{{task.task_id}}', dataflow_default_options=None, options=None, gcp_conn_id='google_cloud_default', delegate_to=None, poll_sleep=10, job_class=None, *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Start a Java Cloud DataFlow batch job. The parameters of the operation will be passed to the job.
See also
For more detail on job submission have a look at the reference: https://cloud.google.com/dataflow/pipelines/specifying-exec-params
Parameters
· jar (str) – The reference to a self executing DataFlow jar (templated)
· job_name (str) – The ‘jobName’ to use when executing the DataFlow job (templated). This ends up being set in the pipeline options, so any entry with key 'jobName' in options will be overwritten.
· dataflow_default_options (dict) – Map of default job options
· options (dict) – Map of job specific options
· gcp_conn_id (str) – The connection ID to use connecting to Google Cloud Platform
· delegate_to (str) – The account to impersonate, if any. For this to work, the service account making the request must have domain-wide delegation enabled.
· poll_sleep (int) – The time in seconds to sleep between polling Google Cloud Platform for the dataflow job status while the job is in the JOB_STATE_RUNNING state
· job_class (str) – The name of the dataflow job class to be executed; it is often not the main class configured in the dataflow jar file.
jar, options, and job_name are templated, so you can use variables in them.
Note that both dataflow_default_options and options will be merged to specify pipeline execution parameters, and dataflow_default_options is expected to save high-level options, for instance, project and zone information, which apply to all dataflow operators in the DAG.
It’s a good practice to define dataflow_* parameters in the default_args of the dag, like the project, zone and staging location.
default_args = {
    'dataflow_default_options': {
        'project': 'my-gcp-project',
        'zone': 'europe-west1-d',
        'stagingLocation': 'gs://my-staging-bucket/staging/'
    }
}
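The merge of dataflow_default_options and options described above, with job_name overriding any 'jobName' entry, can be sketched like this (an assumption about the behaviour, not the operator's actual code):

```python
# Sketch: per the documentation, options are layered over the defaults,
# and job_name always replaces any 'jobName' entry in options.
def build_pipeline_options(dataflow_default_options, options, job_name):
    merged = dict(dataflow_default_options or {})
    merged.update(options or {})
    merged['jobName'] = job_name  # documented to win over options['jobName']
    return merged

opts = build_pipeline_options(
    {'project': 'my-gcp-project', 'zone': 'europe-west1-d'},
    {'jobName': 'ignored', 'maxNumWorkers': '50'},
    'datapflow_example')
```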
You need to pass the path to your dataflow as a file reference with the jar parameter; the jar needs to be a self executing jar (see documentation here: https://beam.apache.org/documentation/runners/dataflow/#self-executing-jar). Use options to pass on options to your job.
t1 = DataFlowJavaOperator(
    task_id='datapflow_example',
    jar='{{var.value.gcp_dataflow_base}}pipeline/build/libs/pipeline-example-1.0.jar',
    options={
        'autoscalingAlgorithm': 'BASIC',
        'maxNumWorkers': '50',
        'start': '{{ds}}',
        'partitionType': 'DAY',
        'labels': {'foo': 'bar'}
    },
    gcp_conn_id='gcp-airflow-service-account',
    dag=my_dag)
class airflow.contrib.operators.dataflow_operator.DataflowTemplateOperator(template, job_name='{{task.task_id}}', dataflow_default_options=None, parameters=None, gcp_conn_id='google_cloud_default', delegate_to=None, poll_sleep=10, *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Start a Templated Cloud DataFlow batch job. The parameters of the operation will be passed to the job.
Parameters
· template (str) – The reference to the DataFlow template
· job_name – The ‘jobName’ to use when executing the DataFlow template (templated)
· dataflow_default_options (dict) – Map of default job environment options
· parameters (dict) – Map of job specific parameters for the template
· gcp_conn_id (str) – The connection ID to use connecting to Google Cloud Platform
· delegate_to (str) – The account to impersonate, if any. For this to work, the service account making the request must have domain-wide delegation enabled.
· poll_sleep (int) – The time in seconds to sleep between polling Google Cloud Platform for the dataflow job status while the job is in the JOB_STATE_RUNNING state
It’s a good practice to define dataflow_* parameters in the default_args of the dag, like the project, zone and staging location.
See also
https://cloud.google.com/dataflow/docs/reference/rest/v1b3/LaunchTemplateParameters https://cloud.google.com/dataflow/docs/reference/rest/v1b3/RuntimeEnvironment
default_args = {
    'dataflow_default_options': {
        'project': 'my-gcp-project',
        'zone': 'europe-west1-d',
        'tempLocation': 'gs://my-staging-bucket/staging/'
    }
}
You need to pass the path to your dataflow template as a file reference with the template parameter. Use parameters to pass on parameters to your job. Use environment to pass on runtime environment variables to your job.
t1 = DataflowTemplateOperator(
    task_id='datapflow_example',
    template='{{var.value.gcp_dataflow_base}}',
    parameters={
        'inputFile': 'gs://bucket/input/my_input.txt',
        'outputFile': 'gs://bucket/output/my_output.txt'
    },
    gcp_conn_id='gcp-airflow-service-account',
    dag=my_dag)
template, dataflow_default_options, parameters, and job_name are templated, so you can use variables in them.
Note that dataflow_default_options is expected to save high-level options, for instance, project information, which apply to all dataflow operators in the DAG.
See also
https://cloud.google.com/dataflow/docs/reference/rest/v1b3/LaunchTemplateParameters https://cloud.google.com/dataflow/docs/reference/rest/v1b3/RuntimeEnvironment For more detail on job template execution have a look at the reference: https://cloud.google.com/dataflow/docs/templates/executing-templates
class airflow.contrib.operators.dataflow_operator.DataFlowPythonOperator(py_file, job_name='{{task.task_id}}', py_options=None, dataflow_default_options=None, options=None, gcp_conn_id='google_cloud_default', delegate_to=None, poll_sleep=10, *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Launching Cloud Dataflow jobs written in python. Note that both dataflow_default_options and options will be merged to specify pipeline execution parameters, and dataflow_default_options is expected to save high-level options, for instance, project and zone information, which apply to all dataflow operators in the DAG.
See also
For more detail on job submission have a look at the reference: https://cloud.google.com/dataflow/pipelines/specifying-exec-params
Parameters
· py_file (str) – Reference to the python dataflow pipeline file.py, e.g. /some/local/file/path/to/your/python/pipeline/file.py
· job_name (str) – The ‘job_name’ to use when executing the DataFlow job (templated). This ends up being set in the pipeline options, so any entry with key 'jobName' or 'job_name' in options will be overwritten.
· py_options – Additional python options
· dataflow_default_options (dict) – Map of default job options
· options (dict) – Map of job specific options
· gcp_conn_id (str) – The connection ID to use connecting to Google Cloud Platform
· delegate_to (str) – The account to impersonate, if any. For this to work, the service account making the request must have domain-wide delegation enabled.
· poll_sleep (int) – The time in seconds to sleep between polling Google Cloud Platform for the dataflow job status while the job is in the JOB_STATE_RUNNING state
execute(context)[source]
Execute the python dataflow job
class airflow.contrib.operators.dataproc_operator.DataprocClusterCreateOperator(cluster_name, project_id, num_workers, zone, network_uri=None, subnetwork_uri=None, internal_ip_only=None, tags=None, storage_bucket=None, init_actions_uris=None, init_action_timeout='10m', metadata=None, custom_image=None, image_version=None, properties=None, master_machine_type='n1-standard-4', master_disk_type='pd-standard', master_disk_size=500, worker_machine_type='n1-standard-4', worker_disk_type='pd-standard', worker_disk_size=500, num_preemptible_workers=0, labels=None, region='global', gcp_conn_id='google_cloud_default', delegate_to=None, service_account=None, service_account_scopes=None, idle_delete_ttl=None, auto_delete_time=None, auto_delete_ttl=None, *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Create a new cluster on Google Cloud Dataproc. The operator will wait until the creation is successful or an error occurs in the creation process.
The parameters allow you to configure the cluster. Please refer to
https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.clusters
for a detailed explanation on the different parameters. Most of the configuration parameters detailed in the link are available as a parameter to this operator.
Parameters
· cluster_name (str) – The name of the DataProc cluster to create (templated)
· project_id (str) – The ID of the google cloud project in which to create the cluster (templated)
· num_workers (int) – The # of workers to spin up If set to zero will spin up cluster in a single node mode
· storage_bucket (str) – The storage bucket to use setting to None lets dataproc generate a custom one for you
· init_actions_uris (list[string]) – List of GCS uri’s containing dataproc initialization scripts
· init_action_timeout (str) – Amount of time executable scripts in init_actions_uris has to complete
· metadata (dict) – dict of keyvalue google compute engine metadata entries to add to all instances
· image_version (str) – the version of software inside the Dataproc cluster
· custom_image (str) – custom Dataproc image, for more info see https://cloud.google.com/dataproc/docs/guides/dataproc-images
· properties (dict) – dict of properties to set on config files (e.g. spark-defaults.conf), see https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.clusters#SoftwareConfig
· master_machine_type (str) – Compute engine machine type to use for the master node
· master_disk_type (str) – Type of the boot disk for the master node (default is pd-standard). Valid values: pd-ssd (Persistent Disk Solid State Drive) or pd-standard (Persistent Disk Hard Disk Drive).
· master_disk_size (int) – Disk size for the master node
· worker_machine_type (str) – Compute engine machine type to use for the worker nodes
· worker_disk_type (str) – Type of the boot disk for the worker node (default is pd-standard). Valid values: pd-ssd (Persistent Disk Solid State Drive) or pd-standard (Persistent Disk Hard Disk Drive).
· worker_disk_size (int) – Disk size for the worker nodes
· num_preemptible_workers (int) – The # of preemptible worker nodes to spin up
· labels (dict) – dict of labels to add to the cluster
· zone (str) – The zone where the cluster will be located (templated)
· network_uri (str) – The network uri to be used for machine communication cannot be specified with subnetwork_uri
· subnetwork_uri (str) – The subnetwork uri to be used for machine communication cannot be specified with network_uri
· internal_ip_only (bool) – If true all instances in the cluster will only have internal IP addresses This can only be enabled for subnetwork enabled networks
· tags (list[string]) – The GCE tags to add to all instances
· region (str) – leave as global’ might become relevant in the future (templated)
· gcp_conn_id (str) – The connection ID to use connecting to Google Cloud Platform
· delegate_to (str) – The account to impersonate if any For this to work the service account making the request must have domainwide delegation enabled
· service_account (str) – The service account of the dataproc instances
· service_account_scopes (list[string]) – The URIs of service account scopes to be included
· idle_delete_ttl (int) – The longest duration that cluster would keep alive while staying idle Passing this threshold will cause cluster to be autodeleted A duration in seconds
· auto_delete_time (datetimedatetime) – The time when cluster will be autodeleted
· auto_delete_ttl (int) – The life duration of cluster the cluster will be autodeleted at the end of this duration A duration in seconds (If auto_delete_time is set this parameter will be ignored)
Type
custom_image: str
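A few of the cluster parameters above can be sketched onto the REST API's cluster config shape from the clusters reference; `build_cluster_config` is a hypothetical illustration, not the operator's internals:

```python
# Hypothetical sketch: map a handful of DataprocClusterCreateOperator
# parameters onto the Dataproc REST API's cluster config fields.
def build_cluster_config(master_machine_type='n1-standard-4',
                         master_disk_size=500,
                         worker_machine_type='n1-standard-4',
                         worker_disk_size=500,
                         num_workers=2):
    return {
        'masterConfig': {
            'numInstances': 1,
            'machineTypeUri': master_machine_type,
            'diskConfig': {'bootDiskSizeGb': master_disk_size},
        },
        'workerConfig': {
            'numInstances': num_workers,
            'machineTypeUri': worker_machine_type,
            'diskConfig': {'bootDiskSizeGb': worker_disk_size},
        },
    }

cluster_config = build_cluster_config(num_workers=4)
```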
class airflow.contrib.operators.dataproc_operator.DataprocClusterScaleOperator(cluster_name, project_id, region='global', gcp_conn_id='google_cloud_default', delegate_to=None, num_workers=2, num_preemptible_workers=0, graceful_decommission_timeout=None, *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Scale, up or down, a cluster on Google Cloud Dataproc. The operator will wait until the cluster is re-scaled.
Example
t1 = DataprocClusterScaleOperator(
    task_id='dataproc_scale',
    project_id='my-project',
    cluster_name='cluster-1',
    num_workers=10,
    num_preemptible_workers=10,
    graceful_decommission_timeout='1h',
    dag=dag)
See also
For more detail on scaling clusters have a look at the reference: https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/scaling-clusters
Parameters
· cluster_name (str) – The name of the cluster to scale (templated)
· project_id (str) – The ID of the google cloud project in which the cluster runs (templated)
· region (str) – The region for the dataproc cluster (templated)
· gcp_conn_id (str) – The connection ID to use connecting to Google Cloud Platform
· num_workers (int) – The new number of workers
· num_preemptible_workers (int) – The new number of preemptible workers
· graceful_decommission_timeout (str) – Timeout for graceful YARN decommissioning. Maximum value is 1d
· delegate_to (str) – The account to impersonate, if any. For this to work, the service account making the request must have domain-wide delegation enabled.
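The graceful_decommission_timeout string presumably has to be converted to a duration for the API call; a minimal parsing sketch (assumed helper, units s/m/h/d, maximum 1d, not the operator's actual code) might look like:

```python
# Sketch: parse timeout strings such as '1h' or '30s' into seconds,
# enforcing the documented 1d maximum. (Assumed behaviour, for illustration.)
def parse_timeout(timeout):
    units = {'s': 1, 'm': 60, 'h': 3600, 'd': 86400}
    value, unit = int(timeout[:-1]), timeout[-1]
    seconds = value * units[unit]
    if seconds > 86400:
        raise ValueError("maximum allowed graceful_decommission_timeout is 1d")
    return seconds
```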
class airflow.contrib.operators.dataproc_operator.DataprocClusterDeleteOperator(cluster_name, project_id, region='global', gcp_conn_id='google_cloud_default', delegate_to=None, *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Delete a cluster on Google Cloud Dataproc. The operator will wait until the cluster is destroyed.
Parameters
· cluster_name (str) – The name of the cluster to delete (templated)
· project_id (str) – The ID of the google cloud project in which the cluster runs (templated)
· region (str) – leave as global’ might become relevant in the future (templated)
· gcp_conn_id (str) – The connection ID to use connecting to Google Cloud Platform
· delegate_to (str) – The account to impersonate if any For this to work the service account making the request must have domainwide delegation enabled
class airflow.contrib.operators.dataproc_operator.DataProcPigOperator(query=None, query_uri=None, variables=None, job_name='{{task.task_id}}_{{ds_nodash}}', cluster_name='cluster-1', dataproc_pig_properties=None, dataproc_pig_jars=None, gcp_conn_id='google_cloud_default', delegate_to=None, region='global', job_error_states=['ERROR'], *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Start a Pig query Job on a Cloud DataProc cluster. The parameters of the operation will be passed to the cluster.
It’s a good practice to define dataproc_* parameters in the default_args of the dag, like the cluster name and UDFs.
default_args = {
    'cluster_name': 'cluster-1',
    'dataproc_pig_jars': [
        'gs://example/udf/jar/datafu/1.2.0/datafu.jar',
        'gs://example/udf/jar/gpig/1.2/gpig.jar'
    ]
}
You can pass a pig script as a string or file reference. Use variables to pass on variables for the pig script to be resolved on the cluster, or use the parameters to be resolved in the script as template parameters.
Example
t1 = DataProcPigOperator(
    task_id='dataproc_pig',
    query='a_pig_script.pig',
    variables={'out': 'gs://example/output/{{ds}}'},
    dag=dag)
See also
For more detail on job submission have a look at the reference: https://cloud.google.com/dataproc/reference/rest/v1/projects.regions.jobs
Parameters
· query (str) – The query or reference to the query file (.pg or .pig extension) (templated)
· query_uri (str) – The uri of a pig script on Cloud Storage
· variables (dict) – Map of named parameters for the query (templated)
· job_name (str) – The job name used in the DataProc cluster. This name by default is the task_id appended with the execution date, but can be templated. The name will always be appended with a random number to avoid name clashes. (templated)
· cluster_name (str) – The name of the DataProc cluster (templated)
· dataproc_pig_properties (dict) – Map for the Pig properties Ideal to put in default arguments
· dataproc_pig_jars (list) – URIs to jars provisioned in Cloud Storage (example for UDFs and libs) and are ideal to put in default arguments
· gcp_conn_id (str) – The connection ID to use connecting to Google Cloud Platform
· delegate_to (str) – The account to impersonate if any For this to work the service account making the request must have domainwide delegation enabled
· region (str) – The specified region where the dataproc cluster is created
· job_error_states (list) – Job states that should be considered error states Any states in this list will result in an error being raised and failure of the task Eg if the CANCELLED state should also be considered a task failure pass in ['ERROR' 'CANCELLED'] Possible values are currently only 'ERROR' and 'CANCELLED' but could change in the future Defaults to ['ERROR']
Variables
dataproc_job_id (str) – The actual jobId as submitted to the Dataproc API This is useful for identifying or linking to the job in the Google Cloud Console Dataproc UI as the actual jobId submitted to the Dataproc API is appended with an 8 character random string
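The naming behaviour described above, a job_name with an 8-character random string appended, can be sketched as follows; the exact suffix format is an assumption for illustration:

```python
import uuid

# Sketch of the documented jobId naming: the name submitted to the Dataproc
# API is the job_name plus an 8-character random suffix to avoid clashes.
def dataproc_job_id(job_name):
    return "{}_{}".format(job_name, uuid.uuid4().hex[:8])

job_id = dataproc_job_id('dataproc_pig_20210708')
```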
class airflow.contrib.operators.dataproc_operator.DataProcHiveOperator(query=None, query_uri=None, variables=None, job_name='{{task.task_id}}_{{ds_nodash}}', cluster_name='cluster-1', dataproc_hive_properties=None, dataproc_hive_jars=None, gcp_conn_id='google_cloud_default', delegate_to=None, region='global', job_error_states=['ERROR'], *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Start a Hive query Job on a Cloud DataProc cluster.
Parameters
· query (str) – The query or reference to the query file (q extension)
· query_uri (str) – The uri of a hive script on Cloud Storage
· variables (dict) – Map of named parameters for the query
· job_name (str) – The job name used in the DataProc cluster. This name by default is the task_id appended with the execution date, but can be templated. The name will always be appended with a random number to avoid name clashes.
· cluster_name (str) – The name of the DataProc cluster
· dataproc_hive_properties (dict) – Map for the Hive properties. Ideal to put in default arguments
· dataproc_hive_jars (list) – URIs to jars provisioned in Cloud Storage (example for UDFs and libs) and are ideal to put in default arguments
· gcp_conn_id (str) – The connection ID to use connecting to Google Cloud Platform
· delegate_to (str) – The account to impersonate if any For this to work the service account making the request must have domainwide delegation enabled
· region (str) – The specified region where the dataproc cluster is created
· job_error_states (list) – Job states that should be considered error states Any states in this list will result in an error being raised and failure of the task Eg if the CANCELLED state should also be considered a task failure pass in ['ERROR' 'CANCELLED'] Possible values are currently only 'ERROR' and 'CANCELLED' but could change in the future Defaults to ['ERROR']
Variables
dataproc_job_id (str) – The actual jobId as submitted to the Dataproc API This is useful for identifying or linking to the job in the Google Cloud Console Dataproc UI as the actual jobId submitted to the Dataproc API is appended with an 8 character random string
class airflow.contrib.operators.dataproc_operator.DataProcSparkSqlOperator(query=None, query_uri=None, variables=None, job_name='{{task.task_id}}_{{ds_nodash}}', cluster_name='cluster-1', dataproc_spark_properties=None, dataproc_spark_jars=None, gcp_conn_id='google_cloud_default', delegate_to=None, region='global', job_error_states=['ERROR'], *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Start a Spark SQL query Job on a Cloud DataProc cluster.
Parameters
· query (str) – The query or reference to the query file (q extension) (templated)
· query_uri (str) – The uri of a spark sql script on Cloud Storage
· variables (dict) – Map of named parameters for the query (templated)
· job_name (str) – The job name used in the DataProc cluster. This name by default is the task_id appended with the execution date, but can be templated. The name will always be appended with a random number to avoid name clashes. (templated)
· cluster_name (str) – The name of the DataProc cluster (templated)
· dataproc_spark_properties (dict) – Map for the Spark SQL properties. Ideal to put in default arguments
· dataproc_spark_jars (list) – URIs to jars provisioned in Cloud Storage (example for UDFs and libs) and are ideal to put in default arguments
· gcp_conn_id (str) – The connection ID to use connecting to Google Cloud Platform
· delegate_to (str) – The account to impersonate if any For this to work the service account making the request must have domainwide delegation enabled
· region (str) – The specified region where the dataproc cluster is created
· job_error_states (list) – Job states that should be considered error states Any states in this list will result in an error being raised and failure of the task Eg if the CANCELLED state should also be considered a task failure pass in ['ERROR' 'CANCELLED'] Possible values are currently only 'ERROR' and 'CANCELLED' but could change in the future Defaults to ['ERROR']
Variables
dataproc_job_id (str) – The actual jobId as submitted to the Dataproc API This is useful for identifying or linking to the job in the Google Cloud Console Dataproc UI as the actual jobId submitted to the Dataproc API is appended with an 8 character random string
class airflow.contrib.operators.dataproc_operator.DataProcSparkOperator(main_jar=None, main_class=None, arguments=None, archives=None, files=None, job_name='{{task.task_id}}_{{ds_nodash}}', cluster_name='cluster-1', dataproc_spark_properties=None, dataproc_spark_jars=None, gcp_conn_id='google_cloud_default', delegate_to=None, region='global', job_error_states=['ERROR'], *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Start a Spark Job on a Cloud DataProc cluster.
Parameters
· main_jar (str) – URI of the job jar provisioned on Cloud Storage (use this or the main_class not both together)
· main_class (str) – Name of the job class (use this or the main_jar not both together)
· arguments (list) – Arguments for the job (templated)
· archives (list) – List of archived files that will be unpacked in the work directory Should be stored in Cloud Storage
· files (list) – List of files to be copied to the working directory
· job_name (str) – The job name used in the DataProc cluster. This name by default is the task_id appended with the execution date, but can be templated. The name will always be appended with a random number to avoid name clashes. (templated)
· cluster_name (str) – The name of the DataProc cluster (templated)
· dataproc_spark_properties (dict) – Map for the Spark properties. Ideal to put in default arguments
· dataproc_spark_jars (list) – URIs to jars provisioned in Cloud Storage (example for UDFs and libs) and are ideal to put in default arguments
· gcp_conn_id (str) – The connection ID to use connecting to Google Cloud Platform
· delegate_to (str) – The account to impersonate if any For this to work the service account making the request must have domainwide delegation enabled
· region (str) – The specified region where the dataproc cluster is created
· job_error_states (list) – Job states that should be considered error states Any states in this list will result in an error being raised and failure of the task Eg if the CANCELLED state should also be considered a task failure pass in ['ERROR' 'CANCELLED'] Possible values are currently only 'ERROR' and 'CANCELLED' but could change in the future Defaults to ['ERROR']
Variables
dataproc_job_id (str) – The actual jobId as submitted to the Dataproc API This is useful for identifying or linking to the job in the Google Cloud Console Dataproc UI as the actual jobId submitted to the Dataproc API is appended with an 8 character random string
class airflow.contrib.operators.dataproc_operator.DataProcHadoopOperator(main_jar=None, main_class=None, arguments=None, archives=None, files=None, job_name='{{task.task_id}}_{{ds_nodash}}', cluster_name='cluster-1', dataproc_hadoop_properties=None, dataproc_hadoop_jars=None, gcp_conn_id='google_cloud_default', delegate_to=None, region='global', job_error_states=['ERROR'], *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Start a Hadoop Job on a Cloud DataProc cluster.
Parameters
· main_jar (str) – URI of the job jar provisioned on Cloud Storage (use this or the main_class not both together)
· main_class (str) – Name of the job class (use this or the main_jar not both together)
· arguments (list) – Arguments for the job (templated)
· archives (list) – List of archived files that will be unpacked in the work directory Should be stored in Cloud Storage
· files (list) – List of files to be copied to the working directory
· job_name (str) – The job name used in the DataProc cluster. This name by default is the task_id appended with the execution date, but can be templated. The name will always be appended with a random number to avoid name clashes. (templated)
· cluster_name (str) – The name of the DataProc cluster (templated)
· dataproc_hadoop_properties (dict) – Map for the Hadoop properties. Ideal to put in default arguments
· dataproc_hadoop_jars (list) – URIs to jars provisioned in Cloud Storage (example for UDFs and libs) and are ideal to put in default arguments
· gcp_conn_id (str) – The connection ID to use connecting to Google Cloud Platform
· delegate_to (str) – The account to impersonate if any For this to work the service account making the request must have domainwide delegation enabled
· region (str) – The specified region where the dataproc cluster is created
· job_error_states (list) – Job states that should be considered error states Any states in this list will result in an error being raised and failure of the task Eg if the CANCELLED state should also be considered a task failure pass in ['ERROR' 'CANCELLED'] Possible values are currently only 'ERROR' and 'CANCELLED' but could change in the future Defaults to ['ERROR']
Variables
dataproc_job_id (str) – The actual jobId as submitted to the Dataproc API This is useful for identifying or linking to the job in the Google Cloud Console Dataproc UI as the actual jobId submitted to the Dataproc API is appended with an 8 character random string
class airflow.contrib.operators.dataproc_operator.DataProcPySparkOperator(main, arguments=None, archives=None, pyfiles=None, files=None, job_name='{{task.task_id}}_{{ds_nodash}}', cluster_name='cluster-1', dataproc_pyspark_properties=None, dataproc_pyspark_jars=None, gcp_conn_id='google_cloud_default', delegate_to=None, region='global', job_error_states=['ERROR'], *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Start a PySpark Job on a Cloud DataProc cluster.
Parameters
· main (str) – [Required] The Hadoop Compatible Filesystem (HCFS) URI of the main Python file to use as the driver. Must be a .py file.
· arguments (list) – Arguments for the job (templated)
· archives (list) – List of archived files that will be unpacked in the work directory Should be stored in Cloud Storage
· files (list) – List of files to be copied to the working directory
· pyfiles (list) – List of Python files to pass to the PySpark framework. Supported file types: .py, .egg, and .zip
· job_name (str) – The job name used in the DataProc cluster. This name by default is the task_id appended with the execution date, but can be templated. The name will always be appended with a random number to avoid name clashes. (templated)
· cluster_name (str) – The name of the DataProc cluster
· dataproc_pyspark_properties (dict) – Map for the PySpark properties. Ideal to put in default arguments
· dataproc_pyspark_jars (list) – URIs to jars provisioned in Cloud Storage (example for UDFs and libs) and are ideal to put in default arguments
· gcp_conn_id (str) – The connection ID to use connecting to Google Cloud Platform
· delegate_to (str) – The account to impersonate if any For this to work the service account making the request must have domainwide delegation enabled
· region (str) – The specified region where the dataproc cluster is created
· job_error_states (list) – Job states that should be considered error states Any states in this list will result in an error being raised and failure of the task Eg if the CANCELLED state should also be considered a task failure pass in ['ERROR' 'CANCELLED'] Possible values are currently only 'ERROR' and 'CANCELLED' but could change in the future Defaults to ['ERROR']
Variables
dataproc_job_id (str) – The actual jobId as submitted to the Dataproc API. This is useful for identifying or linking to the job in the Google Cloud Console Dataproc UI, as the actual jobId submitted to the Dataproc API is appended with an 8 character random string.
class airflow.contrib.operators.dataproc_operator.DataprocWorkflowTemplateBaseOperator(project_id, region='global', gcp_conn_id='google_cloud_default', delegate_to=None, *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
class airflow.contrib.operators.dataproc_operator.DataprocWorkflowTemplateInstantiateOperator(template_id, *args, **kwargs)[source]
Bases: airflow.contrib.operators.dataproc_operator.DataprocWorkflowTemplateBaseOperator
Instantiate a WorkflowTemplate on Google Cloud Dataproc. The operator will wait until the WorkflowTemplate is finished executing.
See also
Please refer to https://cloud.google.com/dataproc/docs/reference/rest/v1beta2/projects.regions.workflowTemplates/instantiate
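As a concrete illustration of the `dataproc_pyspark_properties` parameter above, the following sketch builds the kind of Spark properties map the operator passes through to the Dataproc API. The keys are standard Spark configuration names; the values here are illustrative, not recommendations.

```python
# Illustrative Spark properties map for dataproc_pyspark_properties.
# Keys are standard Spark configuration names; values are example
# settings only.
dataproc_pyspark_properties = {
    "spark.executor.memory": "4g",
    "spark.executor.cores": "2",
    "spark.dynamicAllocation.enabled": "false",
}

# The dict would then be passed unchanged to the operator, e.g.:
# DataProcPySparkOperator(..., dataproc_pyspark_properties=dataproc_pyspark_properties, ...)
```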
Parameters
· template_id (str) – The id of the template. (templated)
· project_id (str) – The ID of the google cloud project in which the template runs.
· region (str) – leave as 'global'; might become relevant in the future.
· gcp_conn_id (str) – The connection ID to use connecting to Google Cloud Platform.
· delegate_to (str) – The account to impersonate, if any. For this to work, the service account making the request must have domain-wide delegation enabled.
class airflow.contrib.operators.dataproc_operator.DataprocWorkflowTemplateInstantiateInlineOperator(template, *args, **kwargs)[source]
Bases: airflow.contrib.operators.dataproc_operator.DataprocWorkflowTemplateBaseOperator
Instantiate a WorkflowTemplate Inline on Google Cloud Dataproc. The operator will wait until the WorkflowTemplate is finished executing.
See also
Please refer to https://cloud.google.com/dataproc/docs/reference/rest/v1beta2/projects.regions.workflowTemplates/instantiateInline
Parameters
· template (map) – The template contents. (templated)
· project_id (str) – The ID of the google cloud project in which the template runs.
· region (str) – leave as 'global'; might become relevant in the future.
· gcp_conn_id (str) – The connection ID to use connecting to Google Cloud Platform.
· delegate_to (str) – The account to impersonate, if any. For this to work, the service account making the request must have domain-wide delegation enabled.
class airflow.contrib.operators.datastore_export_operator.DatastoreExportOperator(bucket, namespace=None, datastore_conn_id='google_cloud_default', cloud_storage_conn_id='google_cloud_default', delegate_to=None, entity_filter=None, labels=None, polling_interval_in_seconds=10, overwrite_existing=False, xcom_push=False, *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Export entities from Google Cloud Datastore to Cloud Storage.
Parameters
· bucket (str) – name of the cloud storage bucket to backup data
· namespace (str) – optional namespace path in the specified Cloud Storage bucket to backup data. If this namespace does not exist in GCS, it will be created.
· datastore_conn_id (str) – the name of the Datastore connection id to use
· cloud_storage_conn_id (str) – the name of the cloud storage connection id to force-write backup
· delegate_to (str) – The account to impersonate, if any. For this to work, the service account making the request must have domain-wide delegation enabled.
· entity_filter (dict) – description of what data from the project is included in the export; refer to https://cloud.google.com/datastore/docs/reference/rest/Shared.Types/EntityFilter
· labels (dict) – client-assigned labels for cloud storage
· polling_interval_in_seconds (int) – number of seconds to wait before polling for execution status again
· overwrite_existing (bool) – if the storage bucket + namespace is not empty, it will be emptied prior to exports. This enables overwriting existing backups.
· xcom_push (bool) – push operation name to xcom for reference
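The `entity_filter` parameter takes the same JSON shape as the Datastore EntityFilter REST type linked above. A minimal sketch, with placeholder kind names; an empty `kinds` or `namespaceIds` list means "all", and an empty string in `namespaceIds` selects the default namespace:

```python
# Illustrative EntityFilter payload for DatastoreExportOperator's
# entity_filter parameter. Kind names are placeholders.
entity_filter = {
    "kinds": ["Customer", "Order"],  # export only these kinds
    "namespaceIds": [""],            # "" selects the default namespace
}
```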
class airflow.contrib.operators.datastore_import_operator.DatastoreImportOperator(bucket, file, namespace=None, entity_filter=None, labels=None, datastore_conn_id='google_cloud_default', delegate_to=None, polling_interval_in_seconds=10, xcom_push=False, *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Import entities from Cloud Storage to Google Cloud Datastore.
Parameters
· bucket (str) – container in Cloud Storage to store data
· file (str) – path of the backup metadata file in the specified Cloud Storage bucket. It should have the extension .overall_export_metadata
· namespace (str) – optional namespace of the backup metadata file in the specified Cloud Storage bucket
· entity_filter (dict) – description of what data from the project is included in the export; refer to https://cloud.google.com/datastore/docs/reference/rest/Shared.Types/EntityFilter
· labels (dict) – client-assigned labels for cloud storage
· datastore_conn_id (str) – the name of the connection id to use
· delegate_to (str) – The account to impersonate, if any. For this to work, the service account making the request must have domain-wide delegation enabled.
· polling_interval_in_seconds (int) – number of seconds to wait before polling for execution status again
· xcom_push (bool) – push operation name to xcom for reference
class airflow.contrib.operators.discord_webhook_operator.DiscordWebhookOperator(http_conn_id=None, webhook_endpoint=None, message='', username=None, avatar_url=None, tts=False, proxy=None, *args, **kwargs)[source]
Bases: airflow.operators.http_operator.SimpleHttpOperator
This operator allows you to post messages to Discord using incoming webhooks. Takes a Discord connection ID with a default relative webhook endpoint. The default endpoint can be overridden using the webhook_endpoint parameter (https://discordapp.com/developers/docs/resources/webhook).
Each Discord webhook can be pre-configured to use a specific username and avatar_url. You can override these defaults in this operator.
Parameters
· http_conn_id (str) – Http connection ID with host as "https://discord.com/api/" and default webhook endpoint in the extra field in the form of {"webhook_endpoint": "webhooks/{webhook.id}/{webhook.token}"}
· webhook_endpoint (str) – Discord webhook endpoint in the form of "webhooks/{webhook.id}/{webhook.token}"
· message (str) – The message you want to send to your Discord channel (max 2000 characters). (templated)
· username (str) – Override the default username of the webhook. (templated)
· avatar_url (str) – Override the default avatar of the webhook
· tts (bool) – Is a text-to-speech message
· proxy (str) – Proxy to use to make the Discord webhook call
execute(context)[source]
Call the DiscordWebhookHook to post message
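The `http_conn_id` description above says the default endpoint lives in the connection's extra field as JSON. A minimal sketch of building that extra value; the webhook id and token are placeholders:

```python
import json

# Illustrative "extra" JSON for the Discord HTTP connection. When the
# webhook_endpoint parameter is not passed to DiscordWebhookOperator,
# the hook falls back to this value from the connection. The id/token
# segments below are placeholders.
extra = json.dumps({"webhook_endpoint": "webhooks/11111/example-token"})
```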
class airflow.contrib.operators.druid_operator.DruidOperator(json_index_file, druid_ingest_conn_id='druid_ingest_default', max_ingestion_time=None, *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Allows submitting an indexing task directly to Druid.
Parameters
· json_index_file (str) – The filepath to the druid index specification
· druid_ingest_conn_id (str) – The connection id of the Druid overlord which accepts index jobs
class airflow.contrib.operators.ecs_operator.ECSOperator(task_definition, cluster, overrides, aws_conn_id=None, region_name=None, launch_type='EC2', group=None, placement_constraints=None, platform_version='LATEST', network_configuration=None, **kwargs)[source]
Bases: airflow.models.BaseOperator
Execute a task on AWS EC2 Container Service.
Parameters
· task_definition (str) – the task definition name on EC2 Container Service
· cluster (str) – the cluster name on EC2 Container Service
· overrides (dict) – the same parameter that boto3 will receive (templated): http://boto3.readthedocs.org/en/latest/reference/services/ecs.html#ECS.Client.run_task
· aws_conn_id (str) – connection id of AWS credentials / region name. If None, the default boto3 credential strategy will be used (http://boto3.readthedocs.io/en/latest/guide/configuration.html).
· region_name (str) – region name to use in AWS Hook. Overrides the region_name in connection (if provided)
· launch_type (str) – the launch type on which to run your task ('EC2' or 'FARGATE')
· group (str) – the name of the task group associated with the task
· placement_constraints (list) – an array of placement constraint objects to use for the task
· platform_version (str) – the platform version on which your task is running
· network_configuration (dict) – the network configuration for the task
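Since `overrides` is passed straight through to boto3's `run_task`, it uses the same structure as that API. A minimal sketch; the container name and command are placeholders, and the container name must match one defined in the task definition:

```python
# Illustrative boto3-style overrides for ECSOperator, matching the
# ecs run_task shape linked above. Name/command/env are placeholders.
overrides = {
    "containerOverrides": [
        {
            "name": "my-container",           # must match the task definition
            "command": ["python", "job.py"],  # replaces the image's default command
            "environment": [
                {"name": "RUN_DATE", "value": "{{ ds }}"},  # Airflow template string
            ],
        }
    ]
}
```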
class airflow.contrib.operators.emr_add_steps_operator.EmrAddStepsOperator(job_flow_id, aws_conn_id='s3_default', steps=None, *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
An operator that adds steps to an existing EMR job_flow.
Parameters
· job_flow_id (str) – id of the JobFlow to add steps to. (templated)
· aws_conn_id (str) – aws connection to use
· steps (list) – boto3 style steps to be added to the jobflow. (templated)
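"boto3 style steps" means the same list-of-dicts structure that boto3's EMR `add_job_flow_steps` accepts. A minimal sketch; the step name, jar, and arguments are placeholders:

```python
# Illustrative boto3-style steps list for EmrAddStepsOperator.
# Name, jar, and args are placeholders for a Spark example job.
steps = [
    {
        "Name": "calculate_pi",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["/usr/lib/spark/bin/run-example", "SparkPi", "10"],
        },
    }
]
```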
class airflow.contrib.operators.emr_create_job_flow_operator.EmrCreateJobFlowOperator(aws_conn_id='s3_default', emr_conn_id='emr_default', job_flow_overrides=None, *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Creates an EMR JobFlow, reading the config from the EMR connection. A dictionary of JobFlow overrides can be passed that override the config from the connection.
Parameters
· aws_conn_id (str) – aws connection to use
· emr_conn_id (str) – emr connection to use
· job_flow_overrides – boto3 style arguments to override emr_connection extra. (templated)
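The overrides dict uses the same keyword structure as boto3's EMR `run_job_flow`; anything given here wins over the config stored on the EMR connection. A minimal sketch with placeholder names and instance settings:

```python
# Illustrative job_flow_overrides for EmrCreateJobFlowOperator, in the
# boto3 run_job_flow keyword shape. All values are placeholders.
job_flow_overrides = {
    "Name": "nightly-etl",
    "ReleaseLabel": "emr-5.29.0",
    "Instances": {
        "InstanceGroups": [
            {
                "Name": "Master node",
                "InstanceRole": "MASTER",
                "InstanceType": "m5.xlarge",
                "InstanceCount": 1,
            }
        ],
        "KeepJobFlowAliveWhenNoSteps": False,  # terminate when steps finish
        "TerminationProtected": False,
    },
}
```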
class airflow.contrib.operators.emr_terminate_job_flow_operator.EmrTerminateJobFlowOperator(job_flow_id, aws_conn_id='s3_default', *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Operator to terminate EMR JobFlows.
Parameters
· job_flow_id (str) – id of the JobFlow to terminate. (templated)
· aws_conn_id (str) – aws connection to use
class airflow.contrib.operators.file_to_gcs.FileToGoogleCloudStorageOperator(src, dst, bucket, google_cloud_storage_conn_id='google_cloud_default', mime_type='application/octet-stream', delegate_to=None, gzip=False, *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Uploads a file to Google Cloud Storage. Optionally can compress the file for upload.
Parameters
· src (str) – Path to the local file (templated)
· dst (str) – Destination path within the specified bucket (templated)
· bucket (str) – The bucket to upload to (templated)
· google_cloud_storage_conn_id (str) – The Airflow connection ID to upload with
· mime_type (str) – The mimetype string
· delegate_to (str) – The account to impersonate if any
· gzip (bool) – Allows for file to be compressed and uploaded as gzip
execute(context)[source]
Uploads the file to Google cloud storage
class airflow.contrib.operators.gcs_download_operator.GoogleCloudStorageDownloadOperator(bucket, object, filename=None, store_to_xcom_key=None, google_cloud_storage_conn_id='google_cloud_default', delegate_to=None, *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Downloads a file from Google Cloud Storage.
Parameters
· bucket (str) – The Google cloud storage bucket where the object is (templated)
· object (str) – The name of the object to download in the Google cloud storage bucket (templated)
· filename (str) – The file path on the local file system (where the operator is being executed) that the file should be downloaded to (templated) If no filename passed the downloaded data will not be stored on the local file system
· store_to_xcom_key (str) – If this param is set the operator will push the contents of the downloaded file to XCom with the key set in this parameter If not set the downloaded data will not be pushed to XCom (templated)
· google_cloud_storage_conn_id (str) – The connection ID to use when connecting to Google cloud storage
· delegate_to (str) – The account to impersonate, if any. For this to work, the service account making the request must have domain-wide delegation enabled.
class airflow.contrib.operators.gcs_list_operator.GoogleCloudStorageListOperator(bucket, prefix=None, delimiter=None, google_cloud_storage_conn_id='google_cloud_default', delegate_to=None, *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
List all objects from the bucket with the given string prefix and delimiter in name.
This operator returns a python list with the names of objects, which can be used by xcom in the downstream task.
Parameters
· bucket (str) – The Google cloud storage bucket to find the objects (templated)
· prefix (str) – Prefix string which filters objects whose name begin with this prefix (templated)
· delimiter (str) – The delimiter by which you want to filter the objects. (templated) For example, to list the CSV files in a directory in GCS you would use delimiter='.csv'.
· google_cloud_storage_conn_id (str) – The connection ID to use when connecting to Google cloud storage.
· delegate_to (str) – The account to impersonate, if any. For this to work, the service account making the request must have domain-wide delegation enabled.
Example
The following Operator would list all the Avro files from the sales/sales-2017 folder in the data bucket:
GCS_Files = GoogleCloudStorageListOperator(
    task_id='GCS_Files',
    bucket='data',
    prefix='sales/sales-2017/',
    delimiter='.avro',
    google_cloud_storage_conn_id=google_cloud_conn_id
)
class airflow.contrib.operators.gcs_operator.GoogleCloudStorageCreateBucketOperator(bucket_name, storage_class='MULTI_REGIONAL', location='US', project_id=None, labels=None, google_cloud_storage_conn_id='google_cloud_default', delegate_to=None, *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Creates a new bucket. Google Cloud Storage uses a flat namespace, so you can't create a bucket with a name that is already in use.
See also
For more information, see Bucket Naming Guidelines: https://cloud.google.com/storage/docs/bucketnaming.html#requirements
Parameters
· bucket_name (str) – The name of the bucket (templated)
· storage_class (str) –
This defines how objects in the bucket are stored and determines the SLA and the cost of storage (templated) Values include
o MULTI_REGIONAL
o REGIONAL
o STANDARD
o NEARLINE
o COLDLINE
If this value is not specified when the bucket is created it will default to STANDARD
· location (str) –
The location of the bucket (templated) Object data for objects in the bucket resides in physical storage within this region Defaults to US
See also
https://developers.google.com/storage/docs/bucketlocations
· project_id (str) – The ID of the GCP Project (templated)
· labels (dict) – Userprovided labels in keyvalue pairs
· google_cloud_storage_conn_id (str) – The connection ID to use when connecting to Google cloud storage
· delegate_to (str) – The account to impersonate, if any. For this to work, the service account making the request must have domain-wide delegation enabled.
Example
The following Operator would create a new bucket test-bucket with MULTI_REGIONAL storage class in the EU region:
CreateBucket = GoogleCloudStorageCreateBucketOperator(
    task_id='CreateNewBucket',
    bucket_name='test-bucket',
    storage_class='MULTI_REGIONAL',
    location='EU',
    labels={'env': 'dev', 'team': 'airflow'},
    google_cloud_storage_conn_id='airflow-service-account'
)
class airflow.contrib.operators.gcs_to_bq.GoogleCloudStorageToBigQueryOperator(bucket, source_objects, destination_project_dataset_table, schema_fields=None, schema_object=None, source_format='CSV', compression='NONE', create_disposition='CREATE_IF_NEEDED', skip_leading_rows=0, write_disposition='WRITE_EMPTY', field_delimiter=',', max_bad_records=0, quote_character=None, ignore_unknown_values=False, allow_quoted_newlines=False, allow_jagged_rows=False, max_id_key=None, bigquery_conn_id='bigquery_default', google_cloud_storage_conn_id='google_cloud_default', delegate_to=None, schema_update_options=(), src_fmt_configs=None, external_table=False, time_partitioning=None, cluster_fields=None, autodetect=False, *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Loads files from Google cloud storage into BigQuery.
The schema to be used for the BigQuery table may be specified in one of two ways. You may either directly pass the schema fields in, or you may point the operator to a Google cloud storage object name. The object in Google cloud storage must be a JSON file with the schema fields in it.
Parameters
· bucket (str) – The bucket to load from (templated)
· source_objects (list of str) – List of Google cloud storage URIs to load from. (templated) If source_format is 'DATASTORE_BACKUP', the list must only contain a single URI.
· destination_project_dataset_table (str) – The dotted (<project>.)<dataset>.<table> BigQuery table to load data into. If <project> is not included, project will be the project defined in the connection json. (templated)
· schema_fields (list) – If set, the schema field list as defined here: https://cloud.google.com/bigquery/docs/reference/v2/jobs#configuration.load Should not be set when source_format is 'DATASTORE_BACKUP'.
· schema_object (str) – If set a GCS object path pointing to a json file that contains the schema for the table (templated)
· source_format (str) – File format to export
· compression (str) – [Optional] The compression type of the data source Possible values include GZIP and NONE The default value is NONE This setting is ignored for Google Cloud Bigtable Google Cloud Datastore backups and Avro formats
· create_disposition (str) – The create disposition if the table doesn’t exist
· skip_leading_rows (int) – Number of rows to skip when loading from a CSV
· write_disposition (str) – The write disposition if the table already exists
· field_delimiter (str) – The delimiter to use when loading from a CSV
· max_bad_records (int) – The maximum number of bad records that BigQuery can ignore when running the job
· quote_character (str) – The value that is used to quote data sections in a CSV file
· ignore_unknown_values (bool) – [Optional] Indicates if BigQuery should allow extra values that are not represented in the table schema If true the extra values are ignored If false records with extra columns are treated as bad records and if there are too many bad records an invalid error is returned in the job result
· allow_quoted_newlines (bool) – Whether to allow quoted newlines (true) or not (false)
· allow_jagged_rows (bool) – Accept rows that are missing trailing optional columns The missing values are treated as nulls If false records with missing trailing columns are treated as bad records and if there are too many bad records an invalid error is returned in the job result Only applicable to CSV ignored for other formats
· max_id_key (str) – If set, the name of a column in the BigQuery table that's to be loaded. This will be used to select the MAX value from BigQuery after the load occurs. The results will be returned by the execute() command, which in turn gets stored in XCom for future operators to use. This can be helpful with incremental loads: during future executions, you can pick up from the max ID.
· bigquery_conn_id (str) – Reference to a specific BigQuery hook
· google_cloud_storage_conn_id (str) – Reference to a specific Google cloud storage hook
· delegate_to (str) – The account to impersonate, if any. For this to work, the service account making the request must have domain-wide delegation enabled.
· schema_update_options (list) – Allows the schema of the destination table to be updated as a side effect of the load job
· src_fmt_configs (dict) – configure optional fields specific to the source format
· external_table (bool) – Flag to specify if the destination table should be a BigQuery external table Default Value is False
· time_partitioning (dict) – configure optional time partitioning fields, i.e. partition by field, type and expiration as per API specifications. Note that 'field' is not available in concurrency with dataset.table$partition.
· cluster_fields (list of str) – Request that the result of this load be stored sorted by one or more columns This is only available in conjunction with time_partitioning The order of columns given determines the sort order Not applicable for external tables
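The `schema_fields` parameter above follows the BigQuery load-job schema format linked in its description: a list of field dicts with name, type, and mode. A minimal sketch for a three-column CSV; the field names are placeholders:

```python
# Illustrative schema_fields list for GoogleCloudStorageToBigQueryOperator,
# in the BigQuery load-job schema shape. Field names are placeholders.
schema_fields = [
    {"name": "id", "type": "INTEGER", "mode": "REQUIRED"},
    {"name": "name", "type": "STRING", "mode": "NULLABLE"},
    {"name": "created_at", "type": "TIMESTAMP", "mode": "NULLABLE"},
]
```

The same list, saved as a JSON file in GCS, is what `schema_object` would point to instead.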
class airflow.contrib.operators.gcs_to_gcs.GoogleCloudStorageToGoogleCloudStorageOperator(source_bucket, source_object, destination_bucket=None, destination_object=None, move_object=False, google_cloud_storage_conn_id='google_cloud_default', delegate_to=None, last_modified_time=None, *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Copies objects from a bucket to another, with renaming if requested.
Parameters
· source_bucket (str) – The source Google cloud storage bucket where the object is (templated)
· source_object (str) – The source name of the object to copy in the Google cloud storage bucket (templated) You can use only one wildcard for objects (filenames) within your bucket The wildcard can appear inside the object name or at the end of the object name Appending a wildcard to the bucket name is unsupported
· destination_bucket (str) – The destination Google cloud storage bucket where the object should be (templated)
· destination_object (str) – The destination name of the object in the destination Google cloud storage bucket. (templated) If a wildcard is supplied in the source_object argument, this is the prefix that will be prepended to the final destination objects' paths. Note that the source path's part before the wildcard will be removed; if it needs to be retained, it should be appended to destination_object. For example, with prefix foo/* and destination_object 'blah/', the file foo/baz will be copied to blah/baz; to retain the prefix, write the destination_object as e.g. blah/foo, in which case the copied file will be named blah/foo/baz.
· move_object (bool) – When move object is True the object is moved instead of copied to the new location This is the equivalent of a mv command as opposed to a cp command
· google_cloud_storage_conn_id (str) – The connection ID to use when connecting to Google cloud storage
· delegate_to (str) – The account to impersonate, if any. For this to work, the service account making the request must have domain-wide delegation enabled.
· last_modified_time (datetime) – When specified if the object(s) were modified after last_modified_time they will be copiedmoved If tzinfo has not been set UTC will be assumed
Examples
The following Operator would copy a single file named sales/sales-2017/january.avro in the data bucket to the file named copied_sales/2017/january-backup.avro in the data_backup bucket:
copy_single_file = GoogleCloudStorageToGoogleCloudStorageOperator(
    task_id='copy_single_file',
    source_bucket='data',
    source_object='sales/sales-2017/january.avro',
    destination_bucket='data_backup',
    destination_object='copied_sales/2017/january-backup.avro',
    google_cloud_storage_conn_id=google_cloud_conn_id
)
The following Operator would copy all the Avro files from the sales/sales-2017 folder (i.e. with names starting with that prefix) in the data bucket to the copied_sales/2017 folder in the data_backup bucket:
copy_files = GoogleCloudStorageToGoogleCloudStorageOperator(
    task_id='copy_files',
    source_bucket='data',
    source_object='sales/sales-2017/*.avro',
    destination_bucket='data_backup',
    destination_object='copied_sales/2017/',
    google_cloud_storage_conn_id=google_cloud_conn_id
)
The following Operator would move all the Avro files from the sales/sales-2017 folder (i.e. with names starting with that prefix) in the data bucket to the same folder in the data_backup bucket, deleting the original files in the process:
move_files = GoogleCloudStorageToGoogleCloudStorageOperator(
    task_id='move_files',
    source_bucket='data',
    source_object='sales/sales-2017/*.avro',
    destination_bucket='data_backup',
    move_object=True,
    google_cloud_storage_conn_id=google_cloud_conn_id
)
class airflow.contrib.operators.gcs_to_s3.GoogleCloudStorageToS3Operator(bucket, prefix=None, delimiter=None, google_cloud_storage_conn_id='google_cloud_storage_default', delegate_to=None, dest_aws_conn_id=None, dest_s3_key=None, dest_verify=None, replace=False, *args, **kwargs)[source]
Bases: airflow.contrib.operators.gcs_list_operator.GoogleCloudStorageListOperator
Synchronizes a Google Cloud Storage bucket with an S3 bucket.
Parameters
· bucket (str) – The Google Cloud Storage bucket to find the objects (templated)
· prefix (str) – Prefix string which filters objects whose name begin with this prefix (templated)
· delimiter (str) – The delimiter by which you want to filter the objects. (templated) For example, to list the CSV files in a directory in GCS you would use delimiter='.csv'.
· google_cloud_storage_conn_id (str) – The connection ID to use when connecting to Google Cloud Storage.
· delegate_to (str) – The account to impersonate, if any. For this to work, the service account making the request must have domain-wide delegation enabled.
· dest_aws_conn_id (str) – The destination S3 connection
· dest_s3_key (str) – The base S3 key to be used to store the files (templated)
· dest_verify – Whether or not to verify SSL certificates for the S3 connection. By default SSL certificates are verified. You can provide the following values:
o False: do not validate SSL certificates. SSL will still be used (unless use_ssl is False), but SSL certificates will not be verified.
o path/to/cert/bundle.pem: A filename of the CA cert bundle to use. You can specify this argument if you want to use a different CA cert bundle than the one used by botocore.
class airflow.contrib.operators.hipchat_operator.HipChatAPIOperator(token, base_url='https://api.hipchat.com/v2', *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Base HipChat Operator. All derived HipChat operators reference HipChat's official REST API documentation at https://www.hipchat.com/docs/apiv2. Before using any HipChat API operators you need to get an authentication token at https://www.hipchat.com/docs/apiv2/auth. In the future, additional HipChat operators will be derived from this class as well.
Parameters
· token (str) – HipChat REST API authentication token
· base_url (str) – HipChat REST API base url
prepare_request()[source]
Used by the execute function. Set the request method, url, and body of HipChat's REST API call. Override in child class. Each HipChatAPI child operator is responsible for having a prepare_request method call which sets self.method, self.url, and self.body.
class airflow.contrib.operators.hipchat_operator.HipChatAPISendRoomNotificationOperator(room_id, message, message_format='html', color='yellow', frm='airflow', attach_to=None, notify=False, card=None, *args, **kwargs)[source]
Bases: airflow.contrib.operators.hipchat_operator.HipChatAPIOperator
Send a notification to a specific HipChat room. More info: https://www.hipchat.com/docs/apiv2/method/send_room_notification
Parameters
· room_id (str) – Room in which to send notification on HipChat (templated)
· message (str) – The message body (templated)
· frm (str) – Label to be shown in addition to sender’s name
· message_format (str) – How the notification is rendered: html or text
· color (str) – Background color of the msg: yellow, green, red, purple, gray, or random
· attach_to (str) – The message id to attach this notification to
· notify (bool) – Whether this message should trigger a user notification
· card (dict) – HipChatdefined card object
prepare_request()[source]
Used by the execute function. Set the request method, url, and body of HipChat's REST API call. Override in child class. Each HipChatAPI child operator is responsible for having a prepare_request method call which sets self.method, self.url, and self.body.
class airflow.contrib.operators.hive_to_dynamodb.HiveToDynamoDBTransferOperator(sql, table_name, table_keys, pre_process=None, pre_process_args=None, pre_process_kwargs=None, region_name=None, schema='default', hiveserver2_conn_id='hiveserver2_default', aws_conn_id='aws_default', *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Moves data from Hive to DynamoDB. Note that for now the data is loaded into memory before being pushed to DynamoDB, so this operator should be used for smallish amounts of data.
Parameters
· sql (str) – SQL query to execute against the hive database (templated)
· table_name (str) – target DynamoDB table
· table_keys (list) – partition key and sort key
· pre_process (function) – implement preprocessing of source data
· pre_process_args (list) – list of pre_process function arguments
· pre_process_kwargs (dict) – dict of pre_process function arguments
· region_name (str) – aws region name (example: us-east-1)
· schema (str) – hive database schema
· hiveserver2_conn_id (str) – source hive connection
· aws_conn_id (str) – aws connection
class airflow.contrib.operators.mlengine_operator.MLEngineBatchPredictionOperator(project_id, job_id, region, data_format, input_paths, output_path, model_name=None, version_name=None, uri=None, max_worker_count=None, runtime_version=None, gcp_conn_id='google_cloud_default', delegate_to=None, *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Start a Google Cloud ML Engine prediction job.
NOTE: For model origin, users should consider exactly one from the three options below:
1. Populate the 'uri' field only, which should be a GCS location that points to a tensorflow savedModel directory.
2. Populate the 'model_name' field only, which refers to an existing model; the default version of the model will be used.
3. Populate both 'model_name' and 'version_name' fields, which refers to a specific version of a specific model.
In options 2 and 3, both model and version name should contain the minimal identifier. For instance, call
MLEngineBatchPredictionOperator(
    ...,
    model_name='my_model',
    version_name='my_version',
    ...)
if the desired model version is projects/my_project/models/my_model/versions/my_version.
See https://cloud.google.com/ml-engine/reference/rest/v1/projects.jobs for further documentation on the parameters.
Parameters
· project_id (str) – The Google Cloud project name where the prediction job is submitted (templated)
· job_id (str) – A unique id for the prediction job on Google Cloud ML Engine (templated)
· data_format (str) – The format of the input data. It will default to 'DATA_FORMAT_UNSPECIFIED' if it is not provided or is not one of [TEXT, TF_RECORD, TF_RECORD_GZIP].
· input_paths (list of string) – A list of GCS paths of input data for batch prediction. Accepting the wildcard operator *, but only at the end. (templated)
· output_path (str) – The GCS path where the prediction results are written to (templated)
· region (str) – The Google Compute Engine region to run the prediction job in (templated)
· model_name (str) – The Google Cloud ML Engine model to use for prediction If version_name is not provided the default version of this model will be used Should not be None if version_name is provided Should be None if uri is provided (templated)
· version_name (str) – The Google Cloud ML Engine model version to use for prediction Should be None if uri is provided (templated)
· uri (str) – The GCS path of the saved model to use for prediction Should be None if model_name is provided It should be a GCS path pointing to a tensorflow SavedModel (templated)
· max_worker_count (int) – The maximum number of workers to be used for parallel processing Defaults to 10 if not specified
· runtime_version (str) – The Google Cloud ML Engine runtime version to use for batch prediction
· gcp_conn_id (str) – The connection ID used for connection to Google Cloud Platform
· delegate_to (str) – The account to impersonate, if any. For this to work, the service account making the request must have domain-wide delegation enabled.
Raises
ValueError: if a unique model/version origin cannot be determined.
class airflow.contrib.operators.mlengine_operator.MLEngineModelOperator(project_id, model, operation='create', gcp_conn_id='google_cloud_default', delegate_to=None, *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Operator for managing a Google Cloud ML Engine model.
Parameters
· project_id (str) – The Google Cloud project name to which MLEngine model belongs (templated)
· model (dict) –
A dictionary containing the information about the model If the operation is create then the model parameter should contain all the information about this model such as name
If the operation is get the model parameter should contain the name of the model
· operation (str) –
The operation to perform Available operations are
o create Creates a new model as provided by the model parameter
o get Gets a particular model where the name is specified in model
· gcp_conn_id (str) – The connection ID to use when fetching connection info
· delegate_to (str) – The account to impersonate if any For this to work the service account making the request must have domainwide delegation enabled
class airflow.contrib.operators.mlengine_operator.MLEngineVersionOperator(project_id, model_name, version_name=None, version=None, operation='create', gcp_conn_id='google_cloud_default', delegate_to=None, *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Operator for managing a Google Cloud ML Engine version.
Parameters
· project_id (str) – The Google Cloud project name to which the MLEngine model belongs.
· model_name (str) – The name of the Google Cloud ML Engine model that the version belongs to. (templated)
· version_name (str) – A name to use for the version being operated upon. If not None and the version argument is None or does not have a value for the name key, then this will be populated in the payload for the name key. (templated)
· version (dict) – A dictionary containing the information about the version. If the operation is create, version should contain all the information about this version, such as name and deploymentUrl. If the operation is get or delete, the version parameter should contain the name of the version. If it is None, the only operation possible would be list. (templated)
· operation (str) –
The operation to perform. Available operations are:
o create: Creates a new version in the model specified by model_name, in which case the version parameter should contain all the information to create that version (e.g. name, deploymentUrl).
o get: Gets full information of a particular version in the model specified by model_name. The name of the version should be specified in the version parameter.
o list: Lists all available versions of the model specified by model_name.
o delete: Deletes the version specified in the version parameter from the model specified by model_name. The name of the version should be specified in the version parameter.
· gcp_conn_id (str) – The connection ID to use when fetching connection info.
· delegate_to (str) – The account to impersonate, if any. For this to work, the service account making the request must have domain-wide delegation enabled.
class airflow.contrib.operators.mlengine_operator.MLEngineTrainingOperator(project_id, job_id, package_uris, training_python_module, training_args, region, scale_tier=None, runtime_version=None, python_version=None, job_dir=None, gcp_conn_id='google_cloud_default', delegate_to=None, mode='PRODUCTION', *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Operator for launching a MLEngine training job.
Parameters
· project_id (str) – The Google Cloud project name within which the MLEngine training job should run. (templated)
· job_id (str) – A unique templated id for the submitted Google MLEngine training job. (templated)
· package_uris (str) – A list of package locations for the MLEngine training job, which should include the main training program + any additional dependencies. (templated)
· training_python_module (str) – The Python module name to run within the MLEngine training job after installing the package_uris packages. (templated)
· training_args (str) – A list of templated command line arguments to pass to the MLEngine training program. (templated)
· region (str) – The Google Compute Engine region to run the MLEngine training job in. (templated)
· scale_tier (str) – Resource tier for the MLEngine training job. (templated)
· runtime_version (str) – The Google Cloud ML runtime version to use for training. (templated)
· python_version (str) – The version of Python used in training. (templated)
· job_dir (str) – A Google Cloud Storage path in which to store training outputs and other data needed for training. (templated)
· gcp_conn_id (str) – The connection ID to use when fetching connection info.
· delegate_to (str) – The account to impersonate, if any. For this to work, the service account making the request must have domain-wide delegation enabled.
· mode (str) – Can be one of 'DRY_RUN'/'CLOUD'. In 'DRY_RUN' mode, no real training job will be launched, but the MLEngine training job request will be printed out. In 'CLOUD' mode, a real MLEngine training job creation request will be issued.
class airflow.contrib.operators.mongo_to_s3.MongoToS3Operator(mongo_conn_id, s3_conn_id, mongo_collection, mongo_query, s3_bucket, s3_key, mongo_db=None, *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Mongo -> S3
A more specific baseOperator meant to move data from Mongo via pymongo to S3 via boto.
Things to note:
execute() is written to depend on transform(). transform() is meant to be extended by child classes to perform transformations unique to those operators' needs.
execute(context)[source]
Executed by task_instance at runtime.
static transform(docs)[source]
Processes a pyMongo cursor and returns an iterable with each element being a JSON-serializable dictionary.
The base transform() assumes no processing is needed, i.e. docs is a pyMongo cursor of documents and the cursor just needs to be passed through.
Override this method for custom transformations.
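To make the override pattern concrete, here is a minimal, hypothetical transform() body (not the operator's actual implementation): it assumes documents whose values may include datetimes and an '_id' field, and stringifies both so the result is JSON-serializable.

```python
from datetime import datetime, timezone

def transform(docs):
    """Illustrative transform() override: make Mongo documents
    JSON-serializable. Hypothetical sketch, not Airflow's code."""
    for doc in docs:
        out = {}
        for key, value in doc.items():
            if key == "_id":
                out[key] = str(value)          # ObjectId -> str
            elif isinstance(value, datetime):
                out[key] = value.isoformat()   # datetime -> ISO-8601 string
            else:
                out[key] = value
        yield out

# A cursor is just an iterable of dicts as far as transform() is concerned.
docs = [{"_id": 42, "created": datetime(2018, 4, 1, tzinfo=timezone.utc)}]
print(list(transform(docs)))
```

In a real subclass this would be a staticmethod on the child operator, and `_id` would typically be a bson ObjectId rather than an int.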
class airflow.contrib.operators.mysql_to_gcs.MySqlToGoogleCloudStorageOperator(sql, bucket, filename, schema_filename=None, approx_max_file_size_bytes=1900000000, mysql_conn_id='mysql_default', google_cloud_storage_conn_id='google_cloud_default', schema=None, delegate_to=None, *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Copy data from MySQL to Google Cloud Storage in JSON format.
classmethod type_map(mysql_type)[source]
Helper function that maps from MySQL fields to BigQuery fields. Used when a schema_filename is set.
class airflow.contrib.operators.postgres_to_gcs_operator.PostgresToGoogleCloudStorageOperator(sql, bucket, filename, schema_filename=None, approx_max_file_size_bytes=1900000000, postgres_conn_id='postgres_default', google_cloud_storage_conn_id='google_cloud_default', delegate_to=None, parameters=None, *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Copy data from Postgres to Google Cloud Storage in JSON format.
classmethod convert_types(value)[source]
Takes a value from Postgres and converts it to a value that's safe for JSON/Google Cloud Storage/BigQuery. Dates are converted to UTC seconds. Decimals are converted to floats. Times are converted to seconds.
classmethod type_map(postgres_type)[source]
Helper function that maps from Postgres fields to BigQuery fields. Used when a schema_filename is set.
class airflow.contrib.operators.pubsub_operator.PubSubTopicCreateOperator(project, topic, fail_if_exists=False, gcp_conn_id='google_cloud_default', delegate_to=None, *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Create a PubSub topic.
By default, if the topic already exists, this operator will not cause the DAG to fail.

with DAG('successful DAG') as dag:
    (
        dag
        >> PubSubTopicCreateOperator(project='my-project',
                                     topic='my_new_topic')
        >> PubSubTopicCreateOperator(project='my-project',
                                     topic='my_new_topic')
    )

The operator can be configured to fail if the topic already exists.

with DAG('failing DAG') as dag:
    (
        dag
        >> PubSubTopicCreateOperator(project='my-project',
                                     topic='my_new_topic')
        >> PubSubTopicCreateOperator(project='my-project',
                                     topic='my_new_topic',
                                     fail_if_exists=True)
    )

Both project and topic are templated so you can use variables in them.
class airflow.contrib.operators.pubsub_operator.PubSubTopicDeleteOperator(project, topic, fail_if_not_exists=False, gcp_conn_id='google_cloud_default', delegate_to=None, *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Delete a PubSub topic.
By default, if the topic does not exist, this operator will not cause the DAG to fail.

with DAG('successful DAG') as dag:
    (
        dag
        >> PubSubTopicDeleteOperator(project='my-project',
                                     topic='non_existing_topic')
    )

The operator can be configured to fail if the topic does not exist.

with DAG('failing DAG') as dag:
    (
        dag
        >> PubSubTopicDeleteOperator(project='my-project',
                                     topic='non_existing_topic',
                                     fail_if_not_exists=True)
    )

Both project and topic are templated so you can use variables in them.
class airflow.contrib.operators.pubsub_operator.PubSubSubscriptionCreateOperator(topic_project, topic, subscription=None, subscription_project=None, ack_deadline_secs=10, fail_if_exists=False, gcp_conn_id='google_cloud_default', delegate_to=None, *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Create a PubSub subscription.
By default, the subscription will be created in topic_project. If subscription_project is specified and the GCP credentials allow, the Subscription can be created in a different project from its topic.
By default, if the subscription already exists, this operator will not cause the DAG to fail. However, the topic must exist in the project.

with DAG('successful DAG') as dag:
    (
        dag
        >> PubSubSubscriptionCreateOperator(
            topic_project='my-project', topic='my-topic',
            subscription='my-subscription')
        >> PubSubSubscriptionCreateOperator(
            topic_project='my-project', topic='my-topic',
            subscription='my-subscription')
    )

The operator can be configured to fail if the subscription already exists.

with DAG('failing DAG') as dag:
    (
        dag
        >> PubSubSubscriptionCreateOperator(
            topic_project='my-project', topic='my-topic',
            subscription='my-subscription')
        >> PubSubSubscriptionCreateOperator(
            topic_project='my-project', topic='my-topic',
            subscription='my-subscription', fail_if_exists=True)
    )

Finally, subscription is not required. If not passed, the operator will generate a universally unique identifier for the subscription's name.

with DAG('DAG') as dag:
    (
        dag >> PubSubSubscriptionCreateOperator(
            topic_project='my-project', topic='my-topic')
    )

topic_project, topic, subscription, and subscription_project are templated so you can use variables in them.
class airflow.contrib.operators.pubsub_operator.PubSubSubscriptionDeleteOperator(project, subscription, fail_if_not_exists=False, gcp_conn_id='google_cloud_default', delegate_to=None, *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Delete a PubSub subscription.
By default, if the subscription does not exist, this operator will not cause the DAG to fail.

with DAG('successful DAG') as dag:
    (
        dag
        >> PubSubSubscriptionDeleteOperator(project='my-project',
                                            subscription='non-existing')
    )

The operator can be configured to fail if the subscription does not exist.

with DAG('failing DAG') as dag:
    (
        dag
        >> PubSubSubscriptionDeleteOperator(
            project='my-project', subscription='non-existing',
            fail_if_not_exists=True)
    )

project and subscription are templated so you can use variables in them.
class airflow.contrib.operators.pubsub_operator.PubSubPublishOperator(project, topic, messages, gcp_conn_id='google_cloud_default', delegate_to=None, *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Publish messages to a PubSub topic.
Each Task publishes all provided messages to the same topic in a single GCP project. If the topic does not exist, this task will fail.

from base64 import b64encode as b64e

m1 = {'data': b64e('Hello, World!'),
      'attributes': {'type': 'greeting'}
     }
m2 = {'data': b64e('Knock, knock')}
m3 = {'attributes': {'foo': ''}}

t1 = PubSubPublishOperator(
    project='my-project', topic='my_topic',
    messages=[m1, m2, m3],
    create_topic=True,
    dag=dag)

``project``, ``topic``, and ``messages`` are templated so you can use variables in them.
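The message shape used above (a base64-encoded 'data' payload plus an optional string-to-string 'attributes' map) can be built with a small helper. This is an illustrative sketch, not part of the operator's API; note that in Python 3, b64encode requires bytes, so str payloads must be encoded first.

```python
from base64 import b64encode, b64decode

def make_message(data=None, attributes=None):
    """Build a Pub/Sub message dict in the shape shown above
    (hypothetical helper for illustration)."""
    msg = {}
    if data is not None:
        # b64encode needs bytes in Python 3; decode back to str for JSON.
        msg["data"] = b64encode(data.encode("utf-8")).decode("ascii")
    if attributes:
        msg["attributes"] = attributes
    return msg

m1 = make_message("Hello, World!", {"type": "greeting"})
print(b64decode(m1["data"]).decode("utf-8"))  # Hello, World!
```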
class airflow.contrib.operators.s3_copy_object_operator.S3CopyObjectOperator(source_bucket_key, dest_bucket_key, source_bucket_name=None, dest_bucket_name=None, source_version_id=None, aws_conn_id='aws_default', verify=None, *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Creates a copy of an object that is already stored in S3.
Note: the S3 connection used here needs to have access to both source and destination bucket/key.
Parameters
· source_bucket_key (str) –
The key of the source object.
It can be either a full s3:// style url or a relative path from the root level.
When it's specified as a full s3:// url, please omit source_bucket_name.
· dest_bucket_key (str) –
The key of the object to copy to.
The convention to specify dest_bucket_key is the same as source_bucket_key.
· source_bucket_name (str) –
Name of the S3 bucket where the source object is in.
It should be omitted when source_bucket_key is provided as a full s3:// url.
· dest_bucket_name (str) –
Name of the S3 bucket to where the object is copied.
It should be omitted when dest_bucket_key is provided as a full s3:// url.
· source_version_id (str) – Version ID of the source object (OPTIONAL)
· aws_conn_id (str) – Connection id of the S3 connection to use.
· verify (bool or str) –
Whether or not to verify SSL certificates for the S3 connection. By default SSL certificates are verified.
You can provide the following values:
o False: do not validate SSL certificates. SSL will still be used, but SSL certificates will not be verified.
o path/to/cert/bundle.pem: A filename of the CA cert bundle to use. You can specify this argument if you want to use a different CA cert bundle than the one used by botocore.
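The convention above — a key is either a full s3:// url (in which case the bucket name must be omitted) or a relative key paired with a bucket name — can be sketched as follows. This is a hypothetical helper illustrating the rule, not the operator's implementation.

```python
from urllib.parse import urlparse

def parse_s3_key(bucket_key, bucket_name=None):
    """Reconcile a full s3:// url with an optional bucket name, mirroring
    the source_bucket_key/source_bucket_name convention (sketch only)."""
    parsed = urlparse(bucket_key)
    if parsed.scheme == "s3":
        if bucket_name is not None:
            raise ValueError("omit bucket_name when a full s3:// url is given")
        return parsed.netloc, parsed.path.lstrip("/")
    return bucket_name, bucket_key

print(parse_s3_key("s3://data/customers/2018/04/file.json"))
# ('data', 'customers/2018/04/file.json')
print(parse_s3_key("customers/2018/04/file.json", "data"))
# ('data', 'customers/2018/04/file.json')
```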
class airflow.contrib.operators.s3_delete_objects_operator.S3DeleteObjectsOperator(bucket, keys, aws_conn_id='aws_default', verify=None, *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Enables users to delete a single object or multiple objects from a bucket using a single HTTP request.
Users may specify up to 1000 keys to delete.
Parameters
· bucket (str) – Name of the bucket in which you are going to delete object(s).
· keys (str or list) –
The key(s) to delete from the S3 bucket.
When keys is a string, it's supposed to be the key name of the single object to delete.
When keys is a list, it's supposed to be the list of the keys to delete.
You may specify up to 1000 keys.
· aws_conn_id (str) – Connection id of the S3 connection to use.
· verify (bool or str) –
Whether or not to verify SSL certificates for the S3 connection. By default SSL certificates are verified.
You can provide the following values:
o False: do not validate SSL certificates. SSL will still be used, but SSL certificates will not be verified.
o path/to/cert/bundle.pem: A filename of the CA cert bundle to use. You can specify this argument if you want to use a different CA cert bundle than the one used by botocore.
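The 1000-key cap comes from the S3 DeleteObjects API, which accepts at most 1000 keys per request. If you have more keys than that, you would split them into batches yourself, for example like this sketch (a hypothetical caller-side helper, not something the operator does):

```python
def delete_batches(keys, batch_size=1000):
    """Split a key list into batches that each fit one S3 DeleteObjects
    request; a single string is treated as a one-key list, matching the
    str-or-list behaviour of the `keys` parameter."""
    keys = [keys] if isinstance(keys, str) else list(keys)
    return [keys[i:i + batch_size] for i in range(0, len(keys), batch_size)]

batches = delete_batches([f"logs/{i}.gz" for i in range(2500)])
print([len(b) for b in batches])  # [1000, 1000, 500]
```

Each batch could then be passed to its own S3DeleteObjectsOperator task.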
class airflow.contrib.operators.s3_list_operator.S3ListOperator(bucket, prefix='', delimiter='', aws_conn_id='aws_default', verify=None, *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
List all objects from the bucket with the given string prefix in name.
This operator returns a python list with the names of objects, which can be used by xcom in the downstream task.
Parameters
· bucket (str) – The S3 bucket where to find the objects. (templated)
· prefix (str) – Prefix string to filter the objects whose name begins with such prefix. (templated)
· delimiter (str) – the delimiter marks key hierarchy. (templated)
· aws_conn_id (str) – The connection ID to use when connecting to S3 storage.
· verify (bool or str) –
Whether or not to verify SSL certificates for the S3 connection. By default SSL certificates are verified.
You can provide the following values:
o False: do not validate SSL certificates. SSL will still be used (unless use_ssl is False), but SSL certificates will not be verified.
o path/to/cert/bundle.pem: A filename of the CA cert bundle to use. You can specify this argument if you want to use a different CA cert bundle than the one used by botocore.
Example:
The following operator would list all the files (excluding subfolders) from the S3 customers/2018/04/ key in the data bucket.

s3_file = S3ListOperator(
    task_id='list_3s_files',
    bucket='data',
    prefix='customers/2018/04/',
    delimiter='/',
    aws_conn_id='aws_customers_conn'
)
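The effect of prefix and delimiter in the example above can be illustrated with a toy filter (a sketch of S3 listing semantics, not the operator's code): keys must start with the prefix, and with a delimiter set, keys that live in a deeper "folder" below the prefix are excluded.

```python
def list_keys(all_keys, prefix="", delimiter=""):
    """Toy illustration of S3 prefix/delimiter filtering."""
    result = []
    for key in all_keys:
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        if delimiter and delimiter in rest:
            continue  # belongs to a subfolder under the prefix
        result.append(key)
    return result

keys = ["customers/2018/04/a.csv",
        "customers/2018/04/archive/b.csv",
        "customers/2018/05/c.csv"]
print(list_keys(keys, prefix="customers/2018/04/", delimiter="/"))
# ['customers/2018/04/a.csv']
```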
class airflow.contrib.operators.s3_to_gcs_operator.S3ToGoogleCloudStorageOperator(bucket, prefix='', delimiter='', aws_conn_id='aws_default', verify=None, dest_gcs_conn_id=None, dest_gcs=None, delegate_to=None, replace=False, *args, **kwargs)[source]
Bases: airflow.contrib.operators.s3_list_operator.S3ListOperator
Synchronizes an S3 key, possibly a prefix, with a Google Cloud Storage destination path.
Parameters
· bucket (str) – The S3 bucket where to find the objects. (templated)
· prefix (str) – Prefix string which filters objects whose name begins with such prefix. (templated)
· delimiter (str) – the delimiter marks key hierarchy. (templated)
· aws_conn_id (str) – The source S3 connection.
· dest_gcs_conn_id (str) – The destination connection ID to use when connecting to Google Cloud Storage.
· dest_gcs (str) – The destination Google Cloud Storage bucket and prefix where you want to store the files. (templated)
· delegate_to (str) – The account to impersonate, if any. For this to work, the service account making the request must have domain-wide delegation enabled.
· replace (bool) – Whether you want to replace existing destination files or not.
· verify (bool or str) –
Whether or not to verify SSL certificates for the S3 connection. By default SSL certificates are verified.
You can provide the following values:
o False: do not validate SSL certificates. SSL will still be used (unless use_ssl is False), but SSL certificates will not be verified.
o path/to/cert/bundle.pem: A filename of the CA cert bundle to use. You can specify this argument if you want to use a different CA cert bundle than the one used by botocore.
Example:

s3_to_gcs_op = S3ToGoogleCloudStorageOperator(
    task_id='s3_to_gcs_example',
    bucket='my-s3-bucket',
    prefix='data/customers-201804',
    dest_gcs_conn_id='google_cloud_default',
    dest_gcs='gs://my.gcs.bucket/some/customers/',
    replace=False,
    dag=my_dag)

Note that bucket, prefix, delimiter and dest_gcs are templated, so you can use variables in them if you wish.
class airflow.contrib.operators.sftp_operator.SFTPOperator(ssh_hook=None, ssh_conn_id=None, remote_host=None, local_filepath=None, remote_filepath=None, operation='put', confirm=True, *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
SFTPOperator for transferring files from a remote host to local or vice versa. This operator uses ssh_hook to open an sftp transport channel that serves as the basis for the file transfer.
Parameters
· ssh_hook (SSHHook) – predefined ssh_hook to use for remote execution. Either ssh_hook or ssh_conn_id needs to be provided.
· ssh_conn_id (str) – connection id from airflow Connections. ssh_conn_id will be ignored if ssh_hook is provided.
· remote_host (str) – remote host to connect (templated). Nullable. If provided, it will replace the remote_host which was defined in ssh_hook or predefined in the connection of ssh_conn_id.
· local_filepath (str) – local file path to get or put. (templated)
· remote_filepath (str) – remote file path to get or put. (templated)
· operation – specify operation 'get' or 'put', defaults to put
· confirm (bool) – specify if the SFTP operation should be confirmed, defaults to True
class airflow.contrib.operators.slack_webhook_operator.SlackWebhookOperator(http_conn_id=None, webhook_token=None, message='', channel=None, username=None, icon_emoji=None, link_names=False, proxy=None, *args, **kwargs)[source]
Bases: airflow.operators.http_operator.SimpleHttpOperator
This operator allows you to post messages to Slack using incoming webhooks. Takes both a Slack webhook token directly and a connection that has a Slack webhook token. If both are supplied, the Slack webhook token will be used.
Each Slack webhook token can be pre-configured to use a specific channel, username and icon. You can override these defaults in this hook.
Parameters
· http_conn_id (str) – connection that has the Slack webhook token in the extra field
· webhook_token (str) – Slack webhook token
· message (str) – The message you want to send on Slack
· channel (str) – The channel the message should be posted to
· username (str) – The username to post to Slack with
· icon_emoji (str) – The emoji to use as the icon for the user posting to Slack
· link_names (bool) – Whether or not to find and link channel and usernames in your message
· proxy (str) – Proxy to use to make the Slack webhook call
execute(context)[source]
Call the SlackWebhookHook to post the provided Slack message.
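The "override the webhook's defaults" behaviour boils down to building the JSON body the incoming webhook expects. A simplified, assumed sketch of that payload construction (not the hook's actual code, which handles more fields):

```python
import json

def build_slack_payload(message, channel=None, username=None,
                        icon_emoji=None, link_names=False):
    """Build the JSON body of an incoming-webhook call; fields given here
    override the defaults pre-configured on the webhook (sketch only)."""
    payload = {"text": message}
    if channel:
        payload["channel"] = channel
    if username:
        payload["username"] = username
    if icon_emoji:
        payload["icon_emoji"] = icon_emoji
    if link_names:
        payload["link_names"] = 1
    return json.dumps(payload)

print(build_slack_payload("deploy finished", channel="#ops",
                          username="airflow", link_names=True))
```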
class airflow.contrib.operators.spark_jdbc_operator.SparkJDBCOperator(spark_app_name='airflow-spark-jdbc', spark_conn_id='spark-default', spark_conf=None, spark_py_files=None, spark_files=None, spark_jars=None, num_executors=None, executor_cores=None, executor_memory=None, driver_memory=None, verbose=False, keytab=None, principal=None, cmd_type='spark_to_jdbc', jdbc_table=None, jdbc_conn_id='jdbc-default', jdbc_driver=None, metastore_table=None, jdbc_truncate=False, save_mode=None, save_format=None, batch_size=None, fetch_size=None, num_partitions=None, partition_column=None, lower_bound=None, upper_bound=None, create_table_column_types=None, *args, **kwargs)[source]
Bases: airflow.contrib.operators.spark_submit_operator.SparkSubmitOperator
This operator extends the SparkSubmitOperator specifically for performing data transfers to/from JDBC-based databases with Apache Spark. As with the SparkSubmitOperator, it assumes that the spark-submit binary is available on the PATH.
Parameters
· spark_app_name (str) – Name of the job (default airflow-spark-jdbc)
· spark_conn_id (str) – Connection id as configured in Airflow administration
· spark_conf (dict) – Any additional Spark configuration properties
· spark_py_files (str) – Additional python files used (.zip, .egg, or .py)
· spark_files (str) – Additional files to upload to the container running the job
· spark_jars (str) – Additional jars to upload and add to the driver and executor classpath
· num_executors (int) – number of executors to run. This should be set so as to manage the number of connections made with the JDBC database
· executor_cores (int) – Number of cores per executor
· executor_memory (str) – Memory per executor (e.g. 1000M, 2G)
· driver_memory (str) – Memory allocated to the driver (e.g. 1000M, 2G)
· verbose (bool) – Whether to pass the verbose flag to spark-submit for debugging
· keytab (str) – Full path to the file that contains the keytab
· principal (str) – The name of the kerberos principal used for keytab
· cmd_type (str) – Which way the data should flow. 2 possible values: spark_to_jdbc: data written by spark from metastore to jdbc; jdbc_to_spark: data written by spark from jdbc to metastore
· jdbc_table (str) – The name of the JDBC table
· jdbc_conn_id – Connection id used for connection to the JDBC database
· jdbc_driver (str) – Name of the JDBC driver to use for the JDBC connection. This driver (usually a jar) should be passed in the 'jars' parameter
· metastore_table (str) – The name of the metastore table
· jdbc_truncate (bool) – (spark_to_jdbc only) Whether or not Spark should truncate or drop and recreate the JDBC table. This only takes effect if 'save_mode' is set to Overwrite. Also, if the schema is different, Spark cannot truncate, and will drop and recreate
· save_mode (str) – The Spark save-mode to use (e.g. overwrite, append, etc.)
· save_format (str) – (jdbc_to_spark only) The Spark save-format to use (e.g. parquet)
· batch_size (int) – (spark_to_jdbc only) The size of the batch to insert per round trip to the JDBC database. Defaults to 1000
· fetch_size (int) – (jdbc_to_spark only) The size of the batch to fetch per round trip from the JDBC database. Default depends on the JDBC driver
· num_partitions (int) – The maximum number of partitions that can be used by Spark simultaneously, both for spark_to_jdbc and jdbc_to_spark operations. This will also cap the number of JDBC connections that can be opened
· partition_column (str) – (jdbc_to_spark only) A numeric column to be used to partition the metastore table by. If specified, you must also specify: num_partitions, lower_bound, upper_bound
· lower_bound (int) – (jdbc_to_spark only) Lower bound of the range of the numeric partition column to fetch. If specified, you must also specify: num_partitions, partition_column, upper_bound
· upper_bound (int) – (jdbc_to_spark only) Upper bound of the range of the numeric partition column to fetch. If specified, you must also specify: num_partitions, partition_column, lower_bound
· create_table_column_types – (spark_to_jdbc only) The database column data types to use instead of the defaults when creating the table. Data type information should be specified in the same format as CREATE TABLE columns syntax (e.g. "name CHAR(64), comments VARCHAR(1024)"). The specified types should be valid spark sql data types.
Type
jdbc_conn_id: str
execute(context)[source]
Call the SparkSubmitHook to run the provided spark job.
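The partition_column/lower_bound/upper_bound/num_partitions quartet works together: Spark splits the numeric column's value range into num_partitions sub-ranges and reads each with its own JDBC connection. The following is only an illustration of that idea, not Spark's exact stride arithmetic:

```python
def partition_bounds(lower_bound, upper_bound, num_partitions):
    """Split [lower_bound, upper_bound) into num_partitions contiguous
    ranges, one per parallel JDBC reader (illustrative sketch)."""
    stride = (upper_bound - lower_bound) // num_partitions or 1
    bounds, start = [], lower_bound
    for i in range(num_partitions):
        end = upper_bound if i == num_partitions - 1 else start + stride
        bounds.append((start, end))
        start = end
    return bounds

print(partition_bounds(0, 100, 4))  # [(0, 25), (25, 50), (50, 75), (75, 100)]
```

This also shows why num_partitions caps the number of simultaneous JDBC connections: one range, one connection.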
class airflow.contrib.operators.spark_sql_operator.SparkSqlOperator(sql, conf=None, conn_id='spark_sql_default', total_executor_cores=None, executor_cores=None, executor_memory=None, keytab=None, principal=None, master='yarn', name='default-name', num_executors=None, yarn_queue='default', *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Execute a Spark SQL query.
Parameters
· sql (str) – The SQL query to execute. (templated)
· conf (str (format: PROP=VALUE)) – arbitrary Spark configuration property
· conn_id (str) – connection_id string
· total_executor_cores (int) – (Standalone & Mesos only) Total cores for all executors (Default: all the available cores on the worker)
· executor_cores (int) – (Standalone & YARN only) Number of cores per executor (Default: 2)
· executor_memory (str) – Memory per executor (e.g. 1000M, 2G) (Default: 1G)
· keytab (str) – Full path to the file that contains the keytab
· master (str) – spark://host:port, mesos://host:port, yarn, or local
· name (str) – Name of the job
· num_executors (int) – Number of executors to launch
· verbose (bool) – Whether to pass the verbose flag to spark-sql
· yarn_queue (str) – The YARN queue to submit to (Default: "default")
execute(context)[source]
Call the SparkSqlHook to run the provided sql query.
class airflow.contrib.operators.spark_submit_operator.SparkSubmitOperator(application='', conf=None, conn_id='spark_default', files=None, py_files=None, driver_classpath=None, jars=None, java_class=None, packages=None, exclude_packages=None, repositories=None, total_executor_cores=None, executor_cores=None, executor_memory=None, driver_memory=None, keytab=None, principal=None, name='airflow-spark', num_executors=None, application_args=None, env_vars=None, verbose=False, *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
This operator is a wrapper around the spark-submit binary to kick off a spark-submit job. It requires that the spark-submit binary is in the PATH or that spark-home is set in the extra on the connection.
Parameters
· application (str) – The application that is submitted as a job, either a jar or py file. (templated)
· conf (dict) – Arbitrary Spark configuration properties
· conn_id (str) – The connection id as configured in Airflow administration. When an invalid connection_id is supplied, it will default to yarn
· files (str) – Upload additional files to the executor running the job, separated by a comma. Files will be placed in the working directory of each executor. For example, serialized objects
· py_files (str) – Additional python files used by the job, can be .zip, .egg or .py
· jars (str) – Submit additional jars to upload and place them in the executor classpath
· driver_classpath (str) – Additional, driver-specific, classpath settings
· java_class (str) – the main class of the Java application
· packages (str) – Comma-separated list of maven coordinates of jars to include on the driver and executor classpaths. (templated)
· exclude_packages (str) – Comma-separated list of maven coordinates of jars to exclude while resolving the dependencies provided in 'packages'
· repositories (str) – Comma-separated list of additional remote repositories to search for the maven coordinates given with 'packages'
· total_executor_cores (int) – (Standalone & Mesos only) Total cores for all executors (Default: all the available cores on the worker)
· executor_cores (int) – (Standalone & YARN only) Number of cores per executor (Default: 2)
· executor_memory (str) – Memory per executor (e.g. 1000M, 2G) (Default: 1G)
· driver_memory (str) – Memory allocated to the driver (e.g. 1000M, 2G) (Default: 1G)
· keytab (str) – Full path to the file that contains the keytab
· principal (str) – The name of the kerberos principal used for keytab
· name (str) – Name of the job (default airflow-spark). (templated)
· num_executors (int) – Number of executors to launch
· application_args (list) – Arguments for the application being submitted
· env_vars (dict) – Environment variables for spark-submit. It supports yarn and k8s mode too
· verbose (bool) – Whether to pass the verbose flag to the spark-submit process for debugging
execute(context)[source]
Call the SparkSubmitHook to run the provided spark job.
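To see how a few of these parameters map onto an actual spark-submit command line, here is a hypothetical sketch covering a handful of the flags (the real SparkSubmitHook supports many more and handles quoting, connections, and Kerberos):

```python
def build_spark_submit_cmd(application, conf=None, name=None,
                           num_executors=None, executor_memory=None,
                           application_args=None):
    """Sketch of parameter -> spark-submit flag mapping (illustration only)."""
    cmd = ["spark-submit"]
    for key, value in (conf or {}).items():
        cmd += ["--conf", f"{key}={value}"]
    if name:
        cmd += ["--name", name]
    if num_executors:
        cmd += ["--num-executors", str(num_executors)]
    if executor_memory:
        cmd += ["--executor-memory", executor_memory]
    cmd.append(application)          # the jar or .py file comes last...
    cmd += application_args or []    # ...followed by the app's own args
    return cmd

print(" ".join(build_spark_submit_cmd(
    "job.py", conf={"spark.ui.enabled": "false"},
    name="airflow-spark", executor_memory="2G",
    application_args=["--date", "2021-07-08"])))
```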
class airflow.contrib.operators.sqoop_operator.SqoopOperator(conn_id='sqoop_default', cmd_type='import', table=None, query=None, target_dir=None, append=None, file_type='text', columns=None, num_mappers=None, split_by=None, where=None, export_dir=None, input_null_string=None, input_null_non_string=None, staging_table=None, clear_staging_table=False, enclosed_by=None, escaped_by=None, input_fields_terminated_by=None, input_lines_terminated_by=None, input_optionally_enclosed_by=None, batch=False, direct=False, driver=None, verbose=False, relaxed_isolation=False, properties=None, hcatalog_database=None, hcatalog_table=None, create_hcatalog_table=False, extra_import_options=None, extra_export_options=None, *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Execute a Sqoop job. Documentation for Apache Sqoop can be found here:
https://sqoop.apache.org/docs/1.4.2/SqoopUserGuide.html
execute(context)[source]
Execute the sqoop job.
class airflow.contrib.operators.ssh_operator.SSHOperator(ssh_hook=None, ssh_conn_id=None, remote_host=None, command=None, timeout=10, do_xcom_push=False, *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
SSHOperator to execute commands on a given remote host using the ssh_hook.
Parameters
· ssh_hook (SSHHook) – predefined ssh_hook to use for remote execution. Either ssh_hook or ssh_conn_id needs to be provided.
· ssh_conn_id (str) – connection id from airflow Connections. ssh_conn_id will be ignored if ssh_hook is provided.
· remote_host (str) – remote host to connect (templated). Nullable. If provided, it will replace the remote_host which was defined in ssh_hook or predefined in the connection of ssh_conn_id.
· command (str) – command to execute on the remote host. (templated)
· timeout (int) – timeout (in seconds) for executing the command
· do_xcom_push (bool) – return the stdout, which also gets set in xcom by the airflow platform
class airflow.contrib.operators.vertica_operator.VerticaOperator(sql, vertica_conn_id='vertica_default', *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Executes sql code in a specific Vertica database.
Parameters
· vertica_conn_id (str) – reference to a specific Vertica database
· sql (Can receive a str representing a sql statement, a list of str (sql statements), or a reference to a template file. Template references are recognized by str ending in '.sql') – the sql code to be executed. (templated)
class airflow.contrib.operators.vertica_to_hive.VerticaToHiveTransfer(sql, hive_table, create=True, recreate=False, partition=None, delimiter=u'\x01', vertica_conn_id='vertica_default', hive_cli_conn_id='hive_cli_default', *args, **kwargs)[source]
Bases: airflow.models.BaseOperator
Moves data from Vertica to Hive. The operator runs your query against Vertica and stores the file locally before loading it into a Hive table. If the create or recreate arguments are set to True, CREATE TABLE and DROP TABLE statements are generated. Hive data types are inferred from the cursor's metadata. Note that the table generated in Hive uses STORED AS textfile, which isn't the most efficient serialization format. If a large amount of data is loaded and/or if the table gets queried considerably, you may want to use this operator only to stage the data into a temporary table before loading it into its final destination using a HiveOperator.
Parameters
· sql (str) – SQL query to execute against the Vertica database. (templated)
· hive_table (str) – target Hive table, use dot notation to target a specific database. (templated)
· create (bool) – whether to create the table if it doesn't exist
· recreate (bool) – whether to drop and recreate the table at every execution
· partition (dict) – target partition as a dict of partition columns and values. (templated)
· delimiter (str) – field delimiter in the file
· vertica_conn_id (str) – source Vertica connection
· hive_cli_conn_id (str) – destination hive connection
Sensors
class airflow.contrib.sensors.aws_redshift_cluster_sensor.AwsRedshiftClusterSensor(cluster_identifier, target_status='available', aws_conn_id='aws_default', *args, **kwargs)[source]
Bases: airflow.sensors.base_sensor_operator.BaseSensorOperator
Waits for a Redshift cluster to reach a specific status.
Parameters
· cluster_identifier (str) – The identifier for the cluster being pinged
· target_status (str) – The desired cluster status
poke(context)[source]
Function that the sensors defined while deriving this class should override.
class airflow.contrib.sensors.bash_sensor.BashSensor(bash_command, env=None, output_encoding='utf-8', *args, **kwargs)[source]
Bases: airflow.sensors.base_sensor_operator.BaseSensorOperator
Executes a bash command/script and returns True if and only if the return code is 0
Parameters
· bash_command (str) – The command, set of commands or reference to a bash script (must be '.sh') to be executed
· env (dict) – If env is not None, it must be a mapping that defines the environment variables for the new process; these are used instead of inheriting the current process environment, which is the default behavior (templated)
· output_encoding (str) – output encoding of bash command
poke(context)[source]
Execute the bash command in a temporary directory which will be cleaned afterwards
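The condition this sensor checks can be sketched in plain Python; the `bash_poke` helper below is illustrative, not Airflow's implementation (which also runs the command in a cleaned-up temporary directory):

```python
import subprocess

def bash_poke(bash_command):
    """Sketch of a BashSensor-style poke: run the command through a
    shell and report success if and only if the return code is 0.
    (shell=True uses /bin/sh; Airflow invokes bash explicitly.)"""
    result = subprocess.run(
        bash_command, shell=True, capture_output=True, text=True
    )
    return result.returncode == 0

# A sensor would call this repeatedly until it returns True.
print(bash_poke("exit 0"))  # True
print(bash_poke("exit 1"))  # False
```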
class airflow.contrib.sensors.bigquery_sensor.BigQueryTableSensor(project_id, dataset_id, table_id, bigquery_conn_id='bigquery_default_conn', delegate_to=None, *args, **kwargs)[source]
Bases: airflow.sensors.base_sensor_operator.BaseSensorOperator
Checks for the existence of a table in Google Bigquery
Parameters
· project_id (str) – The Google cloud project in which to look for the table. The connection supplied to the hook must provide access to the specified project
· dataset_id (str) – The name of the dataset in which to look for the table storage bucket
· table_id (str) – The name of the table to check the existence of
· bigquery_conn_id (str) – The connection ID to use when connecting to Google BigQuery
· delegate_to (str) – The account to impersonate, if any. For this to work, the service account making the request must have domain-wide delegation enabled
poke(context)[source]
Function that the sensors defined while deriving this class should override
class airflow.contrib.sensors.cassandra_record_sensor.CassandraRecordSensor(table, keys, cassandra_conn_id, *args, **kwargs)[source]
Bases: airflow.sensors.base_sensor_operator.BaseSensorOperator
Checks for the existence of a record in a Cassandra cluster
For example, if you want to wait for a record that has values 'v1' and 'v2' for the primary keys 'p1' and 'p2' to be populated in keyspace 'k' and table 't', instantiate it as follows:
>>> cassandra_sensor = CassandraRecordSensor(table="k.t",
...     keys={"p1": "v1", "p2": "v2"},
...     cassandra_conn_id="cassandra_default",
...     task_id="cassandra_sensor")
poke(context)[source]
Function that the sensors defined while deriving this class should override
class airflow.contrib.sensors.cassandra_table_sensor.CassandraTableSensor(table, cassandra_conn_id, *args, **kwargs)[source]
Bases: airflow.sensors.base_sensor_operator.BaseSensorOperator
Checks for the existence of a table in a Cassandra cluster
For example, if you want to wait for a table called 't' to be created in a keyspace 'k', instantiate it as follows:
>>> cassandra_sensor = CassandraTableSensor(table="k.t",
...     cassandra_conn_id="cassandra_default",
...     task_id="cassandra_sensor")
poke(context)[source]
Function that the sensors defined while deriving this class should override
class airflow.contrib.sensors.emr_base_sensor.EmrBaseSensor(aws_conn_id='aws_default', *args, **kwargs)[source]
Bases: airflow.sensors.base_sensor_operator.BaseSensorOperator
Contains general sensor behavior for EMR. Subclasses should implement get_emr_response() and state_from_response() methods. Subclasses should also implement NON_TERMINAL_STATES and FAILED_STATE constants.
poke(context)[source]
Function that the sensors defined while deriving this class should override
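The base/subclass contract described above can be sketched without Airflow. The hook names (get_emr_response, state_from_response, NON_TERMINAL_STATES, FAILED_STATE) mirror the documentation; the classes themselves are hypothetical stand-ins:

```python
class BaseStatusSensor:
    """Sketch of the EmrBaseSensor pattern: the base class implements
    poke() in terms of hooks the subclass must provide."""
    NON_TERMINAL_STATES = None  # subclasses define these constants
    FAILED_STATE = None

    def get_emr_response(self):
        raise NotImplementedError

    @staticmethod
    def state_from_response(response):
        raise NotImplementedError

    def poke(self, context=None):
        state = self.state_from_response(self.get_emr_response())
        if state == self.FAILED_STATE:
            raise RuntimeError("job reached the failed state")
        # terminal state reached -> condition satisfied
        return state not in self.NON_TERMINAL_STATES

class FakeJobFlowSensor(BaseStatusSensor):
    NON_TERMINAL_STATES = {"STARTING", "RUNNING"}
    FAILED_STATE = "TERMINATED_WITH_ERRORS"

    def __init__(self, state):
        self._state = state

    def get_emr_response(self):
        # stand-in for the boto3 describe_cluster response
        return {"Cluster": {"Status": {"State": self._state}}}

    @staticmethod
    def state_from_response(response):
        return response["Cluster"]["Status"]["State"]

print(FakeJobFlowSensor("RUNNING").poke())     # False: still non-terminal
print(FakeJobFlowSensor("TERMINATED").poke())  # True: terminal state reached
```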
class airflow.contrib.sensors.emr_job_flow_sensor.EmrJobFlowSensor(job_flow_id, *args, **kwargs)[source]
Bases: airflow.contrib.sensors.emr_base_sensor.EmrBaseSensor
Asks for the state of the JobFlow until it reaches a terminal state. If it fails, the sensor errors, failing the task.
Parameters
job_flow_id (str) – job_flow_id to check the state of
class airflow.contrib.sensors.emr_step_sensor.EmrStepSensor(job_flow_id, step_id, *args, **kwargs)[source]
Bases: airflow.contrib.sensors.emr_base_sensor.EmrBaseSensor
Asks for the state of the step until it reaches a terminal state. If it fails, the sensor errors, failing the task.
Parameters
· job_flow_id (str) – job_flow_id which contains the step to check the state of
· step_id (str) – step to check the state of
class airflow.contrib.sensors.file_sensor.FileSensor(filepath, fs_conn_id='fs_default', *args, **kwargs)[source]
Bases: airflow.sensors.base_sensor_operator.BaseSensorOperator
Waits for a file or folder to land in a filesystem
If the path given is a directory then this sensor will only return true if any files exist inside it (either directly or within a subdirectory)
Parameters
· fs_conn_id (str) – reference to the File (path) connection id
· filepath – File or folder name (relative to the base path set within the connection)
poke(context)[source]
Function that the sensors defined while deriving this class should override
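The directory semantics described above (true only if some file exists inside, directly or within a subdirectory) can be sketched with the standard library; `file_poke` is an illustrative helper, not the FileSensor source:

```python
import os
import tempfile

def file_poke(filepath):
    """Sketch of the FileSensor condition: a plain file satisfies the
    sensor; a directory satisfies it only if some file exists inside,
    either directly or in any subdirectory."""
    if os.path.isfile(filepath):
        return True
    if os.path.isdir(filepath):
        for _root, _dirs, files in os.walk(filepath):
            if files:
                return True
    return False

with tempfile.TemporaryDirectory() as d:
    print(file_poke(d))                        # False: empty directory
    sub = os.path.join(d, "sub")
    os.makedirs(sub)
    open(os.path.join(sub, "part-0000"), "w").close()
    print(file_poke(d))                        # True: a file landed in a subdirectory
```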
class airflow.contrib.sensors.ftp_sensor.FTPSensor(path, ftp_conn_id='ftp_default', *args, **kwargs)[source]
Bases: airflow.sensors.base_sensor_operator.BaseSensorOperator
Waits for a file or directory to be present on FTP
Parameters
· path (str) – Remote file or directory path
· ftp_conn_id (str) – The connection to run the sensor against
poke(context)[source]
Function that the sensors defined while deriving this class should override
class airflow.contrib.sensors.ftp_sensor.FTPSSensor(path, ftp_conn_id='ftp_default', *args, **kwargs)[source]
Bases: airflow.contrib.sensors.ftp_sensor.FTPSensor
Waits for a file or directory to be present on FTP over SSL
class airflow.contrib.sensors.gcs_sensor.GoogleCloudStorageObjectSensor(bucket, object, google_cloud_conn_id='google_cloud_default', delegate_to=None, *args, **kwargs)[source]
Bases: airflow.sensors.base_sensor_operator.BaseSensorOperator
Checks for the existence of a file in Google Cloud Storage. Create a new GoogleCloudStorageObjectSensor.
Parameters
· bucket (str) – The Google cloud storage bucket where the object is
· object (str) – The name of the object to check in the Google cloud storage bucket
· google_cloud_storage_conn_id (str) – The connection ID to use when connecting to Google cloud storage
· delegate_to (str) – The account to impersonate, if any. For this to work, the service account making the request must have domain-wide delegation enabled
poke(context)[source]
Function that the sensors defined while deriving this class should override
class airflow.contrib.sensors.gcs_sensor.GoogleCloudStorageObjectUpdatedSensor(bucket, object, ts_func, google_cloud_conn_id='google_cloud_default', delegate_to=None, *args, **kwargs)[source]
Bases: airflow.sensors.base_sensor_operator.BaseSensorOperator
Checks if an object is updated in Google Cloud Storage. Create a new GoogleCloudStorageObjectUpdatedSensor.
Parameters
· bucket (str) – The Google cloud storage bucket where the object is
· object (str) – The name of the object to download in the Google cloud storage bucket
· ts_func (function) – Callback for defining the update condition. The default callback returns execution_date + schedule_interval. The callback takes the context as parameter
· google_cloud_storage_conn_id (str) – The connection ID to use when connecting to Google cloud storage
· delegate_to (str) – The account to impersonate, if any. For this to work, the service account making the request must have domain-wide delegation enabled
poke(context)[source]
Function that the sensors defined while deriving this class should override
class airflow.contrib.sensors.gcs_sensor.GoogleCloudStoragePrefixSensor(bucket, prefix, google_cloud_conn_id='google_cloud_default', delegate_to=None, *args, **kwargs)[source]
Bases: airflow.sensors.base_sensor_operator.BaseSensorOperator
Checks for the existence of files at a prefix in a Google Cloud Storage bucket. Create a new GoogleCloudStoragePrefixSensor.
Parameters
· bucket (str) – The Google cloud storage bucket where the object is
· prefix (str) – The name of the prefix to check in the Google cloud storage bucket
· google_cloud_storage_conn_id (str) – The connection ID to use when connecting to Google cloud storage
· delegate_to (str) – The account to impersonate, if any. For this to work, the service account making the request must have domain-wide delegation enabled
poke(context)[source]
Function that the sensors defined while deriving this class should override
class airflow.contrib.sensors.hdfs_sensor.HdfsSensorFolder(be_empty=False, *args, **kwargs)[source]
Bases: airflow.sensors.hdfs_sensor.HdfsSensor
poke(context)[source]
Poke for a non-empty directory
Returns
Bool depending on the search criteria
class airflow.contrib.sensors.hdfs_sensor.HdfsSensorRegex(regex, *args, **kwargs)[source]
Bases: airflow.sensors.hdfs_sensor.HdfsSensor
poke(context)[source]
Poke matching files in a directory with self.regex
Returns
Bool depending on the search criteria
class airflow.contrib.sensors.pubsub_sensor.PubSubPullSensor(project, subscription, max_messages=5, return_immediately=False, ack_messages=False, gcp_conn_id='google_cloud_default', delegate_to=None, *args, **kwargs)[source]
Bases: airflow.sensors.base_sensor_operator.BaseSensorOperator
Pulls messages from a PubSub subscription and passes them through XCom
This sensor operator will pull up to max_messages messages from the specified PubSub subscription. When the subscription returns messages, the poke method's criteria will be fulfilled and the messages will be returned from the operator and passed through XCom for downstream tasks.
If ack_messages is set to True, messages will be immediately acknowledged before being returned; otherwise, downstream tasks will be responsible for acknowledging them.
project and subscription are templated so you can use variables in them.
execute(context)[source]
Overridden to allow messages to be passed
poke(context)[source]
Function that the sensors defined while deriving this class should override
class airflow.contrib.sensors.sftp_sensor.SFTPSensor(path, sftp_conn_id='sftp_default', *args, **kwargs)[source]
Bases: airflow.sensors.base_sensor_operator.BaseSensorOperator
Waits for a file or directory to be present on SFTP
Parameters
· path (str) – Remote file or directory path
· sftp_conn_id (str) – The connection to run the sensor against
poke(context)[source]
Function that the sensors defined while deriving this class should override
Macros
Here’s a list of variables and macros that can be used in templates
Default Variables
The Airflow engine passes a few variables by default that are accessible in all templates
Variable
Description
{{ ds }}
the execution date as YYYY-MM-DD
{{ ds_nodash }}
the execution date as YYYYMMDD
{{ prev_ds }}
the previous execution date as YYYY-MM-DD. If {{ ds }} is 2016-01-08 and schedule_interval is @weekly, {{ prev_ds }} will be 2016-01-01
{{ prev_ds_nodash }}
the previous execution date as YYYYMMDD if exists, else None
{{ next_ds }}
the next execution date as YYYY-MM-DD. If {{ ds }} is 2016-01-01 and schedule_interval is @weekly, {{ next_ds }} will be 2016-01-08
{{ next_ds_nodash }}
the next execution date as YYYYMMDD if exists, else None
{{ yesterday_ds }}
yesterday's date as YYYY-MM-DD
{{ yesterday_ds_nodash }}
yesterday's date as YYYYMMDD
{{ tomorrow_ds }}
tomorrow's date as YYYY-MM-DD
{{ tomorrow_ds_nodash }}
tomorrow's date as YYYYMMDD
{{ ts }}
same as execution_date.isoformat()
{{ ts_nodash }}
same as ts without - and :
{{ execution_date }}
the execution_date (datetime.datetime)
{{ prev_execution_date }}
the previous execution date (if available) (datetime.datetime)
{{ next_execution_date }}
the next execution date (datetime.datetime)
{{ dag }}
the DAG object
{{ task }}
the Task object
{{ macros }}
a reference to the macros package described below
{{ task_instance }}
the task_instance object
{{ end_date }}
same as {{ ds }}
{{ latest_date }}
same as {{ ds }}
{{ ti }}
same as {{ task_instance }}
{{ params }}
a reference to the user-defined params dictionary which can be overridden by the dictionary passed through trigger_dag -c if you enabled dag_run_conf_overrides_params in airflow.cfg
{{ var.value.my_var }}
global defined variables represented as a dictionary
{{ var.json.my_var.path }}
global defined variables represented as a dictionary with deserialized JSON object, append the path to the key within the JSON object
{{ task_instance_key_str }}
a unique, human-readable key to the task instance formatted {dag_id}_{task_id}_{ds}
{{ conf }}
the full configuration object located at airflow.configuration.conf which represents the content of your airflow.cfg
{{ run_id }}
the run_id of the current DAG run
{{ dag_run }}
a reference to the DagRun object
{{ test_mode }}
whether the task instance was called using the CLI’s test subcommand
Note that you can access the object's attributes and methods with simple dot notation. Here are some examples of what is possible: {{ task.owner }}, {{ task.task_id }}, {{ ti.hostname }}, … Refer to the models documentation for more information on the objects' attributes and methods
The var template variable allows you to access variables defined in Airflow's UI. You can access them as either plain-text or JSON. If you use JSON, you are also able to walk nested structures, such as dictionaries, like {{ var.json.my_dict_var.key1 }}
Macros
Macros are a way to expose objects to your templates and live under the macros namespace in your templates
A few commonly used libraries and methods are made available
Variable
Description
macros.datetime
The standard lib's datetime.datetime
macros.timedelta
The standard lib's datetime.timedelta
macros.dateutil
A reference to the dateutil package
macros.time
The standard lib's time
macros.uuid
The standard lib's uuid
macros.random
The standard lib's random
Some airflow specific macros are also defined:
airflow.macros.ds_add(ds, days)[source]
Add or subtract days from a YYYY-MM-DD
Parameters
· ds (str) – anchor date in YYYY-MM-DD format to add to
· days (int) – number of days to add to the ds, you can use negative values
>>> ds_add('2015-01-01', 5)
'2015-01-06'
>>> ds_add('2015-01-06', -5)
'2015-01-01'
airflow.macros.ds_format(ds, input_format, output_format)[source]
Takes an input string and outputs another string as specified in the output format
Parameters
· ds (str) – input string which contains a date
· input_format (str) – input string format, e.g. %Y-%m-%d
· output_format (str) – output string format, e.g. %Y-%m-%d
>>> ds_format('2015-01-01', "%Y-%m-%d", "%m-%d-%y")
'01-01-15'
>>> ds_format('1/5/2015', "%m/%d/%Y", "%Y-%m-%d")
'2015-01-05'
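For illustration, the documented behavior of both macros can be reproduced with the standard library alone; these are sketches, not the Airflow source:

```python
from datetime import datetime, timedelta

def ds_add(ds, days):
    """Add (or, with a negative count, subtract) days from a
    YYYY-MM-DD date string."""
    parsed = datetime.strptime(ds, "%Y-%m-%d")
    return (parsed + timedelta(days=days)).strftime("%Y-%m-%d")

def ds_format(ds, input_format, output_format):
    """Re-render a date string from one strftime format into another."""
    return datetime.strptime(ds, input_format).strftime(output_format)

print(ds_add('2015-01-01', 5))                        # 2015-01-06
print(ds_add('2015-01-06', -5))                       # 2015-01-01
print(ds_format('1/5/2015', "%m/%d/%Y", "%Y-%m-%d"))  # 2015-01-05
```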
airflow.macros.random() → x in the interval [0, 1)
airflow.macros.hive.closest_ds_partition(table, ds, before=True, schema='default', metastore_conn_id='metastore_default')[source]
This function finds the date in a list closest to the target date. An optional parameter can be given to get the closest before or after
Parameters
· table (str) – A hive table name
· ds (datetime.date list) – A datestamp %Y-%m-%d e.g. yyyy-mm-dd
· before (bool or None) – closest before (True), after (False) or either side of ds
Returns
The closest date
Return type
str or None
>>> tbl = 'airflow.static_babynames_partitioned'
>>> closest_ds_partition(tbl, '2015-01-02')
'2015-01-01'
airflow.macros.hive.max_partition(table, schema='default', field=None, filter_map=None, metastore_conn_id='metastore_default')[source]
Gets the max partition for a table
Parameters
· schema (str) – The hive schema the table lives in
· table (str) – The hive table you are interested in, supports the dot notation as in my_database.my_table, if a dot is found, the schema param is disregarded
· metastore_conn_id (str) – The hive connection you are interested in. If your default is set you don't need to use this parameter
· filter_map (map) – partition_key:partition_value map used for partition filtering, e.g. {'key1': 'value1', 'key2': 'value2'}. Only partitions matching all partition_key:partition_value pairs will be considered as candidates of max partition
· field (str) – the field to get the max value from. If there's only one partition field, this will be inferred
>>> max_partition('airflow.static_babynames_partitioned')
'2015-01-01'
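The selection logic of max_partition (filter by filter_map, then take the max of the remaining partition values) can be sketched over plain dicts; the real macro reads the Hive metastore, and this helper is purely illustrative:

```python
def pick_max_partition(partitions, field, filter_map=None):
    """Sketch of max_partition's selection over a list of partition
    dicts: keep only partitions matching every key/value pair in
    filter_map, then return the max value of the requested field."""
    filter_map = filter_map or {}
    candidates = [
        p for p in partitions
        if all(p.get(k) == v for k, v in filter_map.items())
    ]
    return max((p[field] for p in candidates), default=None)

parts = [
    {"ds": "2015-01-01", "country": "us"},
    {"ds": "2015-01-02", "country": "fr"},
]
print(pick_max_partition(parts, "ds"))                     # 2015-01-02
print(pick_max_partition(parts, "ds", {"country": "us"}))  # 2015-01-01
```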
Model
Models are built on top of the SQLAlchemy ORM Base class and instances are persisted in the database
class airflow.models.BaseOperator(task_id, owner='Airflow', email=None, email_on_retry=True, email_on_failure=True, retries=0, retry_delay=datetime.timedelta(0, 300), retry_exponential_backoff=False, max_retry_delay=None, start_date=None, end_date=None, schedule_interval=None, depends_on_past=False, wait_for_downstream=False, dag=None, params=None, default_args=None, adhoc=False, priority_weight=1, weight_rule=u'downstream', queue='default', pool=None, sla=None, execution_timeout=None, on_failure_callback=None, on_success_callback=None, on_retry_callback=None, trigger_rule=u'all_success', resources=None, run_as_user=None, task_concurrency=None, executor_config=None, inlets=None, outlets=None, *args, **kwargs)[source]
Bases: airflow.utils.log.logging_mixin.LoggingMixin
Abstract base class for all operators. Since operators create objects that become nodes in the dag, BaseOperator contains many recursive methods for dag crawling behavior. To derive this class, you are expected to override the constructor as well as the 'execute' method.
Operators derived from this class should perform or trigger certain tasks synchronously (wait for completion). Example of operators could be an operator that runs a Pig job (PigOperator), a sensor operator that waits for a partition to land in Hive (HiveSensorOperator), or one that moves data from Hive to MySQL (Hive2MySqlOperator). Instances of these operators (tasks) target specific operations, running specific scripts, functions or data transfers.
This class is abstract and shouldn't be instantiated. Instantiating a class derived from this one results in the creation of a task object, which ultimately becomes a node in DAG objects. Task dependencies should be set by using the set_upstream and/or set_downstream methods.
Parameters
· task_id (str) – a unique meaningful id for the task
· owner (str) – the owner of the task using the unix username is recommended
· retries (int) – the number of retries that should be performed before failing the task
· retry_delay (timedelta) – delay between retries
· retry_exponential_backoff (bool) – allow progressively longer waits between retries by using an exponential backoff algorithm on retry delay (delay will be converted into seconds)
· max_retry_delay (timedelta) – maximum delay interval between retries
· start_date (datetime) – The start_date for the task, determines the execution_date for the first task instance. The best practice is to have the start_date rounded to your DAG's schedule_interval. Daily jobs have their start_date some day at 00:00:00, hourly jobs have their start_date at 00:00 of a specific hour. Note that Airflow simply looks at the latest execution_date and adds the schedule_interval to determine the next execution_date. It is also very important to note that different tasks' dependencies need to line up in time. If task A depends on task B and their start_date are offset in a way that their execution_date don't line up, A's dependencies will never be met. If you are looking to delay a task, for example running a daily task at 2AM, look into the TimeSensor and TimeDeltaSensor. We advise against using dynamic start_date and recommend using fixed ones. Read the FAQ entry about start_date for more information
· end_date (datetime) – if specified the scheduler won’t go beyond this date
· depends_on_past (bool) – when set to true task instances will run sequentially while relying on the previous task’s schedule to succeed The task instance for the start_date is allowed to run
· wait_for_downstream (bool) – when set to true an instance of task X will wait for tasks immediately downstream of the previous instance of task X to finish successfully before it runs This is useful if the different instances of a task X alter the same asset and this asset is used by tasks downstream of task X Note that depends_on_past is forced to True wherever wait_for_downstream is used
· queue (str) – which queue to target when running this job Not all executors implement queue management the CeleryExecutor does support targeting specific queues
· dag (DAG) – a reference to the dag the task is attached to (if any)
· priority_weight (int) – priority weight of this task against other task This allows the executor to trigger higher priority tasks before others when things get backed up Set priority_weight as a higher number for more important tasks
· weight_rule (str) – weighting method used for the effective total priority weight of the task. Options are { downstream | upstream | absolute }, default is downstream. When set to downstream, the effective weight of the task is the aggregate sum of all downstream descendants. As a result, upstream tasks will have higher weight and will be scheduled more aggressively when using positive weight values. This is useful when you have multiple dag run instances and desire to have all upstream tasks to complete for all runs before each dag can continue processing downstream tasks. When set to upstream, the effective weight is the aggregate sum of all upstream ancestors. This is the opposite, where downstream tasks have higher weight and will be scheduled more aggressively when using positive weight values. This is useful when you have multiple dag run instances and prefer to have each dag complete before starting upstream tasks of other dags. When set to absolute, the effective weight is the exact priority_weight specified without additional weighting. You may want to do this when you know exactly what priority weight each task should have. Additionally, when set to absolute, there is a bonus effect of significantly speeding up the task creation process for very large DAGs. Options can be set as string or using the constants defined in the static class airflow.utils.WeightRule
· pool (str) – the slot pool this task should run in slot pools are a way to limit concurrency for certain tasks
· sla (datetime.timedelta) – time by which the job is expected to succeed. Note that this represents the timedelta after the period is closed. For example, if you set an SLA of 1 hour, the scheduler would send an email soon after 1:00AM on the 2016-01-02 if the 2016-01-01 instance has not succeeded yet. The scheduler pays special attention for jobs with an SLA and sends alert emails for sla misses. SLA misses are also recorded in the database for future reference. All tasks that share the same SLA time get bundled in a single email, sent soon after that time. SLA notifications are sent once and only once for each task instance
· execution_timeout (datetimetimedelta) – max time allowed for the execution of this task instance if it goes beyond it will raise and fail
· on_failure_callback (callable) – a function to be called when a task instance of this task fails a context dictionary is passed as a single parameter to this function Context contains references to related objects to the task instance and is documented under the macros section of the API
· on_retry_callback (callable) – much like the on_failure_callback except that it is executed when retries occur
· on_success_callback (callable) – much like the on_failure_callback except that it is executed when the task succeeds
· trigger_rule (str) – defines the rule by which dependencies are applied for the task to get triggered Options are { all_success | all_failed | all_done | one_success | one_failed | dummy} default is all_success Options can be set as string or using the constants defined in the static class airflowutilsTriggerRule
· resources (dict) – A map of resource parameter names (the argument names of the Resources constructor) to their values
· run_as_user (str) – unix username to impersonate while running the task
· task_concurrency (int) – When set a task will be able to limit the concurrent runs across execution_dates
· executor_config (dict) –
Additional task-level configuration parameters that are interpreted by a specific executor. Parameters are namespaced by the name of executor.
Example: to run this task in a specific docker container through the KubernetesExecutor
MyOperator(
    executor_config={
        "KubernetesExecutor": {"image": "myCustomDockerImage"}
    }
)
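The interaction between retries, retry_delay, retry_exponential_backoff and max_retry_delay described in the parameter list above can be sketched as a doubling schedule capped at the maximum; this is an illustration of the idea, and Airflow's actual backoff computation differs in its details:

```python
from datetime import timedelta

def retry_delays(retries, retry_delay, exponential=False, max_retry_delay=None):
    """Illustrative sketch: the wait before each retry attempt,
    doubling when exponential backoff is on, capped at max_retry_delay."""
    delays = []
    for try_number in range(1, retries + 1):
        delay = retry_delay
        if exponential:
            delay = retry_delay * (2 ** (try_number - 1))
        if max_retry_delay is not None:
            delay = min(delay, max_retry_delay)
        delays.append(delay)
    return delays

print(retry_delays(4, timedelta(minutes=5), exponential=True,
                   max_retry_delay=timedelta(minutes=30)))
# waits of 5, 10, 20, then 30 (capped) minutes
```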
clear(**kwargs)[source]
Clears the state of task instances associated with the task following the parameters specified
dag
Returns the Operator’s DAG if set otherwise raises an error
deps
Returns the list of dependencies for the operator. These differ from execution context dependencies in that they are specific to tasks and can be extended/overridden by subclasses
downstream_list
@property list of tasks directly downstream
execute(context)[source]
This is the main method to derive when creating an operator Context is the same dictionary used as when rendering jinja templates
Refer to get_template_context for more context
get_direct_relative_ids(upstream=False)[source]
Get the direct relative ids to the current task, upstream or downstream
get_direct_relatives(upstream=False)[source]
Get the direct relatives to the current task, upstream or downstream
get_flat_relative_ids(upstream=False, found_descendants=None)[source]
Get a flat list of relatives' ids, either upstream or downstream
get_flat_relatives(upstream=False)[source]
Get a flat list of relatives, either upstream or downstream
get_task_instances(session, start_date=None, end_date=None)[source]
Get a set of task instance related to this task for a specific date range
has_dag()[source]
Returns True if the Operator has been assigned to a DAG
on_kill()[source]
Override this method to cleanup subprocesses when a task instance gets killed. Any use of the threading, subprocess or multiprocessing module within an operator needs to be cleaned up or it will leave ghost processes behind
post_execute(context, *args, **kwargs)[source]
This hook is triggered right after self.execute() is called. It is passed the execution context and any results returned by the operator
pre_execute(context, *args, **kwargs)[source]
This hook is triggered right before self.execute() is called
prepare_template()[source]
Hook that is triggered after the templated fields get replaced by their content If you need your operator to alter the content of the file before the template is rendered it should override this method to do so
render_template(attr content context)[source]
Renders a template either from a file or directly in a field and returns the rendered result
render_template_from_field(attr content context jinja_env)[source]
Renders a template from a field If the field is a string it will simply render the string and return the result If it is a collection or nested set of collections it will traverse the structure and render all strings in it
run(start_date=None, end_date=None, ignore_first_depends_on_past=False, ignore_ti_state=False, mark_success=False)[source]
Run a set of task instances for a date range
schedule_interval
The schedule interval of the DAG always wins over individual tasks so that tasks within a DAG always line up The task still needs a schedule_interval as it may not be attached to a DAG
set_downstream(task_or_task_list)[source]
Set a task or a task list to be directly downstream from the current task
set_upstream(task_or_task_list)[source]
Set a task or a task list to be directly upstream from the current task
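The dependency bookkeeping these two methods perform can be sketched with a hypothetical Task class (this is not Airflow's implementation, just the symmetric upstream/downstream relationship it maintains):

```python
class Task:
    """Hypothetical stand-in illustrating set_upstream/set_downstream."""
    def __init__(self, task_id):
        self.task_id = task_id
        self.upstream_list = []
        self.downstream_list = []

    def set_downstream(self, task_or_task_list):
        tasks = (task_or_task_list if isinstance(task_or_task_list, list)
                 else [task_or_task_list])
        for t in tasks:
            # record the edge on both ends
            self.downstream_list.append(t)
            t.upstream_list.append(self)

    def set_upstream(self, task_or_task_list):
        tasks = (task_or_task_list if isinstance(task_or_task_list, list)
                 else [task_or_task_list])
        for t in tasks:
            t.set_downstream(self)

extract, load = Task("extract"), Task("load")
extract.set_downstream(load)   # extract -> load
print([t.task_id for t in load.upstream_list])      # ['extract']
print([t.task_id for t in extract.downstream_list]) # ['load']
```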
upstream_list
@property list of tasks directly upstream
xcom_pull(context, task_ids=None, dag_id=None, key=u'return_value', include_prior_dates=None)[source]
See TaskInstance.xcom_pull()
xcom_push(context, key, value, execution_date=None)[source]
See TaskInstance.xcom_push()
class airflow.models.Chart(**kwargs)[source]
Bases: sqlalchemy.ext.declarative.api.Base
class airflow.models.Connection(conn_id=None, conn_type=None, host=None, login=None, password=None, schema=None, port=None, extra=None, uri=None)[source]
Bases: sqlalchemy.ext.declarative.api.Base, airflow.utils.log.logging_mixin.LoggingMixin
Placeholder to store information about different database instances' connection information. The idea here is that scripts use references to database instances (conn_id) instead of hard coding hostname, logins and passwords when using operators or hooks
extra_dejson
Returns the extra property by deserializing json
class airflow.models.DAG(dag_id, description=u'', schedule_interval=datetime.timedelta(1), start_date=None, end_date=None, full_filepath=None, template_searchpath=None, user_defined_macros=None, user_defined_filters=None, default_args=None, concurrency=16, max_active_runs=16, dagrun_timeout=None, sla_miss_callback=None, default_view=u'tree', orientation='LR', catchup=True, on_success_callback=None, on_failure_callback=None, params=None)[source]
Bases: airflow.dag.base_dag.BaseDag, airflow.utils.log.logging_mixin.LoggingMixin
A dag (directed acyclic graph) is a collection of tasks with directional dependencies. A dag also has a schedule, a start date and an end date (optional). For each schedule (say daily or hourly), the DAG needs to run each individual task as its dependencies are met. Certain tasks have the property of depending on their own past, meaning that they can't run until their previous schedule (and upstream tasks) are completed.
DAGs essentially act as namespaces for tasks. A task_id can only be added once to a DAG.
Parameters
· dag_id (str) – The id of the DAG
· description (str) – The description for the DAG to eg be shown on the webserver
· schedule_interval (datetime.timedelta or dateutil.relativedelta.relativedelta or str that acts as a cron expression) – Defines how often that DAG runs, this timedelta object gets added to your latest task instance's execution_date to figure out the next schedule
· start_date (datetimedatetime) – The timestamp from which the scheduler will attempt to backfill
· end_date (datetimedatetime) – A date beyond which your DAG won’t run leave to None for open ended scheduling
· template_searchpath (str or list of strings) – This list of folders (non relative) defines where jinja will look for your templates. Order matters. Note that jinja/airflow includes the path of your DAG file by default
· user_defined_macros (dict) – a dictionary of macros that will be exposed in your jinja templates. For example, passing dict(foo='bar') to this argument allows you to use {{ foo }} in all jinja templates related to this DAG. Note that you can pass any type of object here
· user_defined_filters (dict) – a dictionary of filters that will be exposed in your jinja templates. For example, passing dict(hello=lambda name: 'Hello %s' % name) to this argument allows you to use {{ 'world' | hello }} in all jinja templates related to this DAG
· default_args (dict) – A dictionary of default parameters to be used as constructor keyword parameters when initialising operators. Note that operators have the same hook, and precede those defined here, meaning that if your dict contains 'depends_on_past': True here and 'depends_on_past': False in the operator's call default_args, the actual value will be False
· params (dict) – a dictionary of DAG level parameters that are made accessible in templates namespaced under params These params can be overridden at the task level
· concurrency (int) – the number of task instances allowed to run concurrently
· max_active_runs (int) – maximum number of active DAG runs beyond this number of DAG runs in a running state the scheduler won’t create new active DAG runs
· dagrun_timeout (datetimetimedelta) – specify how long a DagRun should be up before timing out failing so that new DagRuns can be created
· sla_miss_callback (typesFunctionType) – specify a function to call when reporting SLA timeouts
· default_view (str) – Specify DAG default view (tree graph duration gantt landing_times)
· orientation (str) – Specify DAG orientation in graph view (LR TB RL BT)
· catchup (bool) – Perform scheduler catchup (or only run latest) Defaults to True
· on_failure_callback (callable) – A function to be called when a DagRun of this dag fails A context dictionary is passed as a single parameter to this function
· on_success_callback (callable) – Much like the on_failure_callback except that it is executed when the dag succeeds
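The default_args precedence described in the parameter list above (arguments passed to the operator win over the DAG-level defaults) amounts to a dict merge; `resolve_operator_args` below is an illustrative helper, not Airflow's code:

```python
def resolve_operator_args(dag_default_args, operator_kwargs):
    """Sketch of default_args precedence: values passed explicitly to
    the operator override the DAG-level defaults."""
    resolved = dict(dag_default_args or {})
    resolved.update(operator_kwargs)
    return resolved

default_args = {"depends_on_past": True, "retries": 2}
print(resolve_operator_args(default_args, {"depends_on_past": False}))
# {'depends_on_past': False, 'retries': 2}
```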
add_task(task)[source]
Add a task to the DAG
Parameters
task (task) – the task you want to add
add_tasks(tasks)[source]
Add a list of tasks to the DAG
Parameters
tasks (list of tasks) – a list of tasks you want to add
clear(**kwargs)[source]
Clears a set of task instances associated with the current dag for a specified date range
cli()[source]
Exposes a CLI specific to this DAG
concurrency_reached
Returns a boolean indicating whether the concurrency limit for this DAG has been reached
create_dagrun(**kwargs)[source]
Creates a dag run from this dag, including the tasks associated with this dag. Returns the dag run
Parameters
· run_id (str) – defines the run id for this dag run
· execution_date (datetime) – the execution date of this dag run
· state (State) – the state of the dag run
· start_date (datetime) – the date this dag run should be evaluated
· external_trigger (bool) – whether this dag run is externally triggered
· session (Session) – database session
static deactivate_stale_dags(*args, **kwargs)[source]
Deactivate any DAGs that were last touched by the scheduler before the expiration date. These DAGs were likely deleted
Parameters
expiration_date (datetime) – set inactive DAGs that were touched before this time
Returns
None
static deactivate_unknown_dags(*args, **kwargs)[source]
Given a list of known DAGs, deactivate any other DAGs that are marked as active in the ORM
Parameters
active_dag_ids (list[unicode]) – list of DAG IDs that are active
Returns
None
filepath
File location of where the dag object is instantiated
folder
Folder location of where the dag object is instantiated
following_schedule(dttm)[source]
Calculates the following schedule for this dag in local time
Parameters
dttm – utc datetime
Returns
utc datetime
get_active_runs(**kwargs)[source]
Returns a list of dag run execution dates currently running
Parameters
session –
Returns
List of execution dates
get_dagrun(**kwargs)[source]
Returns the dag run for a given execution date, if it exists, otherwise None
Parameters
· execution_date – The execution date of the DagRun to find
· session –
Returns
The DagRun if found otherwise None
get_last_dagrun(**kwargs)[source]
Returns the last dag run for this dag, None if there was none. The last dag run can be any type of run, e.g. scheduled or backfilled. Overridden DagRuns are ignored
get_num_active_runs(**kwargs)[source]
Returns the number of active running dag runs
Parameters
· external_trigger (bool) – True for externally triggered active dag runs
· session –
Returns
number greater than 0 for active dag runs
static get_num_task_instances(*args, **kwargs)[source]
Returns the number of task instances in the given DAG
Parameters
· session – ORM session
· dag_id (unicode) – ID of the DAG to get the task concurrency of
· task_ids (list[unicode]) – A list of valid task IDs for the given DAG
· states (list[state]) – A list of states to filter by if supplied
Returns
The number of running tasks
Return type
int
get_run_dates(start_date, end_date=None)[source]
Returns a list of dates between the interval received as parameter using this dag's schedule interval. Returned dates can be used for execution dates
Parameters
· start_date (datetime) – the start date of the interval
· end_date (datetime) – the end date of the interval, defaults to timezone.utcnow()
Returns
a list of dates within the interval following the dag’s schedule
Return type
list
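The date enumeration that get_run_dates performs can be sketched with plain datetime arithmetic. This is only an illustration of the idea for a fixed timedelta interval; the real method uses the DAG's own schedule_interval, cron expressions and timezone logic.

```python
from datetime import datetime, timedelta

# Illustrative sketch: the execution dates between start_date and end_date,
# stepping by a fixed schedule interval (hypothetical helper, not Airflow code).
def run_dates(start_date, end_date, schedule_interval):
    dates = []
    current = start_date
    while current <= end_date:
        dates.append(current)
        current += schedule_interval
    return dates

dates = run_dates(datetime(2021, 1, 1), datetime(2021, 1, 4), timedelta(days=1))
print(len(dates))  # 4 daily execution dates
```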
get_template_env()[source]
Returns a jinja2 Environment while taking into account the DAG's template_searchpath, user_defined_macros and user_defined_filters
handle_callback(**kwargs)[source]
Triggers the appropriate callback depending on the value of success, namely the on_failure_callback or on_success_callback. This method gets the context of a single TaskInstance part of this DagRun and passes that to the callable along with a 'reason', primarily to differentiate DagRun failures. Note:
The logs end up in $AIRFLOW_HOME/logs/scheduler/latest/PROJECT/DAG_FILE.py.log
Parameters
· dagrun – DagRun object
· success – Flag to specify if failure or success callback should be called
· reason – Completion reason
· session – Database session
is_paused
Returns a boolean indicating whether this DAG is paused
latest_execution_date
Returns the latest date for which at least one dag run exists
normalize_schedule(dttm)[source]
Returns dttm + interval, unless dttm is the first interval, in which case it returns dttm
previous_schedule(dttm)[source]
Calculates the previous schedule for this dag in local time
Parameters
dttm – utc datetime
Returns
utc datetime
run(start_date=None, end_date=None, mark_success=False, local=False, executor=None, donot_pickle=False, ignore_task_deps=False, ignore_first_depends_on_past=False, pool=None, delay_on_limit_secs=1.0, verbose=False, conf=None, rerun_failed_tasks=False)[source]
Runs the DAG
Parameters
· start_date (datetime) – the start date of the range to run
· end_date (datetime) – the end date of the range to run
· mark_success (bool) – True to mark jobs as succeeded without running them
· local (bool) – True to run the tasks using the LocalExecutor
· executor (BaseExecutor) – The executor instance to run the tasks
· donot_pickle (bool) – True to avoid pickling DAG object and send to workers
· ignore_task_deps (bool) – True to skip upstream tasks
· ignore_first_depends_on_past (bool) – True to ignore depends_on_past dependencies for the first set of tasks only
· pool (str) – Resource pool to use
· delay_on_limit_secs (float) – Time in seconds to wait before next attempt to run dag run when max_active_runs limit has been reached
· verbose (bool) – Make logging output more verbose
· conf (dict) – user defined dictionary passed from CLI
set_dependency(upstream_task_id, downstream_task_id)[source]
Simple utility method to set dependency between two tasks that already have been added to the DAG using add_task()
sub_dag(task_regex, include_downstream=False, include_upstream=True)[source]
Returns a subset of the current dag as a deep copy of the current dag, based on a regex that should match one or many tasks, and includes upstream and downstream neighbours based on the flags passed
subdags
Returns a list of the subdag objects associated to this DAG
sync_to_db(**kwargs)[source]
Save attributes about this DAG to the DB. Note that this method can be called for both DAGs and SubDAGs. A SubDag is actually a SubDagOperator
Parameters
· dag (DAG) – the DAG object to save to the DB
· sync_time (datetime) – The time that the DAG should be marked as sync’ed
Returns
None
test_cycle()[source]
Check to see if there are any cycles in the DAG. Returns False if no cycle found, otherwise raises exception
topological_sort()[source]
Sorts tasks in topological order, such that a task comes after any of its upstream dependencies
Heavily inspired by: http://blog.jupo.org/2012/04/06/topological-sorting-acyclic-directed-graphs/
Returns
list of tasks in topological order
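The ordering described above can be sketched with Kahn's algorithm on plain task ids. This is a minimal illustration of the concept, not Airflow's implementation (which operates on Task objects and their upstream relations).

```python
from collections import deque

# Kahn's-algorithm sketch: each task appears only after all of its
# upstream dependencies; a leftover node means the graph has a cycle.
def topological_sort(tasks, upstream):
    # upstream: task_id -> set of upstream task_ids
    indegree = {t: len(upstream.get(t, ())) for t in tasks}
    downstream = {t: [] for t in tasks}
    for t, ups in upstream.items():
        for u in ups:
            downstream[u].append(t)
    queue = deque(t for t, d in indegree.items() if d == 0)
    order = []
    while queue:
        t = queue.popleft()
        order.append(t)
        for d in downstream[t]:
            indegree[d] -= 1
            if indegree[d] == 0:
                queue.append(d)
    if len(order) != len(tasks):
        raise ValueError("cycle detected in DAG")
    return order

order = topological_sort(["a", "b", "c"], {"b": {"a"}, "c": {"a", "b"}})
print(order)  # ['a', 'b', 'c']
```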
tree_view()[source]
Shows an ascii tree representation of the DAG
class airflow.models.DagBag(dag_folder=None, executor=None, include_examples=True)[source]
Bases: airflow.dag.base_dag.BaseDagBag, airflow.utils.log.logging_mixin.LoggingMixin
A dagbag is a collection of dags, parsed out of a folder tree, and has high level configuration settings, like what database to use as a backend and what executor to use to fire off tasks. This makes it easier to run distinct environments for say production and development, tests, or for different teams or security profiles. What would have been system level settings are now dagbag level so that one system can run multiple, independent settings sets
Parameters
· dag_folder (unicode) – the folder to scan to find DAGs
· executor – the executor to use when executing task instances in this DagBag
· include_examples (bool) – whether to include the examples that ship with airflow or not
· has_logged – an instance boolean that gets flipped from False to True after a file has been skipped. This is to prevent overloading the user with logging messages about skipped files. Therefore only once per DagBag is a file logged as being skipped
bag_dag(dag parent_dag root_dag)[source]
Adds the DAG into the bag, recurses into sub dags. Throws AirflowDagCycleException if a cycle is detected in this dag or its subdags
collect_dags(dag_folder=None, only_if_updated=True)[source]
Given a file path or a folder, this method looks for python modules, imports them and adds them to the dagbag collection
Note that if a .airflowignore file is found while processing the directory, it will behave much like a .gitignore, ignoring files that match any of the regex patterns specified in the file
Note: The patterns in .airflowignore are treated as unanchored regexes, not shell-like glob patterns
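The unanchored-regex behaviour noted above can be sketched with the re module. is_ignored is a hypothetical helper for illustration; the real filtering happens inside DagBag.collect_dags.

```python
import re

# .airflowignore patterns are unanchored regexes, not shell globs:
# a file is skipped if any pattern matches anywhere in its path.
def is_ignored(path, ignore_patterns):
    return any(re.search(p, path) for p in ignore_patterns)

patterns = ["tmp", r"_test\.py$"]
print(is_ignored("dags/tmp/old_dag.py", patterns))  # True: "tmp" matches anywhere
print(is_ignored("dags/etl_test.py", patterns))     # True: suffix regex matches
print(is_ignored("dags/etl.py", patterns))          # False
```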
dagbag_report()[source]
Prints a report around DagBag loading stats
get_dag(dag_id)[source]
Gets the DAG out of the dictionary and refreshes it if expired
kill_zombies(**kwargs)[source]
Fails tasks that haven’t had a heartbeat in too long
process_file(filepath, only_if_updated=True, safe_mode=True)[source]
Given a path to a python module or zip file, this method imports the module and looks for dag objects within it
size()[source]
Returns
the amount of dags contained in this dagbag
class airflow.models.DagModel(**kwargs)[source]
Bases: sqlalchemy.ext.declarative.api.Base
class airflow.models.DagPickle(dag)[source]
Bases: sqlalchemy.ext.declarative.api.Base
Dags can originate from different places (user repos, master repo, …) and also get executed in different places (different executors). This object represents a version of a DAG and becomes a source of truth for a BackfillJob execution. A pickle is a native python serialized object, and in this case gets stored in the database for the duration of the job
The executors pick up the DagPickle id and read the dag definition from the database
class airflow.models.DagRun(**kwargs)[source]
Bases: sqlalchemy.ext.declarative.api.Base, airflow.utils.log.logging_mixin.LoggingMixin
DagRun describes an instance of a Dag. It can be created by the scheduler (for regular runs) or by an external trigger
static find(*args, **kwargs)[source]
Returns a set of dag runs for the given search criteria
Parameters
· dag_id (int list) – the dag_id to find dag runs for
· run_id (str) – defines the run id for this dag run
· execution_date (datetime) – the execution date
· state (State) – the state of the dag run
· external_trigger (bool) – whether this dag run is externally triggered
· no_backfills (bool) – return no backfills (True), return all (False). Defaults to False
· session (Session) – database session
get_dag()[source]
Returns the Dag associated with this DagRun
Returns
DAG
classmethod get_latest_runs(**kwargs)[source]
Returns the latest DagRun for each DAG
get_previous_dagrun(**kwargs)[source]
The previous DagRun if there is one
get_previous_scheduled_dagrun(**kwargs)[source]
The previous SCHEDULED DagRun if there is one
static get_run(session, dag_id, execution_date)[source]
Parameters
· dag_id (unicode) – DAG ID
· execution_date (datetime) – execution date
Returns
DagRun corresponding to the given dag_id and execution date if one exists, None otherwise
Return type
DagRun
get_task_instance(**kwargs)[source]
Returns the task instance specified by task_id for this dag run
Parameters
task_id – the task id
get_task_instances(**kwargs)[source]
Returns the task instances for this dag run
refresh_from_db(**kwargs)[source]
Reloads the current dagrun from the database
Parameters
session – database session
update_state(**kwargs)[source]
Determines the overall state of the DagRun based on the state of its TaskInstances
Returns
State
verify_integrity(**kwargs)[source]
Verifies the DagRun by checking for removed tasks or tasks that are not in the database yet It will set state to removed or add the task if required
class airflow.models.DagStat(dag_id, state, count=0, dirty=False)[source]
Bases: sqlalchemy.ext.declarative.api.Base
static create(*args, **kwargs)[source]
Creates the missing states in the stats table for the dag specified
Parameters
· dag_id – dag id of the dag to create stats for
· session – database session
Returns

static set_dirty(*args, **kwargs)[source]
Parameters
· dag_id – the dag_id to mark dirty
· session – database session
Returns

static update(*args, **kwargs)[source]
Updates the stats for dirty/out-of-sync dags
Parameters
· dag_ids (list) – dag_ids to be updated
· dirty_only (bool) – only update for those marked dirty, defaults to True
· session (Session) – db session to use
class airflow.models.ImportError(**kwargs)[source]
Bases: sqlalchemy.ext.declarative.api.Base
exception airflow.models.InvalidFernetToken[source]
Bases: exceptions.Exception
class airflow.models.KnownEvent(**kwargs)[source]
Bases: sqlalchemy.ext.declarative.api.Base
class airflow.models.KnownEventType(**kwargs)[source]
Bases: sqlalchemy.ext.declarative.api.Base
class airflow.models.KubeResourceVersion(**kwargs)[source]
Bases: sqlalchemy.ext.declarative.api.Base
class airflow.models.KubeWorkerIdentifier(**kwargs)[source]
Bases: sqlalchemy.ext.declarative.api.Base
class airflow.models.Log(event, task_instance, owner=None, extra=None, **kwargs)[source]
Bases: sqlalchemy.ext.declarative.api.Base
Used to actively log events to the database
class airflow.models.NullFernet[source]
Bases: future.types.newobject.newobject
A Null encryptor class that doesn’t encrypt or decrypt but that presents a similar interface to Fernet
The purpose of this is to make the rest of the code not have to know the difference, and to only display the message once, not 20 times, when airflow initdb is run
class airflow.models.Pool(**kwargs)[source]
Bases: sqlalchemy.ext.declarative.api.Base
open_slots(**kwargs)[source]
Returns the number of slots open at the moment
queued_slots(**kwargs)[source]
Returns the number of slots queued at the moment
used_slots(**kwargs)[source]
Returns the number of slots used at the moment
class airflow.models.SlaMiss(**kwargs)[source]
Bases: sqlalchemy.ext.declarative.api.Base
Model that stores a history of the SLAs that have been missed. It is used to keep track of SLA failures over time and to avoid double triggering alert emails
class airflow.models.TaskFail(task, execution_date, start_date, end_date)[source]
Bases: sqlalchemy.ext.declarative.api.Base
TaskFail tracks the failed run durations of each task instance
class airflow.models.TaskInstance(task, execution_date, state=None)[source]
Bases: sqlalchemy.ext.declarative.api.Base, airflow.utils.log.logging_mixin.LoggingMixin
Task instances store the state of a task instance. This table is the authority and single source of truth around what tasks have run and the state they are in
The SqlAlchemy model doesn't have a SqlAlchemy foreign key to the task or dag model deliberately to have more control over transactions
Database transactions on this table should ensure against double triggers and any confusion around what task instances are or aren't ready to run, even while multiple schedulers may be firing task instances
are_dependencies_met(**kwargs)[source]
Returns whether or not all the conditions are met for this task instance to be run given the context for the dependencies (eg a task instance being force run from the UI will ignore some dependencies)
Parameters
· dep_context (DepContext) – The execution context that determines the dependencies that should be evaluated
· session (Session) – database session
· verbose (bool) – whether log details on failed dependencies on info or debug log level
are_dependents_done(**kwargs)[source]
Checks whether the dependents of this task instance have all succeeded This is meant to be used by wait_for_downstream
This is useful when you do not want to start processing the next schedule of a task until the dependents are done For instance if the task DROPs and recreates a table
clear_xcom_data(**kwargs)[source]
Clears all XCom data from the database for the task instance
command(mark_success=False, ignore_all_deps=False, ignore_depends_on_past=False, ignore_task_deps=False, ignore_ti_state=False, local=False, pickle_id=None, raw=False, job_id=None, pool=None, cfg_path=None)[source]
Returns a command that can be executed anywhere where airflow is installed. This command is part of the message sent to executors by the orchestrator
command_as_list(mark_success=False, ignore_all_deps=False, ignore_task_deps=False, ignore_depends_on_past=False, ignore_ti_state=False, local=False, pickle_id=None, raw=False, job_id=None, pool=None, cfg_path=None)[source]
Returns a command that can be executed anywhere where airflow is installed. This command is part of the message sent to executors by the orchestrator
current_state(**kwargs)[source]
Get the very latest state from the database; if a session is passed, we use it and looking up the state becomes part of the session, otherwise a new session is used
error(**kwargs)[source]
Forces the task instance’s state to FAILED in the database
static generate_command(dag_id, task_id, execution_date, mark_success=False, ignore_all_deps=False, ignore_depends_on_past=False, ignore_task_deps=False, ignore_ti_state=False, local=False, pickle_id=None, file_path=None, raw=False, job_id=None, pool=None, cfg_path=None)[source]
Generates the shell command required to execute this task instance
Parameters
· dag_id (unicode) – DAG ID
· task_id (unicode) – Task ID
· execution_date (datetime) – Execution date for the task
· mark_success (bool) – Whether to mark the task as successful
· ignore_all_deps (bool) – Ignore all ignorable dependencies. Overrides the other ignore_* parameters
· ignore_depends_on_past (bool) – Ignore depends_on_past parameter of DAGs (e.g. for Backfills)
· ignore_task_deps (bool) – Ignore task-specific dependencies such as depends_on_past and trigger rule
· ignore_ti_state (bool) – Ignore the task instance's previous failure/success
· local (bool) – Whether to run the task locally
· pickle_id (unicode) – If the DAG was serialized to the DB, the ID associated with the pickled DAG
· file_path – path to the file containing the DAG definition
· raw – raw mode (needs more details)
· job_id – job ID (needs more details)
· pool (unicode) – the Airflow pool that the task should run in
· cfg_path (basestring) – the Path to the configuration file
Returns
shell command that can be used to run the task instance
get_dagrun(**kwargs)[source]
Returns the DagRun for this TaskInstance
Parameters
session –
Returns
DagRun
init_on_load()[source]
Initialize the attributes that aren’t stored in the DB
init_run_context(raw=False)[source]
Sets the log context
is_eligible_to_retry()[source]
Is the task instance eligible for retry
is_premature
Returns whether a task is in UP_FOR_RETRY state and its retry interval has elapsed
key
Returns a tuple that identifies the task instance uniquely
next_retry_datetime()[source]
Get datetime of the next retry if the task instance fails. For exponential backoff, retry_delay is used as base and will be converted to seconds
pool_full(**kwargs)[source]
Returns a boolean as to whether the slot pool has room for this task to run
previous_ti
The task instance for the task that ran before this task instance
ready_for_retry()[source]
Checks on whether the task instance is in the right state and timeframe to be retried
refresh_from_db(**kwargs)[source]
Refreshes the task instance from the database based on the primary key
Parameters
lock_for_update – if True indicates that the database should lock the TaskInstance (issuing a FOR UPDATE clause) until the session is committed
try_number
Return the try number that this task number will be when it is actually run
If the TI is currently running, this will match the column in the database; in all other cases this will be incremented
xcom_pull(task_ids=None, dag_id=None, key='return_value', include_prior_dates=False)[source]
Pull XComs that optionally meet certain criteria
The default value for key limits the search to XComs that were returned by other tasks (as opposed to those that were pushed manually). To remove this filter, pass key=None (or any desired value)
If a single task_id string is provided the result is the value of the most recent matching XCom from that task_id If multiple task_ids are provided a tuple of matching values is returned None is returned whenever no matches are found
Parameters
· key (str) – A key for the XCom. If provided, only XComs with matching keys will be returned. The default key is 'return_value', also available as a constant XCOM_RETURN_KEY. This key is automatically given to XComs returned by tasks (as opposed to being pushed manually). To remove the filter, pass key=None
· task_ids (str or iterable of strings (representing task_ids)) – Only XComs from tasks with matching ids will be pulled Can pass None to remove the filter
· dag_id (str) – If provided only pulls XComs from this DAG If None (default) the DAG of the calling task is used
· include_prior_dates (bool) – If False only XComs from the current execution_date are returned If True XComs from previous dates are returned as well
xcom_push(key, value, execution_date=None)[source]
Make an XCom available for tasks to pull
Parameters
· key (str) – A key for the XCom
· value (any pickleable object) – A value for the XCom The value is pickled and stored in the database
· execution_date (datetime) – if provided, the XCom will not be visible until this date. This can be used, for example, to send a message to a task on a future date without it being immediately visible
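The push/pull semantics documented above can be sketched with an in-memory store. This mock (not the Airflow implementation, which persists XComs in the metadata DB) shows the two behaviours worth remembering: xcom_pull defaults to key='return_value', and key=None removes the filter so the most recent XCom of any key wins.

```python
# In-memory mock of the XCom push/pull semantics described above.
XCOM_RETURN_KEY = "return_value"
_store = []  # list of (task_id, key, value) tuples

def xcom_push(task_id, key, value):
    _store.append((task_id, key, value))

def xcom_pull(task_ids, key=XCOM_RETURN_KEY):
    for tid, k, v in reversed(_store):  # most recent match wins
        if tid == task_ids and (key is None or k == key):
            return v
    return None

xcom_push("extract", XCOM_RETURN_KEY, [1, 2, 3])
xcom_push("extract", "row_count", 3)

print(xcom_pull("extract"))            # [1, 2, 3]: only 'return_value' XComs match
print(xcom_pull("extract", key=None))  # 3: filter removed, most recent XCom wins
```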
class airflow.models.TaskReschedule(task, execution_date, try_number, start_date, end_date, reschedule_date)[source]
Bases: sqlalchemy.ext.declarative.api.Base
TaskReschedule tracks rescheduled task instances
static find_for_task_instance(*args, **kwargs)[source]
Returns all task reschedules for the task instance and try number in ascending order
Parameters
task_instance (TaskInstance) – the task instance to find task reschedules for
class airflow.models.User(**kwargs)[source]
Bases: sqlalchemy.ext.declarative.api.Base
class airflow.models.Variable(**kwargs)[source]
Bases: sqlalchemy.ext.declarative.api.Base, airflow.utils.log.logging_mixin.LoggingMixin
classmethod setdefault(key, default, deserialize_json=False)[source]
Like a Python builtin dict object, setdefault returns the current value for a key, and if it isn't there, stores the default value and returns it
Parameters
· key (String) – Dict key for this Variable
· default (Mixed) – Default value to set and return if the variable isn't already in the DB
· deserialize_json – Store this as a JSON encoded value in the DB and un-encode it when retrieving a value
Returns
Mixed
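The setdefault semantics above can be sketched with a plain dict standing in for the Variable table. This is a conceptual mock, not Airflow's implementation; variable_setdefault and _variables are hypothetical names for illustration.

```python
import json

# Conceptual sketch of Variable.setdefault: return the stored value if present,
# otherwise store and return the default, JSON-encoding when requested.
_variables = {}  # stands in for the Variable table in the metadata DB

def variable_setdefault(key, default, deserialize_json=False):
    if key not in _variables:
        _variables[key] = json.dumps(default) if deserialize_json else default
    stored = _variables[key]
    return json.loads(stored) if deserialize_json else stored

cfg = variable_setdefault("etl_config", {"retries": 3}, deserialize_json=True)
print(cfg)  # {'retries': 3}; later calls return the stored value, not the default
```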
class airflow.models.XCom(**kwargs)[source]
Bases: sqlalchemy.ext.declarative.api.Base, airflow.utils.log.logging_mixin.LoggingMixin
Base class for XCom objects
classmethod get_many(**kwargs)[source]
Retrieve an XCom value, optionally meeting certain criteria. TODO: pickling has been deprecated and JSON is preferred; pickling will be removed in Airflow 2.0
classmethod get_one(**kwargs)[source]
Retrieve an XCom value, optionally meeting certain criteria. TODO: pickling has been deprecated and JSON is preferred; pickling will be removed in Airflow 2.0
Returns
XCom value
classmethod set(**kwargs)[source]
Store an XCom value. TODO: pickling has been deprecated and JSON is preferred; pickling will be removed in Airflow 2.0
Returns
None
airflow.models.clear_task_instances(tis, session, activate_dag_runs=True, dag=None)[source]
Clears a set of task instances but makes sure the running ones get killed
Parameters
· tis – a list of task instances
· session – current session
· activate_dag_runs – flag to check for active dag run
· dag – DAG object
airflow.models.get_fernet()[source]
Deferred load of Fernet key
This function could fail either because Cryptography is not installed or because the Fernet key is invalid
Returns
Fernet object
Raises
AirflowException if there’s a problem trying to load Fernet
Hook
Hooks are interfaces to external platforms and databases implementing a common interface when possible and acting as building blocks for operators
class airflow.hooks.dbapi_hook.DbApiHook(*args, **kwargs)[source]
Bases: airflow.hooks.base_hook.BaseHook
Abstract base class for sql hooks
bulk_dump(table, tmp_file)[source]
Dumps a database table into a tab-delimited file
Parameters
· table (str) – The name of the source table
· tmp_file (str) – The path of the target file
bulk_load(table, tmp_file)[source]
Loads a tab-delimited file into a database table
Parameters
· table (str) – The name of the target table
· tmp_file (str) – The path of the file to load into the table
get_autocommit(conn)[source]
Get autocommit setting for the provided connection. Return True if conn.autocommit is set to True. Return False if conn.autocommit is not set, or is set to False, or conn does not support autocommit
Parameters
conn (connection object) – Connection to get autocommit setting from
Returns
connection autocommit setting
Return type
bool
get_conn()[source]
Returns a connection object
get_cursor()[source]
Returns a cursor
get_first(sql, parameters=None)[source]
Executes the sql and returns the first resulting row
Parameters
· sql (str or list) – the sql statement to be executed (str) or a list of sql statements to execute
· parameters (mapping or iterable) – The parameters to render the SQL query with
get_pandas_df(sql, parameters=None)[source]
Executes the sql and returns a pandas dataframe
Parameters
· sql (str or list) – the sql statement to be executed (str) or a list of sql statements to execute
· parameters (mapping or iterable) – The parameters to render the SQL query with
get_records(sql, parameters=None)[source]
Executes the sql and returns a set of records
Parameters
· sql (str or list) – the sql statement to be executed (str) or a list of sql statements to execute
· parameters (mapping or iterable) – The parameters to render the SQL query with
insert_rows(table, rows, target_fields=None, commit_every=1000, replace=False)[source]
A generic way to insert a set of tuples into a table; a new transaction is created every commit_every rows
Parameters
· table (str) – Name of the target table
· rows (iterable of tuples) – The rows to insert into the table
· target_fields (iterable of strings) – The names of the columns to fill in the table
· commit_every (int) – The maximum number of rows to insert in one transaction Set to 0 to insert all rows in one transaction
· replace (bool) – Whether to replace instead of insert
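The commit_every batching described above can be sketched against sqlite3 from the standard library. This is a simplified illustration of the behaviour, not the real DbApiHook code (which also handles cell serialization and the replace flag); the table and column names are made up.

```python
import sqlite3

# Sketch of insert_rows batching: commit a transaction every
# `commit_every` rows, then commit any remainder at the end.
def insert_rows(conn, table, rows, target_fields, commit_every=1000):
    placeholders = ", ".join("?" for _ in target_fields)
    sql = "INSERT INTO %s (%s) VALUES (%s)" % (
        table, ", ".join(target_fields), placeholders)
    cur = conn.cursor()
    for i, row in enumerate(rows, 1):
        cur.execute(sql, row)
        if commit_every and i % commit_every == 0:
            conn.commit()  # new transaction every commit_every rows
    conn.commit()  # commit the remainder

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
insert_rows(conn, "users", [(1, "ada"), (2, "grace")], ("id", "name"),
            commit_every=1)
print(conn.execute("SELECT COUNT(*) FROM users").fetchone()[0])  # 2
```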
run(sql, autocommit=False, parameters=None)[source]
Runs a command or a list of commands. Pass a list of sql statements to the sql parameter to get them to execute sequentially
Parameters
· sql (str or list) – the sql statement to be executed (str) or a list of sql statements to execute
· autocommit (bool) – What to set the connection’s autocommit setting to before executing the query
· parameters (mapping or iterable) – The parameters to render the SQL query with
set_autocommit(conn, autocommit)[source]
Sets the autocommit flag on the connection
class airflow.hooks.docker_hook.DockerHook(docker_conn_id='docker_default', base_url=None, version=None, tls=None)[source]
Bases: airflow.hooks.base_hook.BaseHook, airflow.utils.log.logging_mixin.LoggingMixin
Interact with a private Docker registry
Parameters
docker_conn_id (str) – ID of the Airflow connection where credentials and extra configuration are stored
class airflow.hooks.hive_hooks.HiveCliHook(hive_cli_conn_id='hive_cli_default', run_as=None, mapred_queue=None, mapred_queue_priority=None, mapred_job_name=None)[source]
Bases: airflow.hooks.base_hook.BaseHook
Simple wrapper around the hive CLI
It also supports the beeline, a lighter CLI that runs JDBC and is replacing the heavier traditional CLI. To enable beeline, set the use_beeline param in the extra field of your connection as in {"use_beeline": true}
Note that you can also set default hive CLI parameters using the hive_cli_params to be used in your connection as in {"hive_cli_params": "-hiveconf mapred.job.tracker=some.jobtracker:444"}. Parameters passed here can be overridden by run_cli's hive_conf param
The extra connection parameter auth gets passed as in the jdbc connection string as is
Parameters
· mapred_queue (str) – queue used by the Hadoop Scheduler (Capacity or Fair)
· mapred_queue_priority (str) – priority within the job queue Possible settings include VERY_HIGH HIGH NORMAL LOW VERY_LOW
· mapred_job_name (str) – This name will appear in the jobtracker This can make monitoring easier
load_df(df, table, field_dict=None, delimiter=',', encoding='utf8', pandas_kwargs=None, **kwargs)[source]
Loads a pandas DataFrame into hive
Hive data types will be inferred if not passed but column names will not be sanitized
Parameters
· df (DataFrame) – DataFrame to load into a Hive table
· table (str) – target Hive table use dot notation to target a specific database
· field_dict (OrderedDict) – mapping from column name to hive data type Note that it must be OrderedDict so as to keep columns’ order
· delimiter (str) – field delimiter in the file
· encoding (str) – str encoding to use when writing DataFrame to file
· pandas_kwargs (dict) – passed to DataFrame.to_csv
· kwargs – passed to self.load_file
load_file(filepath, table, delimiter=',', field_dict=None, create=True, overwrite=True, partition=None, recreate=False, tblproperties=None)[source]
Loads a local file into Hive
Note that the table generated in Hive uses STORED AS textfile, which isn't the most efficient serialization format. If a large amount of data is loaded and/or if the table gets queried considerably, you may want to use this operator only to stage the data into a temporary table before loading it into its final destination using a HiveOperator
Parameters
· filepath (str) – local filepath of the file to load
· table (str) – target Hive table use dot notation to target a specific database
· delimiter (str) – field delimiter in the file
· field_dict (OrderedDict) – A dictionary of the fields name in the file as keys and their Hive types as values Note that it must be OrderedDict so as to keep columns’ order
· create (bool) – whether to create the table if it doesn’t exist
· overwrite (bool) – whether to overwrite the data in table or partition
· partition (dict) – target partition as a dict of partition columns and values
· recreate (bool) – whether to drop and recreate the table at every execution
· tblproperties (dict) – TBLPROPERTIES of the hive table being created
run_cli(hql, schema=None, verbose=True, hive_conf=None)[source]
Run an hql statement using the hive cli. If hive_conf is specified it should be a dict, and the entries will be set as key/value pairs in HiveConf
Parameters
hive_conf (dict) – if specified these key value pairs will be passed to hive as -hiveconf "key"="value". Note that they will be passed after the hive_cli_params and thus will override whatever values are specified in the database
>>> hh = HiveCliHook()
>>> result = hh.run_cli("USE airflow;")
>>> ("OK" in result)
True
test_hql(hql)[source]
Test an hql statement using the hive cli and EXPLAIN
class airflow.hooks.hive_hooks.HiveMetastoreHook(metastore_conn_id='metastore_default')[source]
Bases: airflow.hooks.base_hook.BaseHook
Wrapper to interact with the Hive Metastore
check_for_named_partition(schema table partition_name)[source]
Checks whether a partition with a given name exists
Parameters
· schema (str) – Name of hive schema (database) @table belongs to
· table – Name of hive table @partition belongs to
Partition
Name of the partitions to check for (eg a=b/c=d)
Return type
bool
>>> hh = HiveMetastoreHook()
>>> t = 'static_babynames_partitioned'
>>> hh.check_for_named_partition('airflow', t, "ds=2015-01-01")
True
>>> hh.check_for_named_partition('airflow', t, "ds=xxx")
False
check_for_partition(schema table partition)[source]
Checks whether a partition exists
Parameters
· schema (str) – Name of hive schema (database) @table belongs to
· table – Name of hive table @partition belongs to
Partition
Expression that matches the partitions to check for (eg a = 'b' AND c = 'd')
Return type
bool
>>> hh = HiveMetastoreHook()
>>> t = 'static_babynames_partitioned'
>>> hh.check_for_partition('airflow', t, "ds='2015-01-01'")
True
get_databases(pattern='*')[source]
Get a metastore table object
get_metastore_client()[source]
Returns a Hive thrift client
get_partitions(schema, table_name, filter=None)[source]
Returns a list of all partitions in a table. Works only for tables with less than 32767 (java short max val) partitions. For subpartitioned table, the number might easily exceed this
>>> hh = HiveMetastoreHook()
>>> t = 'static_babynames_partitioned'
>>> parts = hh.get_partitions(schema='airflow', table_name=t)
>>> len(parts)
1
>>> parts
[{'ds': '2015-01-01'}]
get_table(table_name, db='default')[source]
Get a metastore table object
>>> hh = HiveMetastoreHook()
>>> t = hh.get_table(db='airflow', table_name='static_babynames')
>>> t.tableName
'static_babynames'
>>> [col.name for col in t.sd.cols]
['state', 'year', 'name', 'gender', 'num']
get_tables(db patternu'*')[source]
Get a metastore table object
max_partition(schema, table_name, field=None, filter_map=None)[source]
Returns the maximum value for all partitions with the given field in a table. If only one partition key exists in the table, that key will be used as field. filter_map should be a partition_key:partition_value map and will be used to filter out partitions.
Parameters
· schema (str) – schema name
· table_name (str) – table name
· field (str) – partition key to get max partition from
· filter_map (map) – partition_key:partition_value map used for partition filtering
>>> hh = HiveMetastoreHook()
>>> filter_map = {'ds': '2015-01-01'}
>>> t = 'static_babynames_partitioned'
>>> hh.max_partition(schema='airflow', table_name=t, field='ds', filter_map=filter_map)
'2015-01-01'
table_exists(table_name, db=u'default')[source]
Check if table exists
>>> hh = HiveMetastoreHook()
>>> hh.table_exists(db='airflow', table_name='static_babynames')
True
>>> hh.table_exists(db='airflow', table_name='does_not_exist')
False
class airflow.hooks.hive_hooks.HiveServer2Hook(hiveserver2_conn_id=u'hiveserver2_default')[source]
Bases: airflow.hooks.base_hook.BaseHook
Wrapper around the pyhive library
Note that the default authMechanism is PLAIN; to override it you can specify it in the extra field of your connection in the UI.
get_pandas_df(hql, schema=u'default')[source]
Get a pandas dataframe from a Hive query
>>> hh = HiveServer2Hook()
>>> sql = "SELECT * FROM airflow.static_babynames LIMIT 100"
>>> df = hh.get_pandas_df(sql)
>>> len(df.index)
100
get_records(hql, schema=u'default')[source]
Get a set of records from a Hive query
>>> hh = HiveServer2Hook()
>>> sql = "SELECT * FROM airflow.static_babynames LIMIT 100"
>>> len(hh.get_records(sql))
100
get_results(hql, schema=u'default', fetch_size=None, hive_conf=None)[source]
Get results of the provided hql in the target schema.
Parameters
· hql – hql to be executed
· schema – target schema, defaults to 'default'
· fetch_size – max size of result to fetch
· hive_conf – hive_conf to execute along with the hql
Returns
results of hql execution
to_csv(hql, csv_filepath, schema=u'default', delimiter=u',', lineterminator=u'\r\n', output_header=True, fetch_size=1000, hive_conf=None)[source]
Execute hql in the target schema and write results to a csv file.
Parameters
· hql – hql to be executed
· csv_filepath – filepath of the csv to write results into
· schema – target schema, defaults to 'default'
· delimiter – delimiter of the csv file
· lineterminator – lineterminator of the csv file
· output_header – whether to write the header of the csv file
· fetch_size – number of result rows to write into the csv file per fetch
· hive_conf – hive_conf to execute along with the hql
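The csv options above (delimiter, lineterminator, header) behave like Python's own csv module. A minimal sketch of what to_csv does with a batch of fetched rows, using only the standard library (the rows and header below are made up):

```python
import csv
import io

def rows_to_csv(rows, header, delimiter=",", lineterminator="\r\n", output_header=True):
    """Render fetched rows as CSV text: optional header row, then one line per row."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter=delimiter, lineterminator=lineterminator)
    if output_header:
        writer.writerow(header)
    writer.writerows(rows)
    return buf.getvalue()

print(rows_to_csv([("Mary", 100)], header=("name", "num")))
```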
airflow.hooks.hive_hooks.get_context_from_env_var()[source]
Extract context from environment variables, e.g. dag_id, task_id and execution_date, so that they can be used inside BashOperator and PythonOperator.
Returns
The context of interest
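Airflow exports task context to subprocesses as AIRFLOW_CTX_-prefixed environment variables; a hedged sketch of recovering them into a dict (the two variables set below are simulated for illustration):

```python
import os

# Simulate what Airflow would export before running a task command.
os.environ["AIRFLOW_CTX_DAG_ID"] = "example_dag"
os.environ["AIRFLOW_CTX_TASK_ID"] = "example_task"

def get_airflow_context():
    """Collect AIRFLOW_CTX_* environment variables into a lowercase-keyed dict."""
    prefix = "AIRFLOW_CTX_"
    return {k[len(prefix):].lower(): v
            for k, v in os.environ.items() if k.startswith(prefix)}

ctx = get_airflow_context()
print(ctx["dag_id"], ctx["task_id"])
```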
class airflow.hooks.http_hook.HttpHook(method='POST', http_conn_id='http_default')[source]
Bases: airflow.hooks.base_hook.BaseHook
Interact with HTTP servers.
Parameters
· http_conn_id (str) – connection that has the base API url, i.e. https://www.google.com/, and optional authentication credentials. Default headers can also be specified in the Extra field in json format.
· method (str) – the API method to be called
check_response(response)[source]
Checks the status code and raises an AirflowException on non-2XX or 3XX status codes.
Parameters
response (requests.response) – A requests response object
get_conn(headers=None)[source]
Returns an http session for use with requests.
Parameters
headers (dict) – additional headers to be passed through as a dictionary
run(endpoint, data=None, headers=None, extra_options=None)[source]
Performs the request.
Parameters
· endpoint (str) – the endpoint to be called, i.e. resource/v1/query
· data (dict) – payload to be uploaded or request parameters
· headers (dict) – additional headers to be passed through as a dictionary
· extra_options (dict) – additional options to be used when executing the request, i.e. {'check_response': False} to avoid checking/raising exceptions on non-2XX or 3XX status codes

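The check_response/extra_options interaction can be sketched without a live server. This is a simplified stand-in (the AirflowException class here is a local placeholder, not the real airflow.exceptions import):

```python
class AirflowException(Exception):
    """Local stand-in for airflow.exceptions.AirflowException (for this sketch only)."""

def check_response(status_code, extra_options=None):
    """Raise on non-2XX/3XX status codes unless extra_options disables the check,
    mirroring the behaviour described for HttpHook.run()."""
    extra_options = extra_options or {}
    if not extra_options.get("check_response", True):
        return  # caller opted out of status checking
    if not (200 <= status_code < 400):
        raise AirflowException("HTTP error: %s" % status_code)

check_response(200)                              # 2XX: no exception
check_response(500, {"check_response": False})   # check suppressed
try:
    check_response(404)
except AirflowException as e:
    print("raised:", e)
```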
run_and_check(session, prepped_request, extra_options)[source]
Grabs extra options like timeout and actually runs the request, checking for the result.
Parameters
· session (requests.Session) – the session to be used to execute the request
· prepped_request – the prepared request generated in run()
· extra_options (dict) – additional options to be used when executing the request, i.e. {'check_response': False} to avoid checking/raising exceptions on non-2XX or 3XX status codes

run_with_advanced_retry(_retry_args, *args, **kwargs)[source]
Runs Hook.run() with a Tenacity decorator attached to it. This is useful for connectors which might be disturbed by intermittent issues and should not instantly fail.
Parameters
_retry_args – Arguments which define the retry behaviour. See the Tenacity documentation at https://github.com/jd/tenacity

Example:
hook = HttpHook(http_conn_id='my_conn', method='GET')
retry_args = dict(
    wait=tenacity.wait_exponential(),
    stop=tenacity.stop_after_attempt(10),
    retry=tenacity.retry_if_exception_type(requests.exceptions.ConnectionError),
)
hook.run_with_advanced_retry(
    endpoint='v1/test',
    _retry_args=retry_args,
)
class airflow.hooks.druid_hook.DruidDbApiHook(*args, **kwargs)[source]
Bases: airflow.hooks.dbapi_hook.DbApiHook
Interact with the Druid broker
This hook is purely for users to query the Druid broker. For ingestion, please use DruidHook.
get_conn()[source]
Establish a connection to the Druid broker
get_pandas_df(sql, parameters=None)[source]
Executes the sql and returns a pandas dataframe
Parameters
· sql (str or list) – the sql statement to be executed (str) or a list of sql statements to execute
· parameters (mapping or iterable) – The parameters to render the SQL query with
get_uri()[source]
Get the connection uri for the Druid broker,
e.g. druid://localhost:8082/druid/v2/sql/
insert_rows(table, rows, target_fields=None, commit_every=1000)[source]
A generic way to insert a set of tuples into a table; a new transaction is created every commit_every rows.
Parameters
· table (str) – Name of the target table
· rows (iterable of tuples) – The rows to insert into the table
· target_fields (iterable of strings) – The names of the columns to fill in the table
· commit_every (int) – The maximum number of rows to insert in one transaction. Set to 0 to insert all rows in one transaction.
· replace (bool) – Whether to replace instead of insert
set_autocommit(conn, autocommit)[source]
Sets the autocommit flag on the connection
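The commit_every semantics above amount to chunking the row iterable and committing one transaction per chunk; a minimal sketch of that chunking (not the hook's actual implementation):

```python
def batches(rows, commit_every=1000):
    """Yield rows in chunks of commit_every; commit_every <= 0 means one
    single batch (i.e. one transaction for all rows), as insert_rows describes."""
    rows = list(rows)
    if commit_every <= 0:
        yield rows
        return
    for i in range(0, len(rows), commit_every):
        yield rows[i:i + commit_every]

print([len(b) for b in batches(range(25), commit_every=10)])
```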
class airflow.hooks.druid_hook.DruidHook(druid_ingest_conn_id='druid_ingest_default', timeout=1, max_ingestion_time=None)[source]
Bases: airflow.hooks.base_hook.BaseHook
Connection to Druid overlord for ingestion
Parameters
· druid_ingest_conn_id (str) – The connection id to the Druid overlord machine which accepts index jobs
· timeout (int) – The interval between polling the Druid job for the status of the ingestion job. Must be greater than or equal to 1.
· max_ingestion_time (int) – The maximum ingestion time before assuming the job failed
class airflow.hooks.hdfs_hook.HDFSHook(hdfs_conn_id='hdfs_default', proxy_user=None, autoconfig=False)[source]
Bases: airflow.hooks.base_hook.BaseHook
Interact with HDFS. This class is a wrapper around the snakebite library.
Parameters
· hdfs_conn_id – Connection id to fetch connection info
· proxy_user (str) – effective user for HDFS operations
· autoconfig (bool) – use snakebite's automatically configured client
get_conn()[source]
Returns a snakebite HDFSClient object.
class airflow.hooks.mssql_hook.MsSqlHook(*args, **kwargs)[source]
Bases: airflow.hooks.dbapi_hook.DbApiHook
Interact with Microsoft SQL Server
get_conn()[source]
Returns a mssql connection object
set_autocommit(conn, autocommit)[source]
Sets the autocommit flag on the connection
class airflow.hooks.mysql_hook.MySqlHook(*args, **kwargs)[source]
Bases: airflow.hooks.dbapi_hook.DbApiHook
Interact with MySQL
You can specify charset in the extra field of your connection as {"charset": "utf8"}. Also you can choose cursor as {"cursor": "SSCursor"}. Refer to the MySQLdb.cursors documentation for more details.
bulk_dump(table, tmp_file)[source]
Dumps a database table into a tab-delimited file
bulk_load(table, tmp_file)[source]
Loads a tab-delimited file into a database table
get_autocommit(conn)[source]
A MySql connection gets autocommit in a different way.
Parameters
conn (connection object) – connection to get autocommit setting from
Returns
connection autocommit setting
Return type
bool
get_conn()[source]
Returns a mysql connection object
set_autocommit(conn, autocommit)[source]
A MySql connection sets autocommit in a different way.
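bulk_load expects the tmp_file to already be in tab-delimited form (one line per row, columns joined by tabs). A hedged sketch of producing such a file from Python tuples before handing it to the hook (the rows and table name are made up):

```python
import tempfile

def to_tab_delimited(rows):
    """Render rows as the tab-delimited text bulk_load expects."""
    return "".join("\t".join(str(col) for col in row) + "\n" for row in rows)

rows = [("Mary", 1880, 100), ("John", 1880, 150)]
with tempfile.NamedTemporaryFile("w", suffix=".tsv", delete=False) as f:
    f.write(to_tab_delimited(rows))
    tmp_file = f.name

# hook.bulk_load(table="babynames", tmp_file=tmp_file)  # hook usage, not run here
print(to_tab_delimited(rows))
```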
class airflow.hooks.pig_hook.PigCliHook(pig_cli_conn_id='pig_cli_default')[source]
Bases: airflow.hooks.base_hook.BaseHook
Simple wrapper around the pig CLI.
Note that you can also set default pig CLI properties using the pig_properties to be used in your connection, as in {"pig_properties": "-Dpig.tmpfilecompression=true"}
run_cli(pig, verbose=True)[source]
Run a pig script using the pig cli
>>> ph = PigCliHook()
>>> result = ph.run_cli("ls /;")
>>> ("hdfs://" in result)
True
class airflow.hooks.postgres_hook.PostgresHook(*args, **kwargs)[source]
Bases: airflow.hooks.dbapi_hook.DbApiHook
Interact with Postgres. You can specify ssl parameters in the extra field of your connection as {"sslmode": "require", "sslcert": "/path/to/cert.pem", etc}.
Note: For Redshift, use keepalives_idle in the extra connection parameters and set it to less than 300 seconds.
bulk_dump(table, tmp_file)[source]
Dumps a database table into a tab-delimited file
bulk_load(table, tmp_file)[source]
Loads a tab-delimited file into a database table
copy_expert(sql, filename, open=open)[source]
Executes SQL using psycopg2's copy_expert method. Necessary to execute a COPY command without access to a superuser.
Note: if this method is called with a "COPY FROM" statement and the specified input file does not exist, it creates an empty file and no data is loaded, but the operation succeeds. So if users want to be aware when the input file does not exist, they have to check its existence by themselves.
get_conn()[source]
Returns a connection object
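Given the note above about COPY FROM silently succeeding on a missing input file, one reasonable pattern is to guard the call yourself. A hedged sketch (safe_copy_from_sql is a hypothetical helper, not part of the hook):

```python
import os
import tempfile

def safe_copy_from_sql(table, filename):
    """Build a COPY ... FROM STDIN statement for copy_expert, but first verify
    the input file exists -- otherwise copy_expert would 'succeed' loading nothing."""
    if not os.path.isfile(filename):
        raise FileNotFoundError("input file for COPY does not exist: %s" % filename)
    return "COPY %s FROM STDIN" % table

with tempfile.NamedTemporaryFile("w", delete=False) as f:
    f.write("1\tMary\n")
    path = f.name

print(safe_copy_from_sql("babynames", path))
# hook.copy_expert(safe_copy_from_sql("babynames", path), path)  # hook usage, not run here
```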
class airflow.hooks.presto_hook.PrestoHook(*args, **kwargs)[source]
Bases: airflow.hooks.dbapi_hook.DbApiHook
Interact with Presto through PyHive
>>> ph = PrestoHook()
>>> sql = "SELECT count(1) AS num FROM airflow.static_babynames"
>>> ph.get_records(sql)
[[340698]]
get_conn()[source]
Returns a connection object
get_first(hql, parameters=None)[source]
Returns only the first row, regardless of how many rows the query returns
get_pandas_df(hql, parameters=None)[source]
Get a pandas dataframe from a sql query
get_records(hql, parameters=None)[source]
Get a set of records from Presto
insert_rows(table, rows, target_fields=None)[source]
A generic way to insert a set of tuples into a table
Parameters
· table (str) – Name of the target table
· rows (iterable of tuples) – The rows to insert into the table
· target_fields (iterable of strings) – The names of the columns to fill in the table
run(hql, parameters=None)[source]
Execute the statement against Presto. Can be used to create views.
class airflow.hooks.S3_hook.S3Hook(aws_conn_id='aws_default', verify=None)[source]
Bases: airflow.contrib.hooks.aws_hook.AwsHook
Interact with AWS S3, using the boto3 library
check_for_bucket(bucket_name)[source]
Check if bucket_name exists
Parameters
bucket_name (str) – the name of the bucket
check_for_key(key, bucket_name=None)[source]
Checks if a key exists in a bucket
Parameters
· key (str) – S3 key that will point to the file
· bucket_name (str) – Name of the bucket in which the file is stored
check_for_prefix(bucket_name, prefix, delimiter)[source]
Checks that a prefix exists in a bucket
Parameters
· bucket_name (str) – the name of the bucket
· prefix (str) – a key prefix
· delimiter (str) – the delimiter marks key hierarchy
check_for_wildcard_key(wildcard_key, bucket_name=None, delimiter='')[source]
Checks that a key matching a wildcard expression exists in a bucket
Parameters
· wildcard_key (str) – the path to the key
· bucket_name (str) – the name of the bucket
· delimiter (str) – the delimiter marks key hierarchy
copy_object(source_bucket_key, dest_bucket_key, source_bucket_name=None, dest_bucket_name=None, source_version_id=None)[source]
Creates a copy of an object that is already stored in S3.
Note: the S3 connection used here needs to have access to both source and destination bucket/key.
Parameters
· source_bucket_key (str) –
The key of the source object.
It can be either a full s3:// style url or a relative path from the root level.
When it's specified as a full s3:// url, please omit source_bucket_name.
· dest_bucket_key (str) –
The key of the object to copy to.
The convention to specify dest_bucket_key is the same as source_bucket_key.
· source_bucket_name (str) –
Name of the S3 bucket where the source object is in.
It should be omitted when source_bucket_key is provided as a full s3:// url.
· dest_bucket_name (str) –
Name of the S3 bucket to where the object is copied.
It should be omitted when dest_bucket_key is provided as a full s3:// url.
· source_version_id (str) – Version ID of the source object (OPTIONAL)
delete_objects(bucket, keys)[source]
Parameters
· bucket (str) – Name of the bucket in which you are going to delete object(s)
· keys (str or list) –
The key(s) to delete from S3 bucket
When keys is a string it’s supposed to be the key name of the single object to delete
When keys is a list it’s supposed to be the list of the keys to delete
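The str-or-list behaviour of keys can be sketched as a tiny normalization step (a hedged illustration, not the hook's internals):

```python
def normalize_keys(keys):
    """delete_objects accepts one key (str) or many (list); normalize to a list."""
    return [keys] if isinstance(keys, str) else list(keys)

print(normalize_keys("logs/2015-01-01.gz"))
print(normalize_keys(["a.txt", "b.txt"]))
```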
get_bucket(bucket_name)[source]
Returns a boto3.S3.Bucket object
Parameters
bucket_name (str) – the name of the bucket
get_key(key, bucket_name=None)[source]
Returns a boto3.s3.Object
Parameters
· key (str) – the path to the key
· bucket_name (str) – the name of the bucket
get_wildcard_key(wildcard_key, bucket_name=None, delimiter='')[source]
Returns a boto3.s3.Object object matching the wildcard expression
Parameters
· wildcard_key (str) – the path to the key
· bucket_name (str) – the name of the bucket
· delimiter (str) – the delimiter marks key hierarchy
list_keys(bucket_name, prefix='', delimiter='', page_size=None, max_items=None)[source]
Lists keys in a bucket under prefix and not containing delimiter
Parameters
· bucket_name (str) – the name of the bucket
· prefix (str) – a key prefix
· delimiter (str) – the delimiter marks key hierarchy
· page_size (int) – pagination size
· max_items (int) – maximum items to return
list_prefixes(bucket_name, prefix='', delimiter='', page_size=None, max_items=None)[source]
Lists prefixes in a bucket under prefix
Parameters
· bucket_name (str) – the name of the bucket
· prefix (str) – a key prefix
· delimiter (str) – the delimiter marks key hierarchy
· page_size (int) – pagination size
· max_items (int) – maximum items to return
load_bytes(bytes_data, key, bucket_name=None, replace=False, encrypt=False)[source]
Loads bytes to S3
This is provided as a convenience to drop a string in S3. It uses the boto infrastructure to ship a file to s3.
Parameters
· bytes_data (bytes) – bytes to set as content for the key
· key (str) – S3 key that will point to the file
· bucket_name (str) – Name of the bucket in which to store the file
· replace (bool) – A flag to decide whether or not to overwrite the key if it already exists
· encrypt (bool) – If True, the file will be encrypted on the server-side by S3 and will be stored in an encrypted form while at rest in S3
load_file(filename, key, bucket_name=None, replace=False, encrypt=False)[source]
Loads a local file to S3
Parameters
· filename (str) – name of the file to load
· key (str) – S3 key that will point to the file
· bucket_name (str) – Name of the bucket in which to store the file
· replace (bool) – A flag to decide whether or not to overwrite the key if it already exists. If replace is False and the key exists, an error will be raised.
· encrypt (bool) – If True, the file will be encrypted on the server-side by S3 and will be stored in an encrypted form while at rest in S3
load_string(string_data, key, bucket_name=None, replace=False, encrypt=False, encoding='utf-8')[source]
Loads a string to S3
This is provided as a convenience to drop a string in S3. It uses the boto infrastructure to ship a file to s3.
Parameters
· string_data (str) – str to set as content for the key
· key (str) – S3 key that will point to the file
· bucket_name (str) – Name of the bucket in which to store the file
· replace (bool) – A flag to decide whether or not to overwrite the key if it already exists
· encrypt (bool) – If True, the file will be encrypted on the server-side by S3 and will be stored in an encrypted form while at rest in S3
read_key(key, bucket_name=None)[source]
Reads a key from S3
Parameters
· key (str) – S3 key that will point to the file
· bucket_name (str) – Name of the bucket in which the file is stored
select_key(key, bucket_name=None, expression='SELECT * FROM S3Object', expression_type='SQL', input_serialization=None, output_serialization=None)[source]
Reads a key with S3 Select
Parameters
· key (str) – S3 key that will point to the file
· bucket_name (str) – Name of the bucket in which the file is stored
· expression (str) – S3 Select expression
· expression_type (str) – S3 Select expression type
· input_serialization (dict) – S3 Select input data serialization format
· output_serialization (dict) – S3 Select output data serialization format
Returns
retrieved subset of original data by S3 Select
Return type
str
See also
For more details about S3 Select parameters: http://boto3.readthedocs.io/en/latest/reference/services/s3.html#S3.Client.select_object_content
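The input_serialization/output_serialization dicts follow the shapes of boto3's select_object_content parameters. A hedged sketch for reading a gzipped CSV with a header row (the key and expression are made up):

```python
# Input: gzipped CSV whose first line is a header ("USE" makes column names
# available in the expression); output: plain CSV.
input_serialization = {
    "CSV": {"FileHeaderInfo": "USE"},
    "CompressionType": "GZIP",
}
output_serialization = {"CSV": {}}

# Hook usage (not run here):
# data = hook.select_key(
#     key="s3://my-bucket/babynames.csv.gz",
#     expression="SELECT s.name FROM S3Object s",
#     input_serialization=input_serialization,
#     output_serialization=output_serialization,
# )
print(input_serialization["CompressionType"])
```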
class airflow.hooks.slack_hook.SlackHook(token=None, slack_conn_id=None)[source]
Bases: airflow.hooks.base_hook.BaseHook
Interact with Slack, using the slackclient library
class airflow.hooks.sqlite_hook.SqliteHook(*args, **kwargs)[source]
Bases: airflow.hooks.dbapi_hook.DbApiHook
Interact with SQLite
get_conn()[source]
Returns a sqlite connection object
Community contributed hooks
class airflow.contrib.hooks.aws_dynamodb_hook.AwsDynamoDBHook(table_keys=None, table_name=None, region_name=None, *args, **kwargs)[source]
Bases: airflow.contrib.hooks.aws_hook.AwsHook
Interact with AWS DynamoDB
Parameters
· table_keys (list) – partition key and sort key
· table_name (str) – target DynamoDB table
· region_name (str) – aws region name (example: us-east-1)
write_batch_data(items)[source]
Write batch items to the DynamoDB table with provisioned throughput capacity
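Under the hood, write_batch_data relies on boto3's table.batch_writer, which buffers writes for you; DynamoDB's own BatchWriteItem API accepts at most 25 items per request. A hedged sketch of that chunking (illustration only, the hook does not expose this helper):

```python
def chunk_items(items, batch_size=25):
    """Split a large item list into DynamoDB-sized batches (max 25 per
    BatchWriteItem request)."""
    items = list(items)
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]

print([len(c) for c in chunk_items(range(60))])
```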
class airflow.contrib.hooks.aws_hook.AwsHook(aws_conn_id='aws_default', verify=None)[source]
Bases: airflow.hooks.base_hook.BaseHook
Interact with AWS. This class is a thin wrapper around the boto3 python library.
get_credentials(region_name=None)[source]
Get the underlying botocore.Credentials object.
This contains the attributes: access_key, secret_key and token.
get_session(region_name=None)[source]
Get the underlying boto3.session.
class airflow.contrib.hooks.aws_lambda_hook.AwsLambdaHook(function_name, region_name=None, log_type='None', qualifier='$LATEST', invocation_type='RequestResponse', *args, **kwargs)[source]
Bases: airflow.contrib.hooks.aws_hook.AwsHook
Interact with AWS Lambda
Parameters
· function_name (str) – AWS Lambda Function Name
· region_name (str) – AWS Region Name (example: us-west-2)
· log_type (str) – Tail Invocation Request
· qualifier (str) – AWS Lambda Function Version or Alias Name
· invocation_type (str) – AWS Lambda Invocation Type (RequestResponse, Event etc)
invoke_lambda(payload)[source]
Invoke Lambda Function
class airflow.contrib.hooks.aws_firehose_hook.AwsFirehoseHook(delivery_stream, region_name=None, *args, **kwargs)[source]
Bases: airflow.contrib.hooks.aws_hook.AwsHook
Interact with AWS Kinesis Firehose.
Parameters
· delivery_stream (str) – Name of the delivery stream
· region_name (str) – AWS region name (example: us-east-1)
get_conn()[source]
Returns an AwsHook connection object
put_records(records)[source]
Write batch records to Kinesis Firehose
class airflow.contrib.hooks.bigquery_hook.BigQueryHook(bigquery_conn_id='bigquery_default', delegate_to=None, use_legacy_sql=True)[source]
Bases: airflow.contrib.hooks.gcp_api_base_hook.GoogleCloudBaseHook, airflow.hooks.dbapi_hook.DbApiHook, airflow.utils.log.logging_mixin.LoggingMixin
Interact with BigQuery. This hook uses the Google Cloud Platform connection.
get_conn()[source]
Returns a BigQuery PEP 249 connection object
get_pandas_df(sql, parameters=None, dialect=None)[source]
Returns a Pandas DataFrame for the results produced by a BigQuery query. The DbApiHook method must be overridden because Pandas doesn't support PEP 249 connections, except for SQLite. See:
https://github.com/pydata/pandas/blob/master/pandas/io/sql.py#L447
https://github.com/pydata/pandas/issues/6900
Parameters
· sql (str) – The BigQuery SQL to execute
· parameters (mapping or iterable) – The parameters to render the SQL query with (not used, leave to override superclass method)
· dialect (str in {'legacy', 'standard'}) – Dialect of BigQuery SQL – legacy SQL or standard SQL, defaults to use self.use_legacy_sql if not specified
get_service()[source]
Returns a BigQuery service object
insert_rows(table, rows, target_fields=None, commit_every=1000)[source]
Insertion is currently unsupported. Theoretically, you could use BigQuery's streaming API to insert rows into a table, but this hasn't been implemented.
table_exists(project_id, dataset_id, table_id)[source]
Checks for the existence of a table in Google BigQuery.
Parameters
· project_id (str) – The Google cloud project in which to look for the table. The connection supplied to the hook must provide access to the specified project.
· dataset_id (str) – The name of the dataset in which to look for the table
· table_id (str) – The name of the table to check the existence of
class airflow.contrib.hooks.cassandra_hook.CassandraHook(cassandra_conn_id='cassandra_default')[source]
Bases: airflow.hooks.base_hook.BaseHook, airflow.utils.log.logging_mixin.LoggingMixin
Hook used to interact with Cassandra
Contact points can be specified as a comma-separated string in the 'hosts' field of the connection.
Port can be specified in the port field of the connection.
If SSL is enabled in Cassandra, pass in a dict in the extra field as kwargs for ssl.wrap_socket(). For example:
{
    'ssl_options': {
        'ca_certs': PATH_TO_CA_CERTS
    }
}
Default load balancing policy is RoundRobinPolicy. To specify a different LB policy:
· DCAwareRoundRobinPolicy
{
    'load_balancing_policy': 'DCAwareRoundRobinPolicy',
    'load_balancing_policy_args': {
        'local_dc': LOCAL_DC_NAME,                  # optional
        'used_hosts_per_remote_dc': SOME_INT_VALUE  # optional
    }
}
· WhiteListRoundRobinPolicy
{
    'load_balancing_policy': 'WhiteListRoundRobinPolicy',
    'load_balancing_policy_args': {
        'hosts': ['HOST1', 'HOST2', 'HOST3']
    }
}
· TokenAwarePolicy
{
    'load_balancing_policy': 'TokenAwarePolicy',
    'load_balancing_policy_args': {
        'child_load_balancing_policy': CHILD_POLICY_NAME,        # optional
        'child_load_balancing_policy_args': { ... }              # optional
    }
}
For details of the Cluster config, see cassandra.cluster.
get_conn()[source]
Returns a cassandra Session object
record_exists(table, keys)[source]
Checks if a record exists in Cassandra
Parameters
· table (str) – Target Cassandra table. Use dot notation to target a specific keyspace.
· keys (dict) – The keys and their values to check the existence
shutdown_cluster()[source]
Closes all sessions and connections associated with this Cluster
table_exists(table)[source]
Checks if a table exists in Cassandra
Parameters
table (str) – Target Cassandra table. Use dot notation to target a specific keyspace.
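The dot notation above ('keyspace.table') can be resolved with a short helper; a hedged sketch where default_keyspace is a hypothetical placeholder for the session's default keyspace:

```python
def split_table(table, default_keyspace="my_keyspace"):
    """Resolve 'keyspace.table' dot notation; fall back to the default keyspace
    when no keyspace is given (default_keyspace is a placeholder here)."""
    if "." in table:
        keyspace, _, name = table.partition(".")
        return keyspace, name
    return default_keyspace, table

print(split_table("airflow.static_babynames"))
print(split_table("users"))
```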
class airflow.contrib.hooks.cloudant_hook.CloudantHook(cloudant_conn_id='cloudant_default')[source]
Bases: airflow.hooks.base_hook.BaseHook
Interact with Cloudant.
This class is a thin wrapper around the cloudant python library.
db()[source]
Returns the Database object for this hook.
See the documentation for cloudant-python here: https://github.com/cloudant-labs/cloudant-python
class airflow.contrib.hooks.databricks_hook.DatabricksHook(databricks_conn_id='databricks_default', timeout_seconds=180, retry_limit=3, retry_delay=1.0)[source]
Bases: airflow.hooks.base_hook.BaseHook, airflow.utils.log.logging_mixin.LoggingMixin
Interact with Databricks
run_now(json)[source]
Utility function to call the api/2.0/jobs/run-now endpoint.
Parameters
json (dict) – The data used in the body of the request to the run-now endpoint
Returns
the run_id as a string
Return type
str
submit_run(json)[source]
Utility function to call the api/2.0/jobs/runs/submit endpoint.
Parameters
json (dict) – The data used in the body of the request to the submit endpoint
Returns
the run_id as a string
Return type
str
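A hedged sketch of a runs/submit request body in the shape the Databricks Jobs API 2.0 documents: a cluster spec plus one task. The cluster sizes, spark version and notebook path below are made-up placeholders:

```python
# Body for DatabricksHook.submit_run (placeholder values throughout).
run_payload = {
    "new_cluster": {
        "spark_version": "4.3.x-scala2.11",   # placeholder runtime version
        "node_type_id": "i3.xlarge",          # placeholder instance type
        "num_workers": 2,
    },
    "notebook_task": {
        "notebook_path": "/Users/someone@example.com/my-notebook",
    },
}

# run_id = hook.submit_run(run_payload)  # hook usage, not run here
print(sorted(run_payload))
```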
class airflow.contrib.hooks.datastore_hook.DatastoreHook(datastore_conn_id='google_cloud_datastore_default', delegate_to=None)[source]
Bases: airflow.contrib.hooks.gcp_api_base_hook.GoogleCloudBaseHook
Interact with Google Cloud Datastore. This hook uses the Google Cloud Platform connection.
This object is not thread safe. If you want to make multiple requests simultaneously, you will need to create a hook per thread.
allocate_ids(partialKeys)[source]
Allocate IDs for incomplete keys. See https://cloud.google.com/datastore/docs/reference/rest/v1/projects/allocateIds
Parameters
partialKeys – a list of partial keys
Returns
a list of full keys
begin_transaction()[source]
Get a new transaction handle
See also
https://cloud.google.com/datastore/docs/reference/rest/v1/projects/beginTransaction
Returns
a transaction handle
commit(body)[source]
Commit a transaction, optionally creating, deleting or modifying some entities
See also
https://cloud.google.com/datastore/docs/reference/rest/v1/projects/commit
Parameters
body – the body of the commit request
Returns
the response body of the commit request
delete_operation(name)[source]
Deletes the long-running operation
Parameters
name – the name of the operation resource
export_to_storage_bucket(bucket, namespace=None, entity_filter=None, labels=None)[source]
Export entities from Cloud Datastore to Cloud Storage for backup
get_conn(version='v1')[source]
Returns a Google Cloud Datastore service object
get_operation(name)[source]
Gets the latest state of a long-running operation
Parameters
name – the name of the operation resource
import_from_storage_bucket(bucket, file, namespace=None, entity_filter=None, labels=None)[source]
Import a backup from Cloud Storage to Cloud Datastore
lookup(keys, read_consistency=None, transaction=None)[source]
Lookup some entities by key
See also
https://cloud.google.com/datastore/docs/reference/rest/v1/projects/lookup
Parameters
· keys – the keys to lookup
· read_consistency – the read consistency to use: default, strong or eventual. Cannot be used with a transaction.
· transaction – the transaction to use, if any
Returns
the response body of the lookup request
poll_operation_until_done(name, polling_interval_in_seconds)[source]
Poll backup operation state until it's completed
rollback(transaction)[source]
Roll back a transaction
See also
https://cloud.google.com/datastore/docs/reference/rest/v1/projects/rollback
Parameters
transaction – the transaction to roll back
run_query(body)[source]
Run a query for entities
See also
https://cloud.google.com/datastore/docs/reference/rest/v1/projects/runQuery
Parameters
body – the body of the query request
Returns
the batch of query results
class airflow.contrib.hooks.discord_webhook_hook.DiscordWebhookHook(http_conn_id=None, webhook_endpoint=None, message='', username=None, avatar_url=None, tts=False, proxy=None, *args, **kwargs)[source]
Bases: airflow.hooks.http_hook.HttpHook
This hook allows you to post messages to Discord using incoming webhooks. Takes a Discord connection ID with a default relative webhook endpoint. The default endpoint can be overridden using the webhook_endpoint parameter (https://discordapp.com/developers/docs/resources/webhook).
Each Discord webhook can be pre-configured to use a specific username and avatar_url. You can override these defaults in this hook.
Parameters
· http_conn_id (str) – Http connection ID with host as "https://discord.com/api/" and default webhook endpoint in the extra field in the form of {"webhook_endpoint": "webhooks/{webhook.id}/{webhook.token}"}
· webhook_endpoint (str) – Discord webhook endpoint in the form of "webhooks/{webhook.id}/{webhook.token}"
· message (str) – The message you want to send to your Discord channel (max 2000 characters)
· username (str) – Override the default username of the webhook
· avatar_url (str) – Override the default avatar of the webhook
· tts (bool) – Is a text-to-speech message
· proxy (str) – Proxy to use to make the Discord webhook call
execute()[source]
Execute the Discord webhook call
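Since messages over 2000 characters are rejected, one option is to clamp them before building the hook. A hedged sketch (clamp_message is a hypothetical helper, not part of the hook):

```python
DISCORD_MESSAGE_LIMIT = 2000  # max characters per webhook message (per the docs above)

def clamp_message(message, limit=DISCORD_MESSAGE_LIMIT):
    """Truncate a message so the webhook call stays within Discord's limit,
    marking the cut with an ellipsis."""
    return message if len(message) <= limit else message[:limit - 1] + "…"

print(len(clamp_message("x" * 5000)))
```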
class airflow.contrib.hooks.emr_hook.EmrHook(emr_conn_id=None, *args, **kwargs)[source]
Bases: airflow.contrib.hooks.aws_hook.AwsHook
Interact with AWS EMR. emr_conn_id is only necessary for using the create_job_flow method.
create_job_flow(job_flow_overrides)[source]
Creates a job flow using the config from the EMR connection. Keys of the json extra hash may have the arguments of the boto3 run_job_flow method. Overrides for this config may be passed as the job_flow_overrides.
class airflow.contrib.hooks.fs_hook.FSHook(conn_id='fs_default')[source]
Bases: airflow.hooks.base_hook.BaseHook
Allows for interaction with a file server.
Connection should have a name and a path specified under extra:
example: Conn Id: fs_test, Conn Type: File (path), Host, Schema, Login, Password, Port: empty, Extra: {"path": "/tmp"}
class airflow.contrib.hooks.ftp_hook.FTPHook(ftp_conn_id='ftp_default')[source]
Bases: airflow.hooks.base_hook.BaseHook, airflow.utils.log.logging_mixin.LoggingMixin
Interact with FTP
Errors that may occur throughout but should be handled downstream
close_conn()[source]
Closes the connection An error will occur if the connection wasn’t ever opened
create_directory(path)[source]
Creates a directory on the remote system
Parameters
path (str) – full path to the remote directory to create
delete_directory(path)[source]
Deletes a directory on the remote system
Parameters
path (str) – full path to the remote directory to delete
delete_file(path)[source]
Removes a file on the FTP Server
Parameters
path (str) – full path to the remote file
describe_directory(path)[source]
Returns a dictionary of {filename: {attributes}} for all files on the remote system (where the MLSD command is supported)
Parameters
path (str) – full path to the remote directory
get_conn()[source]
Returns a FTP connection object
get_mod_time(path)[source]
Returns a datetime object representing the last time the file was modified
Parameters
path (string) – remote file path
get_size(path)[source]
Returns the size of a file (in bytes)
Parameters
path (string) – remote file path
list_directory(path, nlst=False)[source]
Returns a list of files on the remote system
Parameters
path (str) – full path to the remote directory to list
rename(from_name, to_name)[source]
Rename a file
Parameters
· from_name – rename file from name
· to_name – rename file to name
retrieve_file(remote_full_path, local_full_path_or_buffer, callback=None)[source]
Transfers the remote file to a local location.
If local_full_path_or_buffer is a string path, the file will be put at that location; if it is a file-like buffer, the file will be written to the buffer but not closed.
Parameters
· remote_full_path (str) – full path to the remote file
· local_full_path_or_buffer (str or file-like buffer) – full path to the local file or a file-like buffer
· callback (callable) – callback which is called each time a block of data is read. If you do not use a callback, these blocks will be written to the file or buffer passed in; if you do pass in a callback, note that writing to a file or buffer will need to be handled inside the callback. [default: output_handle.write()]
Example:
hook = FTPHook(ftp_conn_id='my_conn')
remote_path = '/path/to/remote/file'
local_path = '/path/to/local/file'

# with a custom callback (in this case displaying progress on each read)
def print_progress(percent_progress):
    self.log.info('Percent Downloaded: %s%%' % percent_progress)

total_downloaded = 0
total_file_size = hook.get_size(remote_path)
output_handle = open(local_path, 'wb')

def write_to_file_with_progress(data):
    global total_downloaded
    total_downloaded += len(data)
    output_handle.write(data)
    percent_progress = (total_downloaded / total_file_size) * 100
    print_progress(percent_progress)

hook.retrieve_file(remote_path, None, callback=write_to_file_with_progress)

# without a custom callback, data is written to the local_path
hook.retrieve_file(remote_path, local_path)
store_file(remote_full_path, local_full_path_or_buffer)[source]
Transfers a local file to the remote location
If local_full_path_or_buffer is a string path, the file will be read from that location; if it is a file-like buffer, the file will be read from the buffer but not closed
Parameters
· remote_full_path (str) – full path to the remote file
· local_full_path_or_buffer (str or filelike buffer) – full path to the local file or a filelike buffer
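Because store_file also accepts a file-like buffer, raw bytes can be uploaded without a temporary file. The sketch below assumes hook is an already-connected FTPHook; upload_bytes is a hypothetical helper, not part of the hook's API:

```python
import io

def upload_bytes(hook, remote_path, payload):
    # store_file accepts a file-like buffer in place of a local path,
    # so wrapping the bytes in io.BytesIO avoids a temporary file.
    buffer = io.BytesIO(payload)
    hook.store_file(remote_path, buffer)
    return len(payload)
```

The hook reads from, but does not close, the buffer, so the caller stays in control of its lifetime.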
class airflow.contrib.hooks.ftp_hook.FTPSHook(ftp_conn_id='ftp_default')[source]
Bases: airflow.contrib.hooks.ftp_hook.FTPHook
get_conn()[source]
Returns an FTPS connection object
class airflow.contrib.hooks.gcp_api_base_hook.GoogleCloudBaseHook(gcp_conn_id='google_cloud_default', delegate_to=None)[source]
Bases: airflow.hooks.base_hook.BaseHook, airflow.utils.log.logging_mixin.LoggingMixin
A base hook for Google cloud-related hooks. Google cloud has a shared REST API client that is built in the same way no matter which service you use. This class helps construct and authorize the credentials needed to then call apiclient.discovery.build() to actually discover and build a client for a Google cloud service
The class also contains some miscellaneous helper functions
All hooks derived from this base hook use the 'Google Cloud Platform' connection type. Two ways of authentication are supported:
Default credentials: Only the 'Project Id' is required. You'll need to have set up default credentials, such as by the GOOGLE_APPLICATION_DEFAULT environment variable or from the metadata server on Google Compute Engine
JSON key file: Specify 'Project Id', 'Key Path' and 'Scope'
Legacy P12 key files are not supported
class airflow.contrib.hooks.gcp_container_hook.GKEClusterHook(project_id, location)[source]
Bases: airflow.hooks.base_hook.BaseHook
create_cluster(cluster, retry, timeout)[source]
Creates a cluster, consisting of the specified number and type of Google Compute Engine instances
Parameters
· cluster (dict or google.cloud.container_v1.types.Cluster) – A Cluster protobuf or dict. If dict is provided, it must be of the same form as the protobuf message google.cloud.container_v1.types.Cluster
· retry (google.api_core.retry.Retry) – A retry object (google.api_core.retry.Retry) used to retry requests. If None is specified, requests will not be retried
· timeout (float) – The amount of time, in seconds, to wait for the request to complete. Note that if retry is specified, the timeout applies to each individual attempt
Returns
The full url to the new, or existing, cluster
Raises
ParseError: On JSON parsing problems when trying to convert dict. AirflowException: cluster is not dict type nor Cluster proto type
delete_cluster(name, retry, timeout)[source]
Deletes the cluster, including the Kubernetes endpoint and all worker nodes. Firewalls and routes that were configured during cluster creation are also deleted. Other Google Compute Engine resources that might be in use by the cluster (e.g. load balancer resources) will not be deleted if they weren't present at the initial create time
Parameters
· name (str) – The name of the cluster to delete
· retry (google.api_core.retry.Retry) – Retry object used to determine when/if to retry requests. If None is specified, requests will not be retried
· timeout (float) – The amount of time, in seconds, to wait for the request to complete. Note that if retry is specified, the timeout applies to each individual attempt
Returns
The full url to the delete operation if successful, else None
get_cluster(name, retry, timeout)[source]
Gets details of specified cluster
Parameters
· name (str) – The name of the cluster to retrieve
· retry (google.api_core.retry.Retry) – A retry object used to retry requests. If None is specified, requests will not be retried
· timeout (float) – The amount of time, in seconds, to wait for the request to complete. Note that if retry is specified, the timeout applies to each individual attempt
Returns
A google.cloud.container_v1.types.Cluster instance
get_operation(operation_name)[source]
Fetches the operation from Google Cloud
Parameters
operation_name (str) – Name of operation to fetch
Returns
The new updated operation from Google Cloud
wait_for_operation(operation)[source]
Given an operation, continuously fetches the status from Google Cloud until either completion or an error occurring
Parameters
operation (A google.cloud.container_V1.gapic.enums.Operator) – The Operation to wait for
Returns
A new, updated operation fetched from Google Cloud
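The fetch-until-done loop behind wait_for_operation can be sketched without GCP credentials by injecting the fetch function. This is an illustration of the polling pattern, not the hook's actual implementation; the dict shape ({'status': ..., 'error': ...}) is an assumption standing in for the real Operation protobuf:

```python
import time

DONE_STATES = {'DONE'}  # hypothetical terminal status set

def wait_for(fetch_operation, name, poll_interval=0, max_polls=100):
    """Poll an operation until it reaches a terminal state.

    fetch_operation stands in for hook.get_operation: it takes the
    operation name and returns a dict with a 'status' field.
    """
    for _ in range(max_polls):
        op = fetch_operation(name)
        if op.get('status') in DONE_STATES:
            if 'error' in op:
                # surface the operation's error to the caller
                raise RuntimeError(op['error'])
            return op
        time.sleep(poll_interval)
    raise TimeoutError('operation %s did not finish' % name)
```

In a DAG you would pass hook.get_operation as fetch_operation and a non-zero poll_interval.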
class airflow.contrib.hooks.gcp_dataflow_hook.DataFlowHook(gcp_conn_id='google_cloud_default', delegate_to=None, poll_sleep=10)[source]
Bases: airflow.contrib.hooks.gcp_api_base_hook.GoogleCloudBaseHook
get_conn()[source]
Returns a Google Cloud Dataflow service object
class airflow.contrib.hooks.gcp_dataproc_hook.DataProcHook(gcp_conn_id='google_cloud_default', delegate_to=None, api_version='v1beta2')[source]
Bases: airflow.contrib.hooks.gcp_api_base_hook.GoogleCloudBaseHook
Hook for Google Cloud Dataproc APIs
await(operation)
Waits for a Google Cloud Dataproc Operation to complete
get_conn()[source]
Returns a Google Cloud Dataproc service object
wait(operation)[source]
Waits for a Google Cloud Dataproc Operation to complete
class airflow.contrib.hooks.gcp_mlengine_hook.MLEngineHook(gcp_conn_id='google_cloud_default', delegate_to=None)[source]
Bases: airflow.contrib.hooks.gcp_api_base_hook.GoogleCloudBaseHook
create_job(project_id, job, use_existing_job_fn=None)[source]
Launches an MLEngine job and waits for it to reach a terminal state
Parameters
· project_id (str) – The Google Cloud project id within which MLEngine job will be launched
· job (dict) –
MLEngine Job object that should be provided to the MLEngine API such as
{
    'jobId': 'my_job_id',
    'trainingInput': {
        'scaleTier': 'STANDARD_1',
        ...
    }
}
· use_existing_job_fn (function) – In case that an MLEngine job with the same job_id already exists, this method (if provided) will decide whether we should use this existing job: continue waiting for it to finish, and return the job object. It should accept an MLEngine job object and return a boolean value indicating whether it is OK to reuse the existing job. If 'use_existing_job_fn' is not provided, we by default reuse the existing MLEngine job
Returns
The MLEngine job object if the job successfully reaches a terminal state (which might be a FAILED or CANCELLED state)
Return type
dict
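A job dict in the shape shown above can be assembled with a small helper, and a reuse predicate can be passed as use_existing_job_fn. Both helpers below are hypothetical illustrations (the field names 'jobId', 'trainingInput', 'scaleTier' and 'state' follow the MLEngine job object documented above):

```python
def build_training_job(job_id, scale_tier='STANDARD_1', **training_input):
    """Assemble the minimal job dict that create_job expects."""
    spec = {'scaleTier': scale_tier}
    spec.update(training_input)  # e.g. region, packageUris, pythonModule
    return {'jobId': job_id, 'trainingInput': spec}

def reuse_if_not_failed(existing_job):
    # A candidate use_existing_job_fn: only reuse a job that has not
    # already ended in a failed or cancelled terminal state.
    return existing_job.get('state') not in ('FAILED', 'CANCELLED')
```

Usage sketch: hook.create_job('my-project', build_training_job('my_job_id'), use_existing_job_fn=reuse_if_not_failed).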
create_model(project_id, model)[source]
Create a Model. Blocks until finished
create_version(project_id, model_name, version_spec)[source]
Creates the Version on Google Cloud ML Engine
Returns the operation if the version was created successfully and raises an error otherwise
delete_version(project_id, model_name, version_name)[source]
Deletes the given version of a model. Blocks until finished
get_conn()[source]
Returns a Google MLEngine service object
get_model(project_id, model_name)[source]
Gets a Model. Blocks until finished
list_versions(project_id, model_name)[source]
Lists all available versions of a model. Blocks until finished
set_default_version(project_id, model_name, version_name)[source]
Sets a version to be the default. Blocks until finished
class airflow.contrib.hooks.gcp_pubsub_hook.PubSubHook(gcp_conn_id='google_cloud_default', delegate_to=None)[source]
Bases: airflow.contrib.hooks.gcp_api_base_hook.GoogleCloudBaseHook
Hook for accessing Google PubSub
The GCP project against which actions are applied is determined by the project embedded in the Connection referenced by gcp_conn_id
acknowledge(project, subscription, ack_ids)[source]
Acknowledges the messages associated with the ack_ids from the specified PubSub subscription
Parameters
· project (str) – the GCP project name or ID in which the subscription exists
· subscription (str) – the PubSub subscription name whose messages to acknowledge; do not include the 'projects/{project}/subscriptions/' prefix
· ack_ids (list) – List of ReceivedMessage ackIds from a previous pull response
create_subscription(topic_project, topic, subscription=None, subscription_project=None, ack_deadline_secs=10, fail_if_exists=False)[source]
Creates a PubSub subscription, if it does not already exist
Parameters
· topic_project (str) – the GCP project ID of the topic that the subscription will be bound to
· topic (str) – the PubSub topic name that the subscription will be bound to; do not include the 'projects/{project}/topics/' prefix
· subscription (str) – the PubSub subscription name. If empty, a random name will be generated using the uuid module
· subscription_project (str) – the GCP project ID where the subscription will be created. If unspecified, topic_project will be used
· ack_deadline_secs (int) – Number of seconds that a subscriber has to acknowledge each message pulled from the subscription
· fail_if_exists (bool) – if set, raise an exception if the subscription already exists
Returns
subscription name which will be the system-generated value if the subscription parameter is not supplied
Return type
str
create_topic(project, topic, fail_if_exists=False)[source]
Creates a PubSub topic, if it does not already exist
Parameters
· project (str) – the GCP project ID in which to create the topic
· topic (str) – the PubSub topic name to create; do not include the 'projects/{project}/topics/' prefix
· fail_if_exists (bool) – if set, raise an exception if the topic already exists
delete_subscription(project, subscription, fail_if_not_exists=False)[source]
Deletes a PubSub subscription, if it exists
Parameters
· project (str) – the GCP project ID where the subscription exists
· subscription (str) – the PubSub subscription name to delete; do not include the 'projects/{project}/subscriptions/' prefix
· fail_if_not_exists (bool) – if set, raise an exception if the subscription does not exist
delete_topic(project, topic, fail_if_not_exists=False)[source]
Deletes a PubSub topic if it exists
Parameters
· project (str) – the GCP project ID in which to delete the topic
· topic (str) – the PubSub topic name to delete; do not include the 'projects/{project}/topics/' prefix
· fail_if_not_exists (bool) – if set, raise an exception if the topic does not exist
get_conn()[source]
Returns a PubSub service object
Return type
apiclient.discovery.Resource
publish(project, topic, messages)[source]
Publishes messages to a PubSub topic
Parameters
· project (str) – the GCP project ID in which to publish
· topic (str) – the PubSub topic to which to publish; do not include the 'projects/{project}/topics/' prefix
· messages (list of PubSub messages; see https://cloud.google.com/pubsub/docs/reference/rest/v1/PubsubMessage) – messages to publish; if the data field in a message is set, it should already be base64 encoded
pull(project, subscription, max_messages, return_immediately=False)[source]
Pulls up to max_messages messages from PubSub subscription
Parameters
· project (str) – the GCP project ID where the subscription exists
· subscription (str) – the PubSub subscription name to pull from; do not include the 'projects/{project}/subscriptions/' prefix
· max_messages (int) – The maximum number of messages to return from the PubSub API
· return_immediately (bool) – If set, the PubSub API will immediately return if no messages are available. Otherwise, the request will block for an undisclosed, but bounded period of time
Returns
A list of PubSub ReceivedMessage objects, each containing an ackId property and a message property, which includes the base64-encoded message content. See https://cloud.google.com/pubsub/docs/reference/rest/v1/projects.subscriptions/pull#ReceivedMessage
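The publish/pull/acknowledge cycle fits together as below. This is a sketch: hook is assumed to be a connected PubSubHook, and the ReceivedMessage dict shape ({'ackId': ..., 'message': {'data': ...}}) follows the REST reference linked above; drain and encode_message are hypothetical helpers:

```python
import base64

def encode_message(data):
    """publish() expects the message 'data' field already base64-encoded."""
    return {'data': base64.b64encode(data).decode('ascii')}

def drain(hook, project, subscription, batch_size=10):
    """Pull, decode and acknowledge one batch of messages."""
    received = hook.pull(project, subscription, batch_size,
                         return_immediately=True)
    payloads = [base64.b64decode(m['message']['data']) for m in received]
    ack_ids = [m['ackId'] for m in received]
    if ack_ids:
        # unacknowledged messages are redelivered after the ack deadline
        hook.acknowledge(project, subscription, ack_ids)
    return payloads
```

Note that acknowledging only after successful decoding gives at-least-once processing semantics.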
class airflow.contrib.hooks.gcs_hook.GoogleCloudStorageHook(google_cloud_storage_conn_id='google_cloud_default', delegate_to=None)[source]
Bases: airflow.contrib.hooks.gcp_api_base_hook.GoogleCloudBaseHook
Interact with Google Cloud Storage. This hook uses the Google Cloud Platform connection
copy(source_bucket, source_object, destination_bucket=None, destination_object=None)[source]
Copies an object from a bucket to another, with renaming if requested
destination_bucket or destination_object can be omitted, in which case source bucket/object is used, but not both
Parameters
· source_bucket (str) – The bucket of the object to copy from
· source_object (str) – The object to copy
· destination_bucket (str) – The destination of the object to be copied to. Can be omitted; then the same bucket is used
· destination_object (str) – The (renamed) path of the object if given. Can be omitted; then the same name is used
create_bucket(bucket_name, storage_class='MULTI_REGIONAL', location='US', project_id=None, labels=None)[source]
Creates a new bucket. Google Cloud Storage uses a flat namespace, so you can't create a bucket with a name that is already in use
See also
For more information, see Bucket Naming Guidelines: https://cloud.google.com/storage/docs/bucketnaming.html#requirements
Parameters
· bucket_name (str) – The name of the bucket
· storage_class (str) –
This defines how objects in the bucket are stored and determines the SLA and the cost of storage Values include
o MULTI_REGIONAL
o REGIONAL
o STANDARD
o NEARLINE
o COLDLINE
If this value is not specified when the bucket is created it will default to STANDARD
· location (str) –
The location of the bucket Object data for objects in the bucket resides in physical storage within this region Defaults to US
See also
https://developers.google.com/storage/docs/bucketlocations
· project_id (str) – The ID of the GCP Project
· labels (dict) – User-provided labels, in key/value pairs
Returns
If successful it returns the id of the bucket
delete(bucket, object, generation=None)[source]
Delete an object if versioning is not enabled for the bucket, or if generation parameter is used
Parameters
· bucket (str) – name of the bucket where the object resides
· object (str) – name of the object to delete
· generation (str) – if present permanently delete the object of this generation
Returns
True if succeeded
download(bucket, object, filename=None)[source]
Get a file from Google Cloud Storage
Parameters
· bucket (str) – The bucket to fetch from
· object (str) – The object to fetch
· filename (str) – If set a local file path where the file should be written to
exists(bucket, object)[source]
Checks for the existence of a file in Google Cloud Storage
Parameters
· bucket (str) – The Google cloud storage bucket where the object is
· object (str) – The name of the object to check in the Google cloud storage bucket
get_conn()[source]
Returns a Google Cloud Storage service object
get_crc32c(bucket, object)[source]
Gets the CRC32c checksum of an object in Google Cloud Storage
Parameters
· bucket (str) – The Google cloud storage bucket where the object is
· object (str) – The name of the object to check in the Google cloud storage bucket
get_md5hash(bucket, object)[source]
Gets the MD5 hash of an object in Google Cloud Storage
Parameters
· bucket (str) – The Google cloud storage bucket where the object is
· object (str) – The name of the object to check in the Google cloud storage bucket
get_size(bucket, object)[source]
Gets the size of a file in Google Cloud Storage
Parameters
· bucket (str) – The Google cloud storage bucket where the object is
· object (str) – The name of the object to check in the Google cloud storage bucket
is_updated_after(bucket, object, ts)[source]
Checks if an object is updated in Google Cloud Storage
Parameters
· bucket (str) – The Google cloud storage bucket where the object is
· object (str) – The name of the object to check in the Google cloud storage bucket
· ts (datetime) – The timestamp to check against
list(bucket, versions=None, maxResults=None, prefix=None, delimiter=None)[source]
List all objects from the bucket with the given string prefix in name
Parameters
· bucket (str) – bucket name
· versions (bool) – if true list all versions of the objects
· maxResults (int) – max count of items to return in a single page of responses
· prefix (str) – prefix string which filters objects whose name begin with this prefix
· delimiter (str) – filters objects based on the delimiter (e.g. '.csv')
Returns
a stream of object names matching the filtering criteria
rewrite(source_bucket, source_object, destination_bucket, destination_object=None)[source]
Has the same functionality as copy, except that will work on files over 5 TB, as well as when copying between locations and/or storage classes
destination_object can be omitted in which case source_object is used
Parameters
· source_bucket (str) – The bucket of the object to copy from
· source_object (str) – The object to copy
· destination_bucket (str) – The destination of the object to be copied to
· destination_object (str) – The (renamed) path of the object if given Can be omitted then the same name is used
upload(bucket, object, filename, mime_type='application/octet-stream', gzip=False)[source]
Uploads a local file to Google Cloud Storage
Parameters
· bucket (str) – The bucket to upload to
· object (str) – The object name to set when uploading the local file
· filename (str) – The local file path to the file to be uploaded
· mime_type (str) – The MIME type to set when uploading the file
· gzip (bool) – Option to compress file for upload
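Rather than always falling back to the hook's default of application/octet-stream, the MIME type can be guessed from the filename before calling upload. The sketch below assumes hook is a connected GoogleCloudStorageHook; guess_mime_type and upload_with_type are hypothetical helpers:

```python
import mimetypes

def guess_mime_type(filename, default='application/octet-stream'):
    """Pick a MIME type for upload(); fall back to the hook's own
    default when the extension is unknown."""
    guessed, _ = mimetypes.guess_type(filename)
    return guessed or default

def upload_with_type(hook, bucket, object_name, filename):
    # Forward the guessed content type so GCS serves the object correctly
    hook.upload(bucket, object_name, filename,
                mime_type=guess_mime_type(filename))
```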
class airflow.contrib.hooks.mongo_hook.MongoHook(conn_id='mongo_default', *args, **kwargs)[source]
Bases: airflow.hooks.base_hook.BaseHook
PyMongo Wrapper to Interact With Mongo Database. Mongo Connection Documentation: https://docs.mongodb.com/manual/reference/connection-string/index.html You can specify connection string options in extra field of your connection: https://docs.mongodb.com/manual/reference/connection-string/index.html#connection-string-options ex:
{"replicaSet": "test", "ssl": true, "connectTimeoutMS": 30000}
aggregate(mongo_collection, aggregate_query, mongo_db=None, **kwargs)[source]
Runs an aggregation pipeline and returns the results: https://api.mongodb.com/python/current/api/pymongo/collection.html#pymongo.collection.Collection.aggregate https://api.mongodb.com/python/current/examples/aggregation.html
find(mongo_collection, query, find_one=False, mongo_db=None, **kwargs)[source]
Runs a mongo find query and returns the results: https://api.mongodb.com/python/current/api/pymongo/collection.html#pymongo.collection.Collection.find
get_collection(mongo_collection, mongo_db=None)[source]
Fetches a mongo collection object for querying
Uses connection schema as DB unless specified
get_conn()[source]
Fetches PyMongo Client
insert_many(mongo_collection, docs, mongo_db=None, **kwargs)[source]
Inserts many docs into a mongo collection: https://api.mongodb.com/python/current/api/pymongo/collection.html#pymongo.collection.Collection.insert_many
insert_one(mongo_collection, doc, mongo_db=None, **kwargs)[source]
Inserts a single document into a mongo collection: https://api.mongodb.com/python/current/api/pymongo/collection.html#pymongo.collection.Collection.insert_one
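The connection's extra field takes JSON-serialized connection-string options, and find() proxies pymongo's Collection.find, so pymongo keyword arguments pass through. Sketch under those assumptions (hook is a connected MongoHook; both helpers are hypothetical):

```python
import json

def connection_extra(**options):
    """Serialize connection-string options for the Airflow connection's
    'extra' field, e.g. replicaSet, ssl, connectTimeoutMS."""
    return json.dumps(options)

def find_greater_than(hook, collection, field, threshold, limit=10):
    # Build a pymongo-style range query; extra kwargs (limit) are
    # forwarded to Collection.find by the hook.
    return hook.find(collection, {field: {'$gt': threshold}}, limit=limit)
```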
class airflow.contrib.hooks.pinot_hook.PinotDbApiHook(*args, **kwargs)[source]
Bases: airflow.hooks.dbapi_hook.DbApiHook
Connect to pinot db (https://github.com/linkedin/pinot) to issue pql
get_conn()[source]
Establish a connection to pinot broker through pinot dbapi
get_first(sql)[source]
Executes the sql and returns the first resulting row
Parameters
sql (str or list) – the sql statement to be executed (str) or a list of sql statements to execute
get_pandas_df(sql, parameters=None)[source]
Executes the sql and returns a pandas dataframe
Parameters
· sql (str or list) – the sql statement to be executed (str) or a list of sql statements to execute
· parameters (mapping or iterable) – The parameters to render the SQL query with
get_records(sql)[source]
Executes the sql and returns a set of records
Parameters
sql (str or list) – the sql statement to be executed (str) or a list of sql statements to execute
get_uri()[source]
Get the connection uri for pinot broker
e.g. http://localhost:9000/pql
insert_rows(table, rows, target_fields=None, commit_every=1000)[source]
A generic way to insert a set of tuples into a table; a new transaction is created every commit_every rows
Parameters
· table (str) – Name of the target table
· rows (iterable of tuples) – The rows to insert into the table
· target_fields (iterable of strings) – The names of the columns to fill in the table
· commit_every (int) – The maximum number of rows to insert in one transaction Set to 0 to insert all rows in one transaction
· replace (bool) – Whether to replace instead of insert
set_autocommit(conn autocommit)[source]
Sets the autocommit flag on the connection
class airflow.contrib.hooks.redshift_hook.RedshiftHook(aws_conn_id='aws_default', verify=None)[source]
Bases: airflow.contrib.hooks.aws_hook.AwsHook
Interact with AWS Redshift using the boto3 library
cluster_status(cluster_identifier)[source]
Return status of a cluster
Parameters
cluster_identifier (str) – unique identifier of a cluster
create_cluster_snapshot(snapshot_identifier, cluster_identifier)[source]
Creates a snapshot of a cluster
Parameters
· snapshot_identifier (str) – unique identifier for a snapshot of a cluster
· cluster_identifier (str) – unique identifier of a cluster
delete_cluster(cluster_identifier, skip_final_cluster_snapshot=True, final_cluster_snapshot_identifier='')[source]
Delete a cluster and optionally create a snapshot
Parameters
· cluster_identifier (str) – unique identifier of a cluster
· skip_final_cluster_snapshot (bool) – determines cluster snapshot creation
· final_cluster_snapshot_identifier (str) – name of final cluster snapshot
describe_cluster_snapshots(cluster_identifier)[source]
Gets a list of snapshots for a cluster
Parameters
cluster_identifier (str) – unique identifier of a cluster
restore_from_cluster_snapshot(cluster_identifier, snapshot_identifier)[source]
Restores a cluster from its snapshot
Parameters
· cluster_identifier (str) – unique identifier of a cluster
· snapshot_identifier (str) – unique identifier for a snapshot of a cluster
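Restores and deletions are asynchronous, so callers typically poll cluster_status until the cluster settles. The loop below is a sketch: get_status stands in for RedshiftHook.cluster_status so it can run without AWS credentials, and the status string 'available' follows the Redshift API:

```python
import time

def wait_for_cluster(get_status, cluster_identifier,
                     target='available', poll_interval=0, max_polls=60):
    """Block until the cluster reports the target status."""
    for _ in range(max_polls):
        status = get_status(cluster_identifier)
        if status == target:
            return status
        time.sleep(poll_interval)
    raise TimeoutError('cluster %s never became %s'
                       % (cluster_identifier, target))
```

Usage sketch: wait_for_cluster(hook.cluster_status, 'my-cluster', poll_interval=30) after restore_from_cluster_snapshot.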
class airflow.contrib.hooks.salesforce_hook.SalesforceHook(conn_id, *args, **kwargs)[source]
Bases: airflow.hooks.base_hook.BaseHook, airflow.utils.log.logging_mixin.LoggingMixin
describe_object(obj)[source]
Get the description of an object from Salesforce
This description is the object’s schema and some extra metadata that Salesforce stores for each object
Parameters
obj – Name of the Salesforce object that we are getting a description of
get_available_fields(obj)[source]
Get a list of all available fields for an object
This only returns the names of the fields
get_object_from_salesforce(obj, fields)[source]
Get all instances of the object from Salesforce. For each model, only get the fields specified in fields
All we really do underneath the hood is run
SELECT <fields> FROM <obj>;
make_query(query)[source]
Make a query to Salesforce Returns result in dictionary
Parameters
query – The query to make to Salesforce
sign_in()[source]
Sign into Salesforce
If we have already signed in, this will just return the original object
write_object_to_file(query_results, filename, fmt='csv', coerce_to_timestamp=False, record_time_added=False)[source]
Write query results to file
Acceptable formats are
· csv
comma-separated-values file. This is the default format
· json
JSON array Each element in the array is a different row
· ndjson
JSON array, but each element is newline-delimited instead of comma-delimited like in json
This requires a significant amount of cleanup. Pandas doesn't handle output to CSV and json in a uniform way. This is especially painful for datetime types. Pandas wants to write them as strings in CSV, but as millisecond Unix timestamps
By default, this function will try and leave all values as they are represented in Salesforce. You use the coerce_to_timestamp flag to force all datetimes to become Unix timestamps (UTC). This can be greatly beneficial as it will make all of your datetime fields look the same, and makes it easier to work with in other database environments
Parameters
· query_results – the results from a SQL query
· filename – the name of the file where the data should be dumped to
· fmt – the format you want the output in Default csv
· coerce_to_timestamp – True if you want all datetime fields to be converted into Unix timestamps False if you want them to be left in the same format as they were in Salesforce Leaving the value as False will result in datetimes being strings Defaults to False
· record_time_added – (optional) True if you want to add a Unix timestamp field to the resulting data that marks when the data was fetched from Salesforce Default False
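A fetch-then-export pipeline chains get_object_from_salesforce and write_object_to_file. Sketch only: hook is assumed to be a signed-in SalesforceHook, and check_format/export_object are hypothetical helpers that enforce the three documented formats:

```python
SUPPORTED_FORMATS = ('csv', 'json', 'ndjson')

def check_format(fmt):
    """Reject output formats write_object_to_file does not document."""
    if fmt not in SUPPORTED_FORMATS:
        raise ValueError('unsupported format: %s' % fmt)
    return fmt

def export_object(hook, obj, fields, filename, fmt='csv'):
    # Fetch the requested fields, then dump them with normalized
    # UTC Unix timestamps for easier downstream loading.
    results = hook.get_object_from_salesforce(obj, fields)
    return hook.write_object_to_file(results, filename,
                                     fmt=check_format(fmt),
                                     coerce_to_timestamp=True)
```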
class airflow.contrib.hooks.sftp_hook.SFTPHook(ftp_conn_id='sftp_default', *args, **kwargs)[source]
Bases: airflow.contrib.hooks.ssh_hook.SSHHook
This hook is inherited from SSH hook. Please refer to SSH hook for the input arguments
Interact with SFTP. Aims to be interchangeable with FTPHook
Pitfalls
· In contrast with FTPHook, describe_directory only returns size, type and modify. It doesn't return unixowner, unixmode, perm, unixgroup and unique
· retrieve_file and store_file only take a local full path and not a buffer
· If no mode is passed to create_directory, it will be created with 777 permissions
Errors that may occur throughout but should be handled downstream
close_conn()[source]
Closes the connection. An error will occur if the connection wasn't ever opened
create_directory(path, mode=777)[source]
Creates a directory on the remote system
Parameters
· path (str) – full path to the remote directory to create
· mode (int) – int representation of octal mode for directory
delete_directory(path)[source]
Deletes a directory on the remote system
Parameters
path (str) – full path to the remote directory to delete
delete_file(path)[source]
Removes a file on the FTP Server
Parameters
path (str) – full path to the remote file
describe_directory(path)[source]
Returns a dictionary of {filename: {attributes}} for all files on the remote system (where the MLSD command is supported)
Parameters
path (str) – full path to the remote directory
get_conn()[source]
Returns an SFTP connection object
list_directory(path)[source]
Returns a list of files on the remote system
Parameters
path (str) – full path to the remote directory to list
retrieve_file(remote_full_path, local_full_path)[source]
Transfers the remote file to a local location. If local_full_path is a string path, the file will be put at that location
Parameters
· remote_full_path (str) – full path to the remote file
· local_full_path (str) – full path to the local file
store_file(remote_full_path, local_full_path)[source]
Transfers a local file to the remote location. If local_full_path is a string path, the file will be read from that location
Parameters
· remote_full_path (str) – full path to the remote file
· local_full_path (str) – full path to the local file
class airflow.contrib.hooks.slack_webhook_hook.SlackWebhookHook(http_conn_id=None, webhook_token=None, message='', channel=None, username=None, icon_emoji=None, link_names=False, proxy=None, *args, **kwargs)[source]
Bases: airflow.hooks.http_hook.HttpHook
This hook allows you to post messages to Slack using incoming webhooks. Takes both Slack webhook token directly and connection that has Slack webhook token. If both supplied, Slack webhook token will be used
Each Slack webhook token can be preconfigured to use a specific channel, username and icon. You can override these defaults in this hook
Parameters
· http_conn_id (str) – connection that has Slack webhook token in the extra field
· webhook_token (str) – Slack webhook token
· message (str) – The message you want to send on Slack
· channel (str) – The channel the message should be posted to
· username (str) – The username to post to slack with
· icon_emoji (str) – The emoji to use as icon for the user posting to Slack
· link_names (bool) – Whether or not to find and link channel and usernames in your message
· proxy (str) – Proxy to use to make the Slack webhook call
execute()[source]
Remote Popen (actually execute the slack webhook call)
Parameters
· cmd – command to remotely execute
· kwargs – extra arguments to Popen (see subprocessPopen)
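Before posting, the hook turns the overridable defaults above into a JSON body for the incoming webhook. The builder below is a hedged approximation of that assembly (the real hook constructs the payload internally and posts it via HttpHook; field names 'text', 'channel', 'username', 'icon_emoji', 'link_names' follow the Slack incoming-webhook payload):

```python
import json

def build_slack_payload(message, channel=None, username=None,
                        icon_emoji=None, link_names=False):
    """Assemble the JSON body an incoming webhook accepts, with
    unset fields omitted so Slack keeps the webhook's defaults."""
    payload = {'text': message}
    if channel:
        payload['channel'] = channel
    if username:
        payload['username'] = username
    if icon_emoji:
        payload['icon_emoji'] = icon_emoji
    if link_names:
        payload['link_names'] = 1
    return json.dumps(payload)
```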
class airflow.contrib.hooks.spark_jdbc_hook.SparkJDBCHook(spark_app_name='airflow-spark-jdbc', spark_conn_id='spark-default', spark_conf=None, spark_py_files=None, spark_files=None, spark_jars=None, num_executors=None, executor_cores=None, executor_memory=None, driver_memory=None, verbose=False, principal=None, keytab=None, cmd_type='spark_to_jdbc', jdbc_table=None, jdbc_conn_id='jdbc-default', jdbc_driver=None, metastore_table=None, jdbc_truncate=False, save_mode=None, save_format=None, batch_size=None, fetch_size=None, num_partitions=None, partition_column=None, lower_bound=None, upper_bound=None, create_table_column_types=None, *args, **kwargs)[source]
Bases: airflow.contrib.hooks.spark_submit_hook.SparkSubmitHook
This hook extends the SparkSubmitHook specifically for performing data transfers to/from JDBC-based databases with Apache Spark
Parameters
· spark_app_name (str) – Name of the job (default airflow-spark-jdbc)
· spark_conn_id (str) – Connection id as configured in Airflow administration
· spark_conf (dict) – Any additional Spark configuration properties
· spark_py_files (str) – Additional python files used (zip egg or py)
· spark_files (str) – Additional files to upload to the container running the job
· spark_jars (str) – Additional jars to upload and add to the driver and executor classpath
· num_executors (int) – number of executors to run. This should be set so as to manage the number of connections made with the JDBC database
· executor_cores (int) – Number of cores per executor
· executor_memory (str) – Memory per executor (eg 1000M 2G)
· driver_memory (str) – Memory allocated to the driver (eg 1000M 2G)
· verbose (bool) – Whether to pass the verbose flag to sparksubmit for debugging
· keytab (str) – Full path to the file that contains the keytab
· principal (str) – The name of the kerberos principal used for keytab
· cmd_type (str) – Which way the data should flow. 2 possible values: spark_to_jdbc: data written by spark from metastore to jdbc; jdbc_to_spark: data written by spark from jdbc to metastore
· jdbc_table (str) – The name of the JDBC table
· jdbc_conn_id – Connection id used for connection to JDBC database
· jdbc_driver (str) – Name of the JDBC driver to use for the JDBC connection. This driver (usually a jar) should be passed in the 'jars' parameter
· metastore_table (str) – The name of the metastore table
· jdbc_truncate (bool) – (spark_to_jdbc only) Whether or not Spark should truncate or drop and recreate the JDBC table. This only takes effect if 'save_mode' is set to Overwrite. Also, if the schema is different, Spark cannot truncate, and will drop and recreate
· save_mode (str) – The Spark savemode to use (eg overwrite append etc)
· save_format (str) – (jdbc_to_spark only) The Spark save-format to use (e.g. parquet)
· batch_size (int) – (spark_to_jdbc only) The size of the batch to insert per round trip to the JDBC database Defaults to 1000
· fetch_size (int) – (jdbc_to_spark only) The size of the batch to fetch per round trip from the JDBC database Default depends on the JDBC driver
· num_partitions (int) – The maximum number of partitions that can be used by Spark simultaneously both for spark_to_jdbc and jdbc_to_spark operations This will also cap the number of JDBC connections that can be opened
· partition_column (str) – (jdbc_to_spark only) A numeric column to be used to partition the metastore table by. If specified, you must also specify num_partitions, lower_bound, upper_bound
· lower_bound (int) – (jdbc_to_spark only) Lower bound of the range of the numeric partition column to fetch. If specified, you must also specify num_partitions, partition_column, upper_bound
· upper_bound (int) – (jdbc_to_spark only) Upper bound of the range of the numeric partition column to fetch. If specified, you must also specify num_partitions, partition_column, lower_bound
· create_table_column_types – (spark_to_jdbc only) The database column data types to use instead of the defaults, when creating the table. Data type information should be specified in the same format as CREATE TABLE columns syntax (e.g. name CHAR(64), comments VARCHAR(1024)). The specified types should be valid spark sql data types
Type
jdbc_conn_id str
class airflow.contrib.hooks.spark_sql_hook.SparkSqlHook(sql, conf=None, conn_id='spark_sql_default', total_executor_cores=None, executor_cores=None, executor_memory=None, keytab=None, principal=None, master='yarn', name='default-name', num_executors=None, verbose=True, yarn_queue='default')[source]
Bases: airflow.hooks.base_hook.BaseHook
This hook is a wrapper around the spark-sql binary. It requires that the spark-sql binary is in the PATH
Parameters
· sql (str) – The SQL query to execute
· conf (str, format PROP=VALUE) – arbitrary Spark configuration property
· conn_id (str) – connection_id string
· total_executor_cores (int) – (Standalone & Mesos only) Total cores for all executors (Default: all the available cores on the worker)
· executor_cores (int) – (Standalone & YARN only) Number of cores per executor (Default: 2)
· executor_memory (str) – Memory per executor (e.g. 1000M, 2G) (Default: 1G)
· keytab (str) – Full path to the file that contains the keytab
· master (str) – spark://host:port, mesos://host:port, yarn, or local
· name (str) – Name of the job
· num_executors (int) – Number of executors to launch
· verbose (bool) – Whether to pass the verbose flag to spark-sql
· yarn_queue (str) – The YARN queue to submit to (Default: "default")
run_query(cmd='', **kwargs)[source]
Remote Popen (actually execute the spark-sql query)
Parameters
· cmd – command to remotely execute
· kwargs – extra arguments to Popen (see subprocess.Popen)
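As an illustration of what run_query ends up invoking, here is a hypothetical, simplified reconstruction of the spark-sql command line assembled from the constructor arguments above (the hook's real implementation differs in detail; the flag names are standard spark-sql options):

```python
def build_spark_sql_cmd(sql, master="yarn", conf=None, executor_cores=None,
                        executor_memory=None, num_executors=None,
                        yarn_queue="default", verbose=True, name="default-name"):
    # Assemble a spark-sql invocation from SparkSqlHook-style arguments.
    cmd = ["spark-sql", "-e", sql, "--master", master, "--name", name]
    if conf:
        cmd += ["--conf", conf]                      # PROP=VALUE
    if executor_cores:
        cmd += ["--executor-cores", str(executor_cores)]
    if executor_memory:
        cmd += ["--executor-memory", executor_memory]
    if num_executors:
        cmd += ["--num-executors", str(num_executors)]
    if master == "yarn":
        cmd += ["--queue", yarn_queue]               # YARN-only option
    if verbose:
        cmd.append("--verbose")
    return cmd

print(build_spark_sql_cmd("SELECT 1", executor_memory="2G", num_executors=4))
```

The resulting list is what would be handed to subprocess.Popen.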
class airflow.contrib.hooks.spark_submit_hook.SparkSubmitHook(conf=None, conn_id='spark_default', files=None, py_files=None, driver_classpath=None, jars=None, java_class=None, packages=None, exclude_packages=None, repositories=None, total_executor_cores=None, executor_cores=None, executor_memory=None, driver_memory=None, keytab=None, principal=None, name='default-name', num_executors=None, application_args=None, env_vars=None, verbose=False)[source]
Bases: airflow.hooks.base_hook.BaseHook, airflow.utils.log.logging_mixin.LoggingMixin
This hook is a wrapper around the spark-submit binary to kick off a spark-submit job. It requires that the spark-submit binary is in the PATH or that spark_home is supplied.
Parameters
· conf (dict) – Arbitrary Spark configuration properties
· conn_id (str) – The connection id as configured in Airflow administration. When an invalid connection_id is supplied, it will default to yarn.
· files (str) – Upload additional files to the executor running the job, separated by a comma. Files will be placed in the working directory of each executor. For example, serialized objects.
· py_files (str) – Additional python files used by the job, can be .zip, .egg or .py
· driver_classpath (str) – Additional, driver-specific, classpath settings
· jars (str) – Submit additional jars to upload and place them in executor classpath
· java_class (str) – the main class of the Java application
· packages (str) – Comma-separated list of maven coordinates of jars to include on the driver and executor classpaths
· exclude_packages (str) – Comma-separated list of maven coordinates of jars to exclude while resolving the dependencies provided in 'packages'
· repositories (str) – Comma-separated list of additional remote repositories to search for the maven coordinates given with 'packages'
· total_executor_cores (int) – (Standalone & Mesos only) Total cores for all executors (Default: all the available cores on the worker)
· executor_cores (int) – (Standalone, YARN and Kubernetes only) Number of cores per executor (Default: 2)
· executor_memory (str) – Memory per executor (e.g. 1000M, 2G) (Default: 1G)
· driver_memory (str) – Memory allocated to the driver (e.g. 1000M, 2G) (Default: 1G)
· keytab (str) – Full path to the file that contains the keytab
· principal (str) – The name of the kerberos principal used for keytab
· name (str) – Name of the job (default: airflow-spark)
· num_executors (int) – Number of executors to launch
· application_args (list) – Arguments for the application being submitted
· env_vars – Environment variables for spark-submit. It supports yarn and k8s mode too.
Parameters
· verbose (bool) – Whether to pass the verbose flag to the spark-submit process for debugging
submit(application='', **kwargs)[source]
Remote Popen to execute the spark-submit job
Parameters
· application (str) – Submitted application, jar or py file
· kwargs – extra arguments to Popen (see subprocess.Popen)
class airflow.contrib.hooks.sqoop_hook.SqoopHook(conn_id='sqoop_default', verbose=False, num_mappers=None, hcatalog_database=None, hcatalog_table=None, properties=None)[source]
Bases: airflow.hooks.base_hook.BaseHook, airflow.utils.log.logging_mixin.LoggingMixin
This hook is a wrapper around the sqoop 1 binary. To be able to use the hook, it is required that sqoop is in the PATH.
Additional arguments that can be passed via the 'extra' JSON field of the sqoop connection:
· job_tracker: Job tracker local|jobtracker:port
· namenode: Namenode
· lib_jars: Comma separated jar files to include in the classpath
· files: Comma separated files to be copied to the map reduce cluster
· archives: Comma separated archives to be unarchived on the compute machines
· password_file: Path to file containing the password
Parameters
· conn_id (str) – Reference to the sqoop connection
· verbose (bool) – Set sqoop to verbose
· num_mappers (int) – Number of map tasks to import in parallel
· properties (dict) – Properties to set via the -D argument
Popen(cmd, **kwargs)[source]
Remote Popen
Parameters
· cmd – command to remotely execute
· kwargs – extra arguments to Popen (see subprocess.Popen)
Returns
handle to subprocess
export_table(table, export_dir, input_null_string, input_null_non_string, staging_table, clear_staging_table, enclosed_by, escaped_by, input_fields_terminated_by, input_lines_terminated_by, input_optionally_enclosed_by, batch, relaxed_isolation, extra_export_options=None)[source]
Exports a Hive table to a remote location. Arguments are copies of direct sqoop command line arguments.
Parameters
· table – Table remote destination
· export_dir – Hive table to export
· input_null_string – The string to be interpreted as null for string columns
· input_null_non_string – The string to be interpreted as null for non-string columns
· staging_table – The table in which data will be staged before being inserted into the destination table
· clear_staging_table – Indicate that any data present in the staging table can be deleted
· enclosed_by – Sets a required field enclosing character
· escaped_by – Sets the escape character
· input_fields_terminated_by – Sets the field separator character
· input_lines_terminated_by – Sets the end-of-line character
· input_optionally_enclosed_by – Sets a field enclosing character
· batch – Use batch mode for underlying statement execution
· relaxed_isolation – Transaction isolation to read uncommitted for the mappers
· extra_export_options – Extra export options to pass as a dict. If a key doesn't have a value, just pass an empty string to it. Don't include the "--" prefix for sqoop options.
import_query(query, target_dir, append=False, file_type='text', split_by=None, direct=None, driver=None, extra_import_options=None)[source]
Imports a specific query from the rdbms to hdfs.
Parameters
· query – Free format query to run
· target_dir – HDFS destination dir
· append – Append data to an existing dataset in HDFS
· file_type – "avro", "sequence", "text" or "parquet". Imports data to hdfs in the specified format. Defaults to text.
· split_by – Column of the table used to split work units
· direct – Use direct import fast path
· driver – Manually specify JDBC driver class to use
· extra_import_options – Extra import options to pass as a dict. If a key doesn't have a value, just pass an empty string to it. Don't include the "--" prefix for sqoop options.
import_table(table, target_dir=None, append=False, file_type='text', columns=None, split_by=None, where=None, direct=False, driver=None, extra_import_options=None)[source]
Imports a table from a remote location to a target dir. Arguments are copies of direct sqoop command line arguments.
Parameters
· table – Table to read
· target_dir – HDFS destination dir
· append – Append data to an existing dataset in HDFS
· file_type – "avro", "sequence", "text" or "parquet". Imports data in the specified format. Defaults to text.
· columns – Columns to import from table
· split_by – Column of the table used to split work units
· where – WHERE clause to use during import
· direct – Use direct connector if exists for the database
· driver – Manually specify JDBC driver class to use
· extra_import_options – Extra import options to pass as a dict. If a key doesn't have a value, just pass an empty string to it. Don't include the "--" prefix for sqoop options.
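To make the extra_import_options convention concrete (a key with an empty-string value becomes a bare flag), here is a hypothetical sketch of how such an import command line could be assembled. It is not the hook's actual code; the option names are standard sqoop flags:

```python
def sqoop_import_cmd(table, target_dir=None, num_mappers=None,
                     file_type="text", extra_import_options=None):
    # Assemble a sqoop import command line from SqoopHook-style arguments.
    cmd = ["sqoop", "import", "--table", table]
    if target_dir:
        cmd += ["--target-dir", target_dir]
    if num_mappers:
        cmd += ["--num-mappers", str(num_mappers)]
    if file_type == "avro":
        cmd.append("--as-avrodatafile")
    elif file_type == "parquet":
        cmd.append("--as-parquetfile")
    elif file_type == "text":
        cmd.append("--as-textfile")
    for key, value in (extra_import_options or {}).items():
        cmd.append(f"--{key}")       # no "--" prefix expected in the dict key
        if value:                    # empty-string value -> bare flag only
            cmd.append(str(value))
    return cmd

print(sqoop_import_cmd("orders", target_dir="/data/orders", num_mappers=4,
                       extra_import_options={"hive-import": "",
                                             "hive-table": "stg.orders"}))
```

Passing {"hive-import": ""} therefore yields a lone --hive-import flag, while {"hive-table": "stg.orders"} yields a flag-plus-value pair.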
class airflow.contrib.hooks.ssh_hook.SSHHook(ssh_conn_id=None, remote_host=None, username=None, password=None, key_file=None, port=None, timeout=10, keepalive_interval=30)[source]
Bases: airflow.hooks.base_hook.BaseHook, airflow.utils.log.logging_mixin.LoggingMixin
Hook for ssh remote execution using Paramiko (ref: https://github.com/paramiko/paramiko). This hook also lets you create an ssh tunnel and serves as a basis for SFTP file transfer.
Parameters
· ssh_conn_id (str) – connection id from airflow Connections, from which all the required parameters can be fetched, like username, password or key_file. Though priority is given to the params passed during init.
· remote_host (str) – remote host to connect
· username (str) – username to connect to the remote_host
· password (str) – password of the username to connect to the remote_host
· key_file (str) – key file to use to connect to the remote_host
· port (int) – port of remote host to connect (Default is paramiko SSH_PORT)
· timeout (int) – timeout for the attempt to connect to the remote_host
· keepalive_interval (int) – send a keepalive packet to remote host every keepalive_interval seconds
get_conn()[source]
Opens an ssh connection to the remote host.
Returns: paramiko.SSHClient object
get_tunnel(remote_port, remote_host='localhost', local_port=None)[source]
Creates a tunnel between two hosts. Like ssh -L <host>.
Parameters
· remote_port (int) – The remote port to create a tunnel to
· remote_host (str) – The remote host to create a tunnel to (default localhost)
· local_port (int) – The local port to attach the tunnel to
Returns
sshtunnel.SSHTunnelForwarder object
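The forwarding that get_tunnel sets up corresponds to ssh's -L option: traffic to a local port is forwarded through the SSH connection to remote_host:remote_port. A dependency-free sketch of that mapping (the hook itself delegates to sshtunnel.SSHTunnelForwarder rather than building an ssh command):

```python
def ssh_tunnel_spec(remote_port, remote_host="localhost", local_port=None):
    # Equivalent ssh -L forwarding spec: <local>:<remote_host>:<remote_port>.
    # Port 0 asks the OS to pick a free local port, which is what tunnel
    # libraries typically do when no local_port is given.
    local = 0 if local_port is None else local_port
    return f"{local}:{remote_host}:{remote_port}"

print(ssh_tunnel_spec(5432, "db.internal", 15432))
```

So get_tunnel(5432, remote_host='db.internal', local_port=15432) makes the remote database reachable at localhost:15432.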
class airflow.contrib.hooks.vertica_hook.VerticaHook(*args, **kwargs)[source]
Bases: airflow.hooks.dbapi_hook.DbApiHook
Interact with Vertica.
get_conn()[source]
Returns a Vertica connection object
Executor
Executors are the mechanism by which task instances get run
class airflow.executors.local_executor.LocalExecutor(parallelism=32)[source]
Bases: airflow.executors.base_executor.BaseExecutor
LocalExecutor executes tasks locally in parallel. It uses the multiprocessing Python library and queues to parallelize the execution of tasks.
end()[source]
This method is called when the caller is done submitting jobs and wants to wait synchronously for the previously submitted jobs to all be done.
execute_async(key, command, queue=None, executor_config=None)[source]
This method will execute the command asynchronously.
start()[source]
Executors may need to get things started. For example, LocalExecutor starts N workers.
sync()[source]
Sync will get called periodically by the heartbeat method. Executors should override this to perform the gathering of statuses.
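The execute_async/sync/end lifecycle above can be modeled with a tiny self-contained sketch. A thread pool stands in for the multiprocessing workers the real LocalExecutor uses, so this illustrates the interface, not Airflow's implementation:

```python
from concurrent.futures import ThreadPoolExecutor

class MiniLocalExecutor:
    # Toy model of an executor's lifecycle: start N workers, accept
    # commands asynchronously, gather statuses, then drain on end().
    def __init__(self, parallelism=32):
        self.pool = ThreadPoolExecutor(max_workers=parallelism)
        self.futures = {}

    def execute_async(self, key, command):
        # Hand the command to a worker and return immediately.
        self.futures[key] = self.pool.submit(lambda: f"ran: {command}")

    def sync(self):
        # Called periodically by the heartbeat to gather finished statuses.
        return {k: f.result() for k, f in self.futures.items() if f.done()}

    def end(self):
        # Wait synchronously for everything submitted so far to finish.
        self.pool.shutdown(wait=True)
        return {k: f.result() for k, f in self.futures.items()}

ex = MiniLocalExecutor(parallelism=2)
ex.execute_async("t1", "echo a")
ex.execute_async("t2", "echo b")
print(ex.end())
```

The key/command pair mirrors how Airflow identifies a task instance and the shell command it runs for it.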
class airflow.executors.sequential_executor.SequentialExecutor[source]
Bases: airflow.executors.base_executor.BaseExecutor
This executor will only run one task instance at a time and can be used for debugging. It is also the only executor that can be used with sqlite, since sqlite doesn't support multiple connections.
Since we want airflow to work out of the box, it defaults to this SequentialExecutor alongside sqlite as you first install it.
end()[source]
This method is called when the caller is done submitting jobs and wants to wait synchronously for the previously submitted jobs to all be done.
execute_async(key, command, queue=None, executor_config=None)[source]
This method will execute the command asynchronously.
sync()[source]
Sync will get called periodically by the heartbeat method. Executors should override this to perform the gathering of statuses.
Community-contributed executors
class airflow.contrib.executors.mesos_executor.MesosExecutor(parallelism=32)[source]
Bases: airflow.executors.base_executor.BaseExecutor, airflow.www.utils.LoginMixin
MesosExecutor allows distributing the execution of task instances to multiple mesos workers.
Apache Mesos is a distributed systems kernel which abstracts CPU, memory, storage and other compute resources away from machines (physical or virtual), enabling fault-tolerant and elastic distributed systems to easily be built and run effectively. See http://mesos.apache.org/
end()[source]
This method is called when the caller is done submitting jobs and wants to wait synchronously for the previously submitted jobs to all be done.
execute_async(key, command, queue=None, executor_config=None)[source]
This method will execute the command asynchronously.
start()[source]
Executors may need to get things started. For example, LocalExecutor starts N workers.
sync()[source]
Sync will get called periodically by the heartbeat method. Executors should override this to perform the gathering of statuses.
New features in Apache Airflow 2.0
This article is a digest of the Apache Airflow 2.0 release material, drawing on the official "Apache Airflow 2.0 is here" announcement and Astronomer's (a cloud provider of Apache Airflow) "Introducing Airflow 2.0" post.
It walks through the new features of Apache Airflow 2.0 as translated official documentation plus annotations.
TaskFlow API (AIP-31): a new way of writing DAGs
DAGs are now much easier to write, especially when using the PythonOperator: dependencies between tasks are clearer and XCom is nicer to use.
A DAG written with the TaskFlow API:
from airflow.decorators import dag, task
from airflow.utils.dates import days_ago

@dag(default_args={'owner': 'airflow'}, schedule_interval=None, start_date=days_ago(2))
def tutorial_taskflow_api_etl():
    @task
    def extract():
        return {"1001": 301.27, "1002": 433.21, "1003": 502.22}

    @task
    def transform(order_data_dict: dict) -> dict:
        total_order_value = 0

        for value in order_data_dict.values():
            total_order_value += value

        return {"total_order_value": total_order_value}

    @task()
    def load(total_order_value: float):

        print(f"Total order value is {total_order_value:.2f}")

    order_data = extract()
    order_summary = transform(order_data)
    load(order_summary["total_order_value"])

tutorial_etl_dag = tutorial_taskflow_api_etl()
With the code above, writing an Airflow task becomes much simpler: you only need a DAG and tasks, and the decorators added above the methods do the rest, which greatly improves productivity.
For more details, see:
TaskFlow API Tutorial
TaskFlow API Documentation
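Conceptually, the @task decorator turns a call into a deferred node in the DAG rather than an immediate function call, with return values flowing to downstream tasks via XCom. A toy, dependency-free model of that idea (not Airflow's actual implementation, where the scheduler drives execution rather than an explicit resolve()):

```python
# Toy model: calling a decorated function does not run it; it records a
# node and returns a handle whose value flows to downstream tasks.
class TaskHandle:
    def __init__(self, fn, args):
        self.fn, self.args = fn, args

    def resolve(self):
        # Run upstream handles first, then this task -- the scheduler's job.
        resolved = [a.resolve() if isinstance(a, TaskHandle) else a
                    for a in self.args]
        return self.fn(*resolved)

def task(fn):
    def wrapper(*args):
        return TaskHandle(fn, args)
    return wrapper

@task
def extract():
    return [1, 2, 3]

@task
def transform(xs):
    return sum(xs)

handle = transform(extract())
print(handle.resolve())  # 6
```

The dependency transform << extract is implied purely by passing extract()'s handle into transform(), which is exactly the ergonomic win of the TaskFlow API.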
Fully specified REST API (AIP-32)
There is now a full, no-longer-experimental API with a comprehensive OpenAPI specification.
Documentation: the REST API ships with two online documentation UIs for interactive exploration, Swagger and the static Redoc, which makes the API very friendly to its users.
[Screenshots: Swagger and Redoc documentation UIs]
For more details on the REST API, see:
REST API Documentation
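As a sketch, the stable REST API can be called with nothing but the standard library. The /api/v1/dags endpoint and credentials below are assumptions: they match a default Airflow 2.0 webserver on localhost:8080 with the basic_auth backend enabled, and will differ in your deployment:

```python
import base64
import json
import urllib.request

def basic_auth_header(user, password):
    # Build the HTTP Basic auth header the API's basic_auth backend expects.
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return {"Authorization": f"Basic {token}"}

def list_dags(base_url="http://localhost:8080", user="admin", password="admin"):
    # GET /api/v1/dags -- list the DAGs known to this Airflow deployment.
    req = urllib.request.Request(f"{base_url}/api/v1/dags",
                                 headers=basic_auth_header(user, password))
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["dags"]
```

The same pattern (auth header plus a JSON endpoint from the OpenAPI spec) covers triggering DAG runs, inspecting task instances, and so on.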
Significantly improved scheduler performance
As part of AIP-15 (Scheduler HA + performance) and other work by Kamil, the Airflow Scheduler is significantly faster and now starts tasks much sooner.
Astronomer.io has benchmarked the scheduler - it's fast (the numbers were triple-checked because at first they seemed too good to believe).
For concrete numbers, see the two tables below: a task-latency benchmark and a scheduler horizontal-scaling benchmark.
Benchmarked performance improvements
We have been using task latency as the key metric to benchmark scheduler performance and validate improvements. Often evident in the Gantt view of the Airflow UI, we define task latency as the time it takes for a task to begin executing once its dependencies have been met.
Along with the above architectural changes, Airflow 2.0 also incorporates optimizations in the task startup process and in the scheduler loop, which reduce task latency.
To sufficiently test this without skewing numbers based on the actual task work time, we have chosen to benchmark using a simple BashOperator task with a trivial execution time. The benchmarking configuration was: 4 Celery Workers, PostgreSQL DB, 1 Web Server, 1 Scheduler.
Results for 1000 task runs, measured as total task latency (referenced below as task lag):
Scenario | DAG shape | 1.10.10 Total Task Lag | 2.0 beta Total Task Lag | Speedup
100 DAG files, 1 DAG per file, 10 tasks per DAG | Linear | 200 seconds | 11.6 seconds | 17 times
10 DAG files, 1 DAG per file, 100 tasks per DAG | Linear | 144 seconds | 14.3 seconds | 10 times
10 DAG files, 10 DAGs per file, 10 tasks per DAG | Binary Tree | 200 seconds | 12 seconds | 16 times
As the table shows, task-to-task latency in Airflow dropped substantially - at least a tenfold improvement over version 1.10.
Scalability benchmark
We have been using task throughput as the key metric for measuring Airflow scalability and to identify bottlenecks. Task throughput is measured in tasks per minute: the number of tasks that can be scheduled, queued, executed and monitored by Airflow every minute.
To sufficiently test this without skewing numbers based on the actual task work time, we have chosen to benchmark using a simple PythonOperator task with a trivial execution time. The benchmarking configuration was: Celery Workers, PostgreSQL DB, 1 Web Server.
Results for task throughput (metric explained above) using Airflow 2.0 beta builds, run with 5000 DAGs, each with 10 parallel tasks, on a single Airflow deployment. The benchmark was performed on Google Cloud and each Scheduler was run on an n1-standard-1 machine type.
Schedulers | Workers | Task Throughput (average) | Task Throughput (low) | Task Throughput (high)
1 | 12 | 285 | 248 | 323
2 | 12 | 541 | 492 | 578
3 | 12 | 698 | 632.5 | 774
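The average-throughput column above can be turned into a per-scheduler scaling efficiency (throughput divided by n times the single-scheduler throughput), a quick check of how close to linear the scaling really is:

```python
# Average task throughput (tasks/minute) per scheduler count, from the table.
throughput = {1: 285, 2: 541, 3: 698}

base = throughput[1]
# Efficiency of 1.0 would mean perfectly linear scaling.
efficiency = {n: round(t / (base * n), 2) for n, t in throughput.items()}
print(efficiency)
```

Two schedulers retain about 95% efficiency and three about 82%, i.e. close to but not perfectly linear.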
As the table shows, the Airflow 2.0 scheduler scales close to linearly, which significantly improves Airflow's scheduling capacity.
Scheduler high availability (AIP-15)
Multiple scheduler instances are now supported, both for resiliency (in case a scheduler fails) and for scheduling performance.
For full functionality you need Postgres 9.6+ or MySQL 8+ (MySQL 5 and MariaDB will, unfortunately, not work with multiple schedulers).
No configuration is needed to run multiple schedulers - just start a scheduler wherever you like (making sure it can access the DAG files) and it will cooperate with your existing schedulers through the database.
For more details on scheduler HA, see the Scheduler HA documentation.
Task Groups (AIP-34)
SubDAGs were commonly used for grouping tasks in the UI, but they had many drawbacks in their execution behavior (chiefly that they executed only a single task in parallel). To improve this experience, we introduce Task Groups: a method of organizing tasks that provides the same grouping as a SubDAG without any of the execution-time drawbacks.
SubDAGs still work for now, but we think any previous use of SubDAGs can be replaced by Task Groups. If you find a case where this isn't true, please open an issue on GitHub to let us know.
For more details on Task Groups, see the Task Group documentation.
A refreshed user interface
The Airflow UI has been given a visual refresh with updated styling.
An auto-refresh option for task states has been added to the Graph view, so you no longer need to press the refresh button continuously.
See the screenshots in the documentation for more.
Smart Sensors for reduced sensor load (AIP-17)
If you have a lot of sensors in your Airflow cluster, you may find that sensor execution in reschedule mode takes up a large share of the cluster. To improve this, a new mode called Smart Sensors has been added.
This feature is in early access: it has been well tested by Airbnb and is stable, but we reserve the right to make backwards-incompatible changes to it in a future release (if we have to, we will try very hard not to!).
Read more about it in the Smart Sensors documentation.
Simplified KubernetesExecutor
For Airflow 2.0, the KubernetesExecutor has been re-architected in a way that is simultaneously faster for Airflow users, easier to understand, and more flexible. Users can now access the full Kubernetes API to create a .yaml pod_template_file instead of specifying parameters in airflow.cfg.
The executor_config dictionary has also been replaced with the pod_override parameter, which takes a Kubernetes V1Pod object for a 1:1 setting override. These changes removed over three thousand lines of code from the KubernetesExecutor, which makes it run faster and creates fewer potential errors.
For more details on the simplified KubernetesExecutor, see:
Docs on pod_template_file
Docs on pod_override
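For illustration, a pod_override entry has roughly the following shape. In real code it is a kubernetes.client.models.V1Pod object rather than a plain dict; here the equivalent structure is shown as a dependency-free dict, with made-up resource values ("base" is the name Airflow gives the task's container):

```python
# Hypothetical executor_config sketch: override the resources of the
# task's "base" container. Mirrors the V1Pod structure as a plain dict.
executor_config = {
    "pod_override": {
        "spec": {
            "containers": [
                {
                    "name": "base",
                    "resources": {"requests": {"cpu": "1", "memory": "2Gi"}},
                }
            ]
        }
    }
}

print(executor_config["pod_override"]["spec"]["containers"][0]["name"])
```

Because the override is a full (partial) pod spec, anything the Kubernetes API allows - volumes, node selectors, sidecars - can be set per task.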
Airflow core and providers: splitting Airflow into 60+ packages
Airflow 2.0 is no longer one monolithic package. Airflow has been split into core and 61 (for now) provider packages. Each provider package covers a particular external service (Google, Amazon, Microsoft, Snowflake), a database (Postgres, MySQL) or a protocol (HTTP, FTP). You can now build a custom Airflow installation from these building blocks, selecting only what you need and adding whatever else you require. Some common providers (ftp, http, imap, sqlite) are installed automatically; other providers are installed automatically when you choose the appropriate extras while installing Airflow.
The provider architecture should make it much easier to obtain a fully customized yet consistent runtime with the right set of Python dependencies.
And it doesn't stop there: you can write your own custom providers and add things like custom connection types, connection forms, and extra links for your operators in a manageable way. You can build your own provider, install it as a Python package, and have your customizations show up right in the Airflow UI.
Jarek Potiuk has written about providers in much more detail on his blog.
For more details on providers, see:
Docs on the providers concept and writing custom providers
Docs on all the provider packages available
Security
As part of Airflow 2.0's effort to be more security-aware and to reduce the exposed surface area, security improvements show up in several functional areas. For example, operations in the new REST API now require authorization; similarly, the configuration settings now require the Fernet key to be specified.
Configuration
Configuration in the form of the airflow.cfg file has been further rationalized, particularly around the core module. In addition, a large number of configuration options have been deprecated or moved to component-specific configuration files, such as the pod-template-file for configuration related to Kubernetes execution.
基Apache Airflow企业级数框架架构设计
Spark作数计算引擎话基Airflow现代码编写企业级数框架直接通yml配置文件动生成DAG数流通yml配置指定DAG运行Spark SQLPyspark代码Hive SQL保存结果数库(MySQL)DML SQL通配置指定导入数文件路径数质量检查连接外部数源数导出文件发送数邮件通知邮件第三方软件台调API集成等样基配置开发方式助开发效率期维护效率提升框架中项目运行性通整框架性调优优化整框架Python语言开发提升开发效率时兼容Python代码写Airflow DAG
The framework creates 10 folders under its root directory, used as follows:
1. Stores the .py files of Airflow DAGs written in Python code (for compatibility).
2. Stores initialization configuration and startup scripts for the compute cluster (e.g. .sh scripts that launch AWS EMR).
3. Stores utility .py function libraries for: operating on files in cluster storage (querying, copying, moving, deleting, compressing/decompressing, encrypting/decrypting, etc.); validating business data (data completeness and data-quality checks); validating the correctness and completeness of the YAML configuration files; API integration with third-party software platforms; launching programs and processes; logging and log inspection; sending email; secure transmission and storage of system accounts and passwords; and exchanging data files with shared folders.
4. Stores the directory-structured YAML configuration files, organized by business model and business topic, with each business topic in its own YAML file. Each YAML file contains a contact email list, the accounts for the integrated instant-messaging software, the DAG start date and time, and the parallel task counts. The business-topic YAML files are kept together in a subdirectory under the framework root. The YAML files are further divided by the type of data acquisition and processing: acquiring base data from manual files or data sources; filtering and refining base data into processed data; joining, grouping and aggregating processed data into statistical data; serving report data from statistical data; and exporting data to files, data sources, shared folders, instant-messaging software or email. Each type is placed in its own subdirectory under the framework root.
5. Stores templates for the various Airflow DAG and sensor definitions and configuration settings.
6. Stores the Airflow and framework logs.
7. Stores the jar libraries the framework depends on.
8. Stores the PySpark script files.
9. Stores the SQL script files (Spark SQL, Hive SQL, and the DDL/DML/DCL for report database tables and views).
10. Stores the Python virtual environment.
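A minimal sketch of the configuration-driven DAG generation described above, with a plain dict standing in for a parsed YAML file (real code would use yaml.safe_load and Airflow's DAG/operator classes; every file path and field name here is an illustrative assumption, not a fixed schema):

```python
# A dict standing in for one business-topic YAML file.
config = {
    "dag_id": "sales_daily",
    "owner_emails": ["team@example.com"],
    "tasks": [
        {"id": "extract", "type": "spark_sql", "sql": "sql/extract.sql", "upstream": []},
        {"id": "aggregate", "type": "pyspark", "script": "pyspark/agg.py", "upstream": ["extract"]},
        {"id": "export", "type": "mysql_dml", "sql": "sql/export.sql", "upstream": ["aggregate"]},
    ],
}

def build_edges(cfg):
    # Turn each task's 'upstream' list into (upstream, downstream) edges --
    # exactly what set_upstream / the >> operator would encode on a real DAG.
    return [(up, t["id"]) for t in cfg["tasks"] for up in t["upstream"]]

print(build_edges(config))
```

A generator loop over all YAML files would then instantiate one DAG per file and dispatch each task's "type" to the matching operator (Spark SQL runner, PySpark submitter, MySQL DML executor, and so on).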