Commit f65b81a1 authored by wizardforcel

2021-05-07 17:30:13

Parent 9ab09cfd
@@ -20,6 +20,7 @@
+ [Spark 2.2.0 Chinese Documentation](doc/spark-220-doc-zh/SUMMARY.md)
+ [Storm 1.1.0 Chinese Documentation](doc/storm-110-doc-zh/SUMMARY.md)
+ [Zeppelin 0.7.2 Chinese Documentation](doc/zeppelin-072-doc-zh/SUMMARY.md)
+ [Hudi 0.5.0 Chinese Documentation](doc/hudi-050-doc-zh/SUMMARY.md)
## Contribution Guide
## What is Hudi?
> Hudi brings stream processing to big data, providing fresh data while being an order of magnitude more efficient than traditional batch processing.
Hudi (pronounced "hoodie") ingests and manages storage of large analytical datasets over DFS ([HDFS](http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html) or cloud stores) and provides three logical views for query access.
* Read Optimized View - Provides excellent query performance on pure columnar storage, much like plain [parquet](https://parquet.apache.org/) tables.
* Incremental View - Provides a change stream out of the dataset, to feed downstream jobs or ETL tasks.
* Near-Real-time Table - Provides queries on real-time data, using a combination of columnar and row-based storage (e.g. Parquet + [Avro](http://avro.apache.org/docs/current/mr.html)).
<figure>
<img class="docimage" src="../images/hudi_intro_1.png" alt="hudi_intro_1.png" />
</figure>
By carefully managing how data is laid out in storage and how it is exposed to queries, Hudi powers a rich data ecosystem in which external sources can be ingested in near real-time and made available to interactive SQL engines like [presto](https://prestodb.io/) and [spark](https://spark.apache.org/sql/), while also being consumed incrementally from processing/ETL frameworks such as [hive](https://hive.apache.org/) and [spark](https://spark.apache.org/docs/latest/) to build derived (Hudi) datasets.
Hudi consists broadly of a self-contained Spark library that builds datasets and integrates with existing data access/query engines. See the [quickstart](http://hudi.apache.org/cn/quickstart.html) for a demo.
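To make that library-style integration concrete, here is a minimal write sketch using the Spark datasource (a sketch only: it assumes Hudi 0.5.x option keys, a `spark-shell` with the hudi-spark bundle on the classpath, and made-up input paths and field names):

```scala
import org.apache.spark.sql.SaveMode

// Any dataframe with a record key, a partition field and an ordering
// (precombine) field can be written out as a Hudi dataset.
val df = spark.read.json("/tmp/input/events.json") // hypothetical input

df.write.format("org.apache.hudi").
  option("hoodie.datasource.write.recordkey.field", "uuid").
  option("hoodie.datasource.write.partitionpath.field", "region").
  option("hoodie.datasource.write.precombine.field", "ts").
  option("hoodie.table.name", "events").
  mode(SaveMode.Append). // appends perform upserts against existing records
  save("/tmp/hudi/events")
```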
* [Hudi 0.5.0 Introduction](README.md)
* Installation Guide
  * [Quickstart](quickstart.md)
  * [Use Cases](use_cases.md)
  * [Talks & Powered By](powered_by.md)
  * [Comparison](comparison.md)
  * [Docker Demo](docker_demo.md)
* Documentation
  * [Concepts](concepts.md)
  * [Writing Data](writing_data.md)
  * [Querying Data](querying_data.md)
  * [Configurations](configurations.md)
  * [Performance](performance.md)
  * [Administering](admin_guide.md)
(function(){
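// docsify plugin: appends the ApacheCN footer (site link, GitHub buttons, QQ group, ads) to every rendered page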
var footer = [
'<hr/>',
'<div align="center">',
' <p><a href="http://www.apachecn.org/" target="_blank"><font face="KaiTi" size="6" color="red">我们一直在努力</font></a><p>',
' <p><a href="https://github.com/apachecn/zeppelin-doc-zh/" target="_blank">apachecn/zeppelin-doc-zh</a></p>',
' <p><iframe align="middle" src="https://ghbtns.com/github-btn.html?user=apachecn&repo=zeppelin-doc-zh&type=watch&count=true&v=2" frameborder="0" scrolling="0" width="100px" height="25px"></iframe>',
' <iframe align="middle" src="https://ghbtns.com/github-btn.html?user=apachecn&repo=zeppelin-doc-zh&type=star&count=true" frameborder="0" scrolling="0" width="100px" height="25px"></iframe>',
' <iframe align="middle" src="https://ghbtns.com/github-btn.html?user=apachecn&repo=zeppelin-doc-zh&type=fork&count=true" frameborder="0" scrolling="0" width="100px" height="25px"></iframe>',
' <a target="_blank" href="//shang.qq.com/wpa/qunwpa?idkey=bcee938030cc9e1552deb3bd9617bbbf62d3ec1647e4b60d9cd6b6e8f78ddc03"><img border="0" src="//pub.idqqimg.com/wpa/images/group.png" alt="ML | ApacheCN" title="ML | ApacheCN"></a></p>',
' <p><span id="cnzz_stat_icon_1275211409"></span></p>',
' <div style="text-align:center;margin:0 0 10.5px;">',
' <ins class="adsbygoogle"',
' style="display:inline-block;width:728px;height:90px"',
' data-ad-client="ca-pub-3565452474788507"',
' data-ad-slot="2543897000"></ins>',
' </div>',
'</div>'
].join('\n')
var plugin = function(hook) {
hook.afterEach(function(html) {
return html + footer
})
hook.doneEach(function() {
(adsbygoogle = window.adsbygoogle || []).push({})
})
}
var plugins = window.$docsify.plugins || []
plugins.push(plugin)
window.$docsify.plugins = plugins
})()
\ No newline at end of file
(function(){
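// docsify plugin: reports each page view to Baidu share analytics via a tracking pixel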
var plugin = function(hook) {
hook.doneEach(function() {
new Image().src =
'//api.share.baidu.com/s.gif?r=' +
encodeURIComponent(document.referrer) +
"&l=" + encodeURIComponent(location.href)
})
}
var plugins = window.$docsify.plugins || []
plugins.push(plugin)
window.$docsify.plugins = plugins
})()
\ No newline at end of file
(function(){
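// docsify plugin: injects the Baidu Tongji stats script, keyed by $docsify.bdStatId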
var plugin = function(hook) {
hook.doneEach(function() {
window._hmt = window._hmt || []
var hm = document.createElement("script")
hm.src = "https://hm.baidu.com/hm.js?" + window.$docsify.bdStatId
document.querySelector("article").appendChild(hm)
})
}
var plugins = window.$docsify.plugins || []
plugins.push(plugin)
window.$docsify.plugins = plugins
})()
\ No newline at end of file
(function() {
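// docsify plugin: on every page load, pings five random CSDN article URLs for view counts;
// 'urlb64' below is base64 for 'https://blog.csdn.net/wizardforcel/article/details/'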
var ids = [
'109577065', '108852955', '102682374', '100520874', '92400861', '90312982',
'109963325', '109323014', '109301511', '108898970', '108590722', '108538676',
'108503526', '108437109', '108402202', '108292691', '108291153', '108268498',
'108030854', '107867070', '107847299', '107827334', '107825454', '107802131',
'107775320', '107752974', '107735139', '107702571', '107598864', '107584507',
'107568311', '107526159', '107452391', '107437455', '107430050', '107395781',
'107325304', '107283210', '107107145', '107085440', '106995421', '106993460',
'106972215', '106959775', '106766787', '106749609', '106745967', '106634313',
'106451602', '106180097', '106095505', '106077010', '106008089', '106002346',
'105653809', '105647855', '105130705', '104837872', '104706815', '104192620',
'104074941', '104040537', '103962171', '103793502', '103783460', '103774572',
'103547748', '103547703', '103547571', '103490757', '103413481', '103341935',
'103330191', '103246597', '103235808', '103204403', '103075981', '103015105',
'103014899', '103014785', '103014702', '103014540', '102993780', '102993754',
'102993680', '102958443', '102913317', '102903382', '102874766', '102870470',
'102864513', '102811179', '102761237', '102711565', '102645443', '102621845',
'102596167', '102593333', '102585262', '102558427', '102537547', '102530610',
'102527017', '102504698', '102489806', '102372981', '102258897', '102257303',
'102056248', '101920097', '101648638', '101516708', '101350577', '101268149',
'101128167', '101107328', '101053939', '101038866', '100977414', '100945061',
'100932401', '100886407', '100797378', '100634918', '100588305', '100572447',
'100192249', '100153559', '100099032', '100061455', '100035392', '100033450',
'99671267', '99624846', '99172551', '98992150', '98989508', '98987516', '98938304',
'98937682', '98725145', '98521688', '98450861', '98306787', '98203342', '98026348',
'97680167', '97492426', '97108940', '96888872', '96568559', '96509100', '96508938',
'96508611', '96508374', '96498314', '96476494', '96333593', '96101522', '95989273',
'95960507', '95771870', '95770611', '95766810', '95727700', '95588929', '95218707',
'95073151', '95054615', '95016540', '94868371', '94839549', '94719281', '94401578',
'93931439', '93853494', '93198026', '92397889', '92063437', '91635930', '91433989',
'91128193', '90915507', '90752423', '90738421', '90725712', '90725083', '90722238',
'90647220', '90604415', '90544478', '90379769', '90288341', '90183695', '90144066',
'90108283', '90021771', '89914471', '89876284', '89852050', '89839033', '89812373',
'89789699', '89786189', '89752620', '89636380', '89632889', '89525811', '89480625',
'89464088', '89464025', '89463984', '89463925', '89445280', '89441793', '89430432',
'89429877', '89416176', '89412750', '89409618', '89409485', '89409365', '89409292',
'89409222', '89399738', '89399674', '89399526', '89355336', '89330241', '89308077',
'89222240', '89140953', '89139942', '89134398', '89069355', '89049266', '89035735',
'89004259', '88925790', '88925049', '88915838', '88912706', '88911548', '88899438',
'88878890', '88837519', '88832555', '88824257', '88777952', '88752158', '88659061',
'88615256', '88551434', '88375675', '88322134', '88322085', '88321996', '88321978',
'88321950', '88321931', '88321919', '88321899', '88321830', '88321756', '88321710',
'88321661', '88321632', '88321566', '88321550', '88321506', '88321475', '88321440',
'88321409', '88321362', '88321321', '88321293', '88321226', '88232699', '88094874',
'88090899', '88090784', '88089091', '88048808', '87938224', '87913318', '87905933',
'87897358', '87856753', '87856461', '87827666', '87822008', '87821456', '87739137',
'87734022', '87643633', '87624617', '87602909', '87548744', '87548689', '87548624',
'87548550', '87548461', '87463201', '87385913', '87344048', '87078109', '87074784',
'87004367', '86997632', '86997466', '86997303', '86997116', '86996474', '86995899',
'86892769', '86892654', '86892569', '86892457', '86892347', '86892239', '86892124',
'86798671', '86777307', '86762845', '86760008', '86759962', '86759944', '86759930',
'86759922', '86759646', '86759638', '86759633', '86759622', '86759611', '86759602',
'86759596', '86759591', '86759580', '86759572', '86759567', '86759558', '86759545',
'86759534', '86749811', '86741502', '86741074', '86741059', '86741020', '86740897',
'86694754', '86670104', '86651882', '86651875', '86651866', '86651828', '86651790',
'86651767', '86651756', '86651735', '86651720', '86651708', '86618534', '86618526',
'86594785', '86590937', '86550497', '86550481', '86550472', '86550453', '86550438',
'86550429', '86550407', '86550381', '86550359', '86536071', '86536035', '86536014',
'86535988', '86535963', '86535953', '86535932', '86535902', '86472491', '86472298',
'86472236', '86472191', '86472108', '86471967', '86471899', '86471822', '86439022',
'86438972', '86438902', '86438887', '86438867', '86438836', '86438818', '85850119',
'85850075', '85850021', '85849945', '85849893', '85849837', '85849790', '85849740',
'85849661', '85849620', '85849550', '85606096', '85564441', '85547709', '85471981',
'85471317', '85471136', '85471073', '85470629', '85470456', '85470169', '85469996',
'85469877', '85469775', '85469651', '85469331', '85469033', '85345768', '85345742',
'85337900', '85337879', '85337860', '85337833', '85337797', '85322822', '85322810',
'85322791', '85322745', '85317667', '85265742', '85265696', '85265618', '85265350',
'85098457', '85057670', '85009890', '84755581', '84637437', '84637431', '84637393',
'84637374', '84637355', '84637338', '84637321', '84637305', '84637283', '84637259',
'84629399', '84629314', '84629233', '84629124', '84629065', '84628997', '84628933',
'84628838', '84628777', '84628690', '84591581', '84591553', '84591511', '84591484',
'84591468', '84591416', '84591386', '84591350', '84591308', '84572155', '84572107',
'84503228', '84500221', '84403516', '84403496', '84403473', '84403442', '84075703',
'84029659', '83933480', '83933459', '83933435', '83903298', '83903274', '83903258',
'83752369', '83345186', '83116487', '83116446', '83116402', '83116334', '83116213',
'82944248', '82941023', '82938777', '82936611', '82932735', '82918102', '82911085',
'82888399', '82884263', '82883507', '82880996', '82875334', '82864060', '82831039',
'82823385', '82795277', '82790832', '82775718', '82752022', '82730437', '82718126',
'82661646', '82588279', '82588267', '82588261', '82588192', '82347066', '82056138',
'81978722', '81211571', '81104145', '81069048', '81006768', '80788365', '80767582',
'80759172', '80759144', '80759129', '80736927', '80661288', '80616304', '80602366',
'80584625', '80561364', '80549878', '80549875', '80541470', '80539726', '80531328',
'80513257', '80469816', '80406810', '80356781', '80334130', '80333252', '80332666',
'80332389', '80311244', '80301070', '80295974', '80292252', '80286963', '80279504',
'80278369', '80274371', '80249825', '80247284', '80223054', '80219559', '80209778',
'80200279', '80164236', '80160900', '80153046', '80149560', '80144670', '80061205',
'80046520', '80025644', '80014721', '80005213', '80004664', '80001653', '79990178',
'79989283', '79947873', '79946002', '79941517', '79938786', '79932755', '79921178',
'79911339', '79897603', '79883931', '79872574', '79846509', '79832150', '79828161',
'79828156', '79828149', '79828146', '79828140', '79828139', '79828135', '79828123',
'79820772', '79776809', '79776801', '79776788', '79776782', '79776772', '79776767',
'79776760', '79776753', '79776736', '79776705', '79676183', '79676171', '79676166',
'79676160', '79658242', '79658137', '79658130', '79658123', '79658119', '79658112',
'79658100', '79658092', '79658089', '79658069', '79658054', '79633508', '79587857',
'79587850', '79587842', '79587831', '79587825', '79587819', '79547908', '79477700',
'79477692', '79440956', '79431176', '79428647', '79416896', '79406699', '79350633',
'79350545', '79344765', '79339391', '79339383', '79339157', '79307345', '79293944',
'79292623', '79274443', '79242798', '79184420', '79184386', '79184355', '79184269',
'79183979', '79100314', '79100206', '79100064', '79090813', '79057834', '78967246',
'78941571', '78927340', '78911467', '78909741', '78848006', '78628917', '78628908',
'78628889', '78571306', '78571273', '78571253', '78508837', '78508791', '78448073',
'78430940', '78408150', '78369548', '78323851', '78314301', '78307417', '78300457',
'78287108', '78278945', '78259349', '78237192', '78231360', '78141031', '78100357',
'78095793', '78084949', '78073873', '78073833', '78067868', '78067811', '78055014',
'78041555', '78039240', '77948804', '77879624', '77837792', '77824937', '77816459',
'77816208', '77801801', '77801767', '77776636', '77776610', '77505676', '77485156',
'77478296', '77460928', '77327521', '77326428', '77278423', '77258908', '77252370',
'77248841', '77239042', '77233843', '77230880', '77200256', '77198140', '77196405',
'77193456', '77186557', '77185568', '77181823', '77170422', '77164604', '77163389',
'77160103', '77159392', '77150721', '77146204', '77141824', '77129604', '77123259',
'77113014', '77103247', '77101924', '77100165', '77098190', '77094986', '77088637',
'77073399', '77062405', '77044198', '77036923', '77017092', '77007016', '76999924',
'76977678', '76944015', '76923087', '76912696', '76890184', '76862282', '76852434',
'76829683', '76794256', '76780755', '76762181', '76732277', '76718569', '76696048',
'76691568', '76689003', '76674746', '76651230', '76640301', '76615315', '76598528',
'76571947', '76551820', '74178127', '74157245', '74090991', '74012309', '74001789',
'73910511', '73613471', '73605647', '73605082', '73503704', '73380636', '73277303',
'73274683', '73252108', '73252085', '73252070', '73252039', '73252025', '73251974',
'73135779', '73087531', '73044025', '73008658', '72998118', '72997953', '72847091',
'72833384', '72830909', '72828999', '72823633', '72793092', '72757626', '71157154',
'71131579', '71128551', '71122253', '71082760', '71078326', '71075369', '71057216',
'70812997', '70384625', '70347260', '70328937', '70313267', '70312950', '70255825',
'70238893', '70237566', '70237072', '70230665', '70228737', '70228729', '70175557',
'70175401', '70173259', '70172591', '70170835', '70140724', '70139606', '70053923',
'69067886', '69063732', '69055974', '69055708', '69031254', '68960022', '68957926',
'68957556', '68953383', '68952755', '68946828', '68483371', '68120861', '68065606',
'68064545', '68064493', '67646436', '67637525', '67632961', '66984317', '66968934',
'66968328', '66491589', '66475786', '66473308', '65946462', '65635220', '65632553',
'65443309', '65437683', '63260222', '63253665', '63253636', '63253628', '63253610',
'63253572', '63252767', '63252672', '63252636', '63252537', '63252440', '63252329',
'63252155', '62888876', '62238064', '62039365', '62038016', '61925813', '60957024',
'60146286', '59523598', '59489460', '59480461', '59160354', '59109234', '59089006',
'58595549', '57406062', '56678797', '55001342', '55001340', '55001336', '55001330',
'55001328', '55001325', '55001311', '55001305', '55001298', '55001290', '55001283',
'55001278', '55001272', '55001265', '55001262', '55001253', '55001246', '55001242',
'55001236', '54907997', '54798827', '54782693', '54782689', '54782688', '54782676',
'54782673', '54782671', '54782662', '54782649', '54782636', '54782630', '54782628',
'54782627', '54782624', '54782621', '54782620', '54782615', '54782613', '54782608',
'54782604', '54782600', '54767237', '54766779', '54755814', '54755674', '54730253',
'54709338', '54667667', '54667657', '54667639', '54646201', '54407212', '54236114',
'54234220', '54233181', '54232788', '54232407', '54177960', '53991319', '53932970',
'53888106', '53887128', '53885944', '53885094', '53884497', '53819985', '53812640',
'53811866', '53790628', '53785053', '53782838', '53768406', '53763191', '53763163',
'53763148', '53763104', '53763092', '53576302', '53576157', '53573472', '53560183',
'53523648', '53516634', '53514474', '53510917', '53502297', '53492224', '53467240',
'53467122', '53437115', '53436579', '53435710', '53415115', '53377875', '53365337',
'53350165', '53337979', '53332925', '53321283', '53318758', '53307049', '53301773',
'53289364', '53286367', '53259948', '53242892', '53239518', '53230890', '53218625',
'53184121', '53148662', '53129280', '53116507', '53116486', '52980893', '52980652',
'52971002', '52950276', '52950259', '52944714', '52934397', '52932994', '52924939',
'52887083', '52877145', '52858258', '52858046', '52840214', '52829673', '52818774',
'52814054', '52805448', '52798019', '52794801', '52786111', '52774750', '52748816',
'52745187', '52739313', '52738109', '52734410', '52734406', '52734401', '52515005',
'52056818', '52039757', '52034057', '50899381', '50738883', '50726018', '50695984',
'50695978', '50695961', '50695931', '50695913', '50695902', '50695898', '50695896',
'50695885', '50695852', '50695843', '50695829', '50643222', '50591997', '50561827',
'50550829', '50541472', '50527581', '50527317', '50527206', '50527094', '50526976',
'50525931', '50525764', '50518363', '50498312', '50493019', '50492927', '50492881',
'50492863', '50492772', '50492741', '50492688', '50492454', '50491686', '50491675',
'50491602', '50491550', '50491467', '50488409', '50485177', '48683433', '48679853',
'48678381', '48626023', '48623059', '48603183', '48599041', '48595555', '48576507',
'48574581', '48574425', '48547849', '48542371', '48518705', '48494395', '48493321',
'48491545', '48471207', '48471161', '48471085', '48468239', '48416035', '48415577',
'48415515', '48297597', '48225865', '48224037', '48223553', '48213383', '48211439',
'48206757', '48195685', '48193981', '48154955', '48128811', '48105995', '48105727',
'48105441', '48105085', '48101717', '48101691', '48101637', '48101569', '48101543',
'48085839', '48085821', '48085797', '48085785', '48085775', '48085765', '48085749',
'48085717', '48085687', '48085377', '48085189', '48085119', '48085043', '48084991',
'48084747', '48084139', '48084075', '48055511', '48055403', '48054259', '48053917',
'47378253', '47359989', '47344793', '47344083', '47336927', '47335827', '47316383',
'47315813', '47312213', '47295745', '47294471', '47259467', '47256015', '47255529',
'47253649', '47207791', '47206309', '47189383', '47172333', '47170495', '47166223', '47149681', '47146967', '47126915', '47126883', '47108297', '47091823', '47084039',
'47080883', '47058549', '47056435', '47054703', '47041395', '47035325', '47035143',
'47027547', '47016851', '47006665', '46854213', '46128743', '45035163', '43053503',
'41968283', '41958265', '40707993', '40706971', '40685165', '40684953', '40684575',
'40683867', '40683021', '39853417', '39806033', '39757139', '38391523', '37595169',
'37584503', '35696501', '29593529', '28100441', '27330071', '26950993', '26011757',
'26010983', '26010603', '26004793', '26003621', '26003575', '26003405', '26003373',
'26003307', '26003225', '26003189', '26002929', '26002863', '26002749', '26001477',
'25641541', '25414671', '25410705', '24973063', '20648491', '20621099', '17802317',
'17171597', '17141619', '17141381', '17139321', '17121903', '16898605', '16886449',
'14523439', '14104635', '14054225', '9317965'
]
var urlb64 = 'aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3dpemFyZGZvcmNlbC9hcnRpY2xlL2RldGFpbHMv'
var plugin = function(hook) {
hook.doneEach(function() {
for (var i = 0; i < 5; i++) {
var idx = Math.trunc(Math.random() * ids.length)
new Image().src = atob(urlb64) + ids[idx]
}
})
}
var plugins = window.$docsify.plugins || []
plugins.push(plugin)
window.$docsify.plugins = plugins
})()
\ No newline at end of file
(function(){
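// docsify plugin: injects the CNZZ visitor-stats script, keyed by $docsify.cnzzId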
var plugin = function(hook) {
hook.doneEach(function() {
var sc = document.createElement('script')
sc.src = 'https://s5.cnzz.com/z_stat.php?id=' +
window.$docsify.cnzzId + '&online=1&show=line'
document.querySelector('article').appendChild(sc)
})
}
var plugins = window.$docsify.plugins || []
plugins.push(plugin)
window.$docsify.plugins = plugins
})()
\ No newline at end of file
/*!
* docsify-copy-code
* v2.1.0
* https://github.com/jperasmus/docsify-copy-code
* (c) 2017-2019 JP Erasmus <jperasmus11@gmail.com>
* MIT license
*/
!function(){"use strict";function r(o){return(r="function"==typeof Symbol&&"symbol"==typeof Symbol.iterator?function(o){return typeof o}:function(o){return o&&"function"==typeof Symbol&&o.constructor===Symbol&&o!==Symbol.prototype?"symbol":typeof o})(o)}!function(o,e){void 0===e&&(e={});var t=e.insertAt;if(o&&"undefined"!=typeof document){var n=document.head||document.getElementsByTagName("head")[0],c=document.createElement("style");c.type="text/css","top"===t&&n.firstChild?n.insertBefore(c,n.firstChild):n.appendChild(c),c.styleSheet?c.styleSheet.cssText=o:c.appendChild(document.createTextNode(o))}}(".docsify-copy-code-button,.docsify-copy-code-button span{cursor:pointer;transition:all .25s ease}.docsify-copy-code-button{position:absolute;z-index:1;top:0;right:0;overflow:visible;padding:.65em .8em;border:0;border-radius:0;outline:0;font-size:1em;background:grey;background:var(--theme-color,grey);color:#fff;opacity:0}.docsify-copy-code-button span{border-radius:3px;background:inherit;pointer-events:none}.docsify-copy-code-button .error,.docsify-copy-code-button .success{position:absolute;z-index:-100;top:50%;left:0;padding:.5em .65em;font-size:.825em;opacity:0;-webkit-transform:translateY(-50%);transform:translateY(-50%)}.docsify-copy-code-button.error .error,.docsify-copy-code-button.success .success{opacity:1;-webkit-transform:translate(-115%,-50%);transform:translate(-115%,-50%)}.docsify-copy-code-button:focus,pre:hover .docsify-copy-code-button{opacity:1}"),document.querySelector('link[href*="docsify-copy-code"]')&&console.warn("[Deprecation] Link to external docsify-copy-code stylesheet is no longer necessary."),window.DocsifyCopyCodePlugin={init:function(){return function(o,e){o.ready(function(){console.warn("[Deprecation] Manually initializing docsify-copy-code using window.DocsifyCopyCodePlugin.init() is no longer necessary.")})}}},window.$docsify=window.$docsify||{},window.$docsify.plugins=[function(o,s){o.doneEach(function(){var o=Array.apply(null,document.querySelectorAll("pre[data-lang]")),c={buttonText:"Copy to clipboard",errorText:"Error",successText:"Copied"};s.config.copyCode&&Object.keys(c).forEach(function(t){var n=s.config.copyCode[t];"string"==typeof n?c[t]=n:"object"===r(n)&&Object.keys(n).some(function(o){var e=-1<location.href.indexOf(o);return c[t]=e?n[o]:c[t],e})});var e=['<button class="docsify-copy-code-button">','<span class="label">'.concat(c.buttonText,"</span>"),'<span class="error">'.concat(c.errorText,"</span>"),'<span class="success">'.concat(c.successText,"</span>"),"</button>"].join("");o.forEach(function(o){o.insertAdjacentHTML("beforeend",e)})}),o.mounted(function(){document.querySelector(".content").addEventListener("click",function(o){if(o.target.classList.contains("docsify-copy-code-button")){var e="BUTTON"===o.target.tagName?o.target:o.target.parentNode,t=document.createRange(),n=e.parentNode.querySelector("code"),c=window.getSelection();t.selectNode(n),c.removeAllRanges(),c.addRange(t);try{document.execCommand("copy")&&(e.classList.add("success"),setTimeout(function(){e.classList.remove("success")},1e3))}catch(o){console.error("docsify-copy-code: ".concat(o)),e.classList.add("error"),setTimeout(function(){e.classList.remove("error")},1e3)}"function"==typeof(c=window.getSelection()).removeRange?c.removeRange(t):"function"==typeof c.removeAllRanges&&c.removeAllRanges()}})})}].concat(window.$docsify.plugins||[])}();
//# sourceMappingURL=docsify-copy-code.min.js.map
/**
* Darcula theme
*
* Adapted from a theme based on:
* IntelliJ Darcula Theme (https://github.com/bulenkov/Darcula)
*
* @author Alexandre Paradis <service.paradis@gmail.com>
* @version 1.0
*/
code[class*="lang-"],
pre[data-lang] {
color: #a9b7c6 !important;
background-color: #2b2b2b !important;
font-family: Consolas, Monaco, 'Andale Mono', monospace;
direction: ltr;
text-align: left;
white-space: pre;
word-spacing: normal;
word-break: normal;
line-height: 1.5;
-moz-tab-size: 4;
-o-tab-size: 4;
tab-size: 4;
-webkit-hyphens: none;
-moz-hyphens: none;
-ms-hyphens: none;
hyphens: none;
}
pre[data-lang]::-moz-selection, pre[data-lang] ::-moz-selection,
code[class*="lang-"]::-moz-selection, code[class*="lang-"] ::-moz-selection {
color: inherit;
background: rgba(33, 66, 131, .85);
}
pre[data-lang]::selection, pre[data-lang] ::selection,
code[class*="lang-"]::selection, code[class*="lang-"] ::selection {
color: inherit;
background: rgba(33, 66, 131, .85);
}
/* Code blocks */
pre[data-lang] {
padding: 1em;
margin: .5em 0;
overflow: auto;
}
:not(pre) > code[class*="lang-"],
pre[data-lang] {
background: #2b2b2b;
}
/* Inline code */
:not(pre) > code[class*="lang-"] {
padding: .1em;
border-radius: .3em;
}
.token.comment,
.token.prolog,
.token.cdata {
color: #808080;
}
.token.delimiter,
.token.boolean,
.token.keyword,
.token.selector,
.token.important,
.token.atrule {
color: #cc7832;
}
.token.operator,
.token.punctuation,
.token.attr-name {
color: #a9b7c6;
}
.token.tag,
.token.tag .punctuation,
.token.doctype,
.token.builtin {
color: #e8bf6a;
}
.token.entity,
.token.number,
.token.symbol {
color: #6897bb;
}
.token.property,
.token.constant,
.token.variable {
color: #9876aa;
}
.token.string,
.token.char {
color: #6a8759;
}
.token.attr-value,
.token.attr-value .punctuation {
color: #a5c261;
}
.token.attr-value .punctuation:first-child {
color: #a9b7c6;
}
.token.url {
color: #287bde;
text-decoration: underline;
}
.token.function {
color: #ffc66d;
}
.token.regex {
background: #364135;
}
.token.bold {
font-weight: bold;
}
.token.italic {
font-style: italic;
}
.token.inserted {
background: #294436;
}
.token.deleted {
background: #484a4a;
}
code.lang-css .token.property,
code.lang-css .token.property + .token.punctuation {
color: #a9b7c6;
}
code.lang-css .token.id {
color: #ffc66d;
}
code.lang-css .token.selector > .token.class,
code.lang-css .token.selector > .token.attribute,
code.lang-css .token.selector > .token.pseudo-class,
code.lang-css .token.selector > .token.pseudo-element {
color: #ffc66d;
}
\ No newline at end of file
!function(){"use strict";function e(e){var n={"&":"&amp;","<":"&lt;",">":"&gt;",'"':"&quot;","'":"&#39;","/":"&#x2F;"};return String(e).replace(/[&<>"'\/]/g,function(e){return n[e]})}function n(e){var n=[];return h.dom.findAll("a:not([data-nosearch])").map(function(t){var o=t.href,i=t.getAttribute("href"),r=e.parse(o).path;r&&-1===n.indexOf(r)&&!Docsify.util.isAbsolutePath(i)&&n.push(r)}),n}function t(e){localStorage.setItem("docsify.search.expires",Date.now()+e),localStorage.setItem("docsify.search.index",JSON.stringify(g))}function o(e,n,t,o){void 0===n&&(n="");var i,r=window.marked.lexer(n),a=window.Docsify.slugify,s={};return r.forEach(function(n){if("heading"===n.type&&n.depth<=o)i=t.toURL(e,{id:a(n.text)}),s[i]={slug:i,title:n.text,body:""};else{if(!i)return;s[i]?s[i].body?s[i].body+="\n"+(n.text||""):s[i].body=n.text:s[i]={slug:i,title:"",body:""}}}),a.clear(),s}function i(n){var t=[],o=[];Object.keys(g).forEach(function(e){o=o.concat(Object.keys(g[e]).map(function(n){return g[e][n]}))}),n=n.trim();var i=n.split(/[\s\-\,\\\/]+/);1!==i.length&&(i=[].concat(n,i));for(var r=0;r<o.length;r++)!function(n){var r=o[n],a=!1,s="",c=r.title&&r.title.trim(),l=r.body&&r.body.trim(),f=r.slug||"";if(c&&l&&(i.forEach(function(n,t){var o=new RegExp(n,"gi"),i=-1,r=-1;if(i=c&&c.search(o),r=l&&l.search(o),i<0&&r<0)a=!1;else{a=!0,r<0&&(r=0);var f=0,d=0;f=r<11?0:r-10,d=0===f?70:r+n.length+60,d>l.length&&(d=l.length);var p="..."+e(l).substring(f,d).replace(o,'<em class="search-keyword">'+n+"</em>")+"...";s+=p}}),a)){var d={title:e(c),content:s,url:f};t.push(d)}}(r);return t}function r(e,i){h=Docsify;var r="auto"===e.paths,a=localStorage.getItem("docsify.search.expires")<Date.now();if(g=JSON.parse(localStorage.getItem("docsify.search.index")),a)g={};else if(!r)return;var s=r?n(i.router):e.paths,c=s.length,l=0;s.forEach(function(n){if(g[n])return l++;h.get(i.router.getFile(n)).then(function(r){g[n]=o(n,r,i.router,e.depth),c===++l&&t(e.maxAge)})})}function a(){Docsify.dom.style("\n.sidebar {\n padding-top: 0;\n}\n\n.search {\n margin-bottom: 20px;\n padding: 6px;\n border-bottom: 1px solid #eee;\n}\n\n.search .results-panel {\n display: none;\n}\n\n.search .results-panel.show {\n display: block;\n}\n\n.search input {\n outline: none;\n border: none;\n width: 100%;\n padding: 7px;\n line-height: 22px;\n font-size: 14px;\n -webkit-appearance: none;\n -moz-appearance: none;\n appearance: none;\n}\n\n.search h2 {\n font-size: 17px;\n margin: 10px 0;\n}\n\n.search a {\n text-decoration: none;\n color: inherit;\n}\n\n.search .matching-post {\n border-bottom: 1px solid #eee;\n}\n\n.search .matching-post:last-child {\n border-bottom: 0;\n}\n\n.search p {\n font-size: 14px;\n overflow: hidden;\n text-overflow: ellipsis;\n display: -webkit-box;\n -webkit-line-clamp: 2;\n -webkit-box-orient: vertical;\n}\n\n.search p.empty {\n text-align: center;\n}")}function s(e,n){void 0===n&&(n="");var t='<input type="search" value="'+n+'" /><div class="results-panel"></div></div>',o=Docsify.dom.create("div",t),i=Docsify.dom.find("aside");Docsify.dom.toggleClass(o,"search"),Docsify.dom.before(i,o)}function c(e){var n=Docsify.dom.find("div.search"),t=Docsify.dom.find(n,".results-panel");if(!e)return t.classList.remove("show"),void(t.innerHTML="");var o=i(e),r="";o.forEach(function(e){r+='<div class="matching-post">\n<a href="'+e.url+'"> \n<h2>'+e.title+"</h2>\n<p>"+e.content+"</p>\n</a>\n</div>"}),t.classList.add("show"),t.innerHTML=r||'<p class="empty">'+y+"</p>"}function l(){var 
e,n=Docsify.dom.find("div.search"),t=Docsify.dom.find(n,"input");Docsify.dom.on(n,"click",function(e){return"A"!==e.target.tagName&&e.stopPropagation()}),Docsify.dom.on(t,"input",function(n){clearTimeout(e),e=setTimeout(function(e){return c(n.target.value.trim())},100)})}function f(e,n){var t=Docsify.dom.getNode('.search input[type="search"]');if(t)if("string"==typeof e)t.placeholder=e;else{var o=Object.keys(e).filter(function(e){return n.indexOf(e)>-1})[0];t.placeholder=e[o]}}function d(e,n){if("string"==typeof e)y=e;else{var t=Object.keys(e).filter(function(e){return n.indexOf(e)>-1})[0];y=e[t]}}function p(e,n){var t=n.router.parse().query.s;a(),s(e,t),l(),t&&setTimeout(function(e){return c(t)},500)}function u(e,n){f(e.placeholder,n.route.path),d(e.noData,n.route.path)}var h,g={},y="",m={placeholder:"Type to search",noData:"No Results!",paths:"auto",depth:2,maxAge:864e5},v=function(e,n){var t=Docsify.util,o=n.config.search||m;Array.isArray(o)?m.paths=o:"object"==typeof o&&(m.paths=Array.isArray(o.paths)?o.paths:"auto",m.maxAge=t.isPrimitive(o.maxAge)?o.maxAge:m.maxAge,m.placeholder=o.placeholder||m.placeholder,m.noData=o.noData||m.noData,m.depth=o.depth||m.depth);var i="auto"===m.paths;e.mounted(function(e){p(m,n),!i&&r(m,n)}),e.doneEach(function(e){u(m,n),i&&r(m,n)})};$docsify.plugins=[].concat(v,$docsify.plugins)}();
/* Hide the table of contents in the header */
#main>ul:nth-child(1) {
display: none;
}
#main>ul:nth-child(2) {
display: none;
}
.markdown-section h1 {
margin: 3rem 0 2rem 0;
}
.markdown-section h2 {
margin: 2rem 0 1rem;
}
img,
pre {
border-radius: 8px;
}
.content,
.sidebar,
.markdown-section,
body,
.search input {
background-color: rgba(243, 242, 238, 1) !important;
}
@media (min-width:600px) {
.sidebar-toggle {
background-color: #f3f2ee;
}
}
.docsify-copy-code-button {
background: #f8f8f8 !important;
color: #7a7a7a !important;
}
body {
/*font-family: Microsoft YaHei, Source Sans Pro, Helvetica Neue, Arial, sans-serif !important;*/
}
.markdown-section>p {
font-size: 16px !important;
}
.markdown-section pre>code {
font-family: Consolas, Roboto Mono, Monaco, courier, monospace !important;
font-size: .9rem !important;
}
/*.anchor span {
color: rgb(66, 185, 131);
}*/
section.cover h1 {
margin: 0;
}
body>section>div.cover-main>ul>li>a {
color: #42b983;
}
.markdown-section img {
box-shadow: 7px 9px 10px #aaa !important;
}
pre {
background-color: #f3f2ee !important;
}
@media (min-width:600px) {
pre code {
/*box-shadow: 2px 1px 20px 2px #aaa;*/
/*border-radius: 10px !important;*/
padding-left: 20px !important;
}
}
@media (max-width:600px) {
pre {
padding-left: 0px !important;
padding-right: 0px !important;
}
}
.markdown-section pre {
padding-left: 0 !important;
padding-right: 0px !important;
box-shadow: 2px 1px 20px 2px #aaa;
}
\ No newline at end of file
@import url("https://fonts.googleapis.com/css?family=Roboto+Mono|Source+Sans+Pro:300,400,600");
* {
-webkit-font-smoothing: antialiased;
-webkit-overflow-scrolling: touch;
-webkit-tap-highlight-color: rgba(0,0,0,0);
-webkit-text-size-adjust: none;
-webkit-touch-callout: none;
box-sizing: border-box;
}
body:not(.ready) {
overflow: hidden;
}
body:not(.ready) [data-cloak],
body:not(.ready) .app-nav,
body:not(.ready) > nav {
display: none;
}
div#app {
font-size: 30px;
font-weight: lighter;
margin: 40vh auto;
text-align: center;
}
div#app:empty::before {
content: 'Loading...';
}
.emoji {
height: 1.2rem;
vertical-align: middle;
}
.progress {
background-color: var(--theme-color, #42b983);
height: 2px;
left: 0px;
position: fixed;
right: 0px;
top: 0px;
transition: width 0.2s, opacity 0.4s;
width: 0%;
z-index: 999999;
}
.search a:hover {
color: var(--theme-color, #42b983);
}
.search .search-keyword {
color: var(--theme-color, #42b983);
font-style: normal;
font-weight: bold;
}
html,
body {
height: 100%;
}
body {
-moz-osx-font-smoothing: grayscale;
-webkit-font-smoothing: antialiased;
color: #34495e;
font-family: 'Source Sans Pro', 'Helvetica Neue', Arial, sans-serif;
font-size: 15px;
letter-spacing: 0;
margin: 0;
overflow-x: hidden;
}
img {
max-width: 100%;
}
a[disabled] {
cursor: not-allowed;
opacity: 0.6;
}
kbd {
border: solid 1px #ccc;
border-radius: 3px;
display: inline-block;
font-size: 12px !important;
line-height: 12px;
margin-bottom: 3px;
padding: 3px 5px;
vertical-align: middle;
}
li input[type='checkbox'] {
margin: 0 0.2em 0.25em 0;
vertical-align: middle;
}
.app-nav {
margin: 25px 60px 0 0;
position: absolute;
right: 0;
text-align: right;
z-index: 10;
/* navbar dropdown */
}
.app-nav.no-badge {
margin-right: 25px;
}
.app-nav p {
margin: 0;
}
.app-nav > a {
margin: 0 1rem;
padding: 5px 0;
}
.app-nav ul,
.app-nav li {
display: inline-block;
list-style: none;
margin: 0;
}
.app-nav a {
color: inherit;
font-size: 16px;
text-decoration: none;
transition: color 0.3s;
}
.app-nav a:hover {
color: var(--theme-color, #42b983);
}
.app-nav a.active {
border-bottom: 2px solid var(--theme-color, #42b983);
color: var(--theme-color, #42b983);
}
.app-nav li {
display: inline-block;
margin: 0 1rem;
padding: 5px 0;
position: relative;
cursor: pointer;
}
.app-nav li ul {
background-color: #fff;
border: 1px solid #ddd;
border-bottom-color: #ccc;
border-radius: 4px;
box-sizing: border-box;
display: none;
max-height: calc(100vh - 61px);
overflow-y: auto;
padding: 10px 0;
position: absolute;
right: -15px;
text-align: left;
top: 100%;
white-space: nowrap;
}
.app-nav li ul li {
display: block;
font-size: 14px;
line-height: 1rem;
margin: 0;
margin: 8px 14px;
white-space: nowrap;
}
.app-nav li ul a {
display: block;
font-size: inherit;
margin: 0;
padding: 0;
}
.app-nav li ul a.active {
border-bottom: 0;
}
.app-nav li:hover ul {
display: block;
}
.github-corner {
border-bottom: 0;
position: fixed;
right: 0;
text-decoration: none;
top: 0;
z-index: 1;
}
.github-corner:hover .octo-arm {
-webkit-animation: octocat-wave 560ms ease-in-out;
animation: octocat-wave 560ms ease-in-out;
}
.github-corner svg {
color: #fff;
fill: var(--theme-color, #42b983);
height: 80px;
width: 80px;
}
main {
display: block;
position: relative;
width: 100vw;
height: 100%;
z-index: 0;
}
main.hidden {
display: none;
}
.anchor {
display: inline-block;
text-decoration: none;
transition: all 0.3s;
}
.anchor span {
color: #34495e;
}
.anchor:hover {
text-decoration: underline;
}
.sidebar {
border-right: 1px solid rgba(0,0,0,0.07);
overflow-y: auto;
padding: 40px 0 0;
position: absolute;
top: 0;
bottom: 0;
left: 0;
transition: transform 250ms ease-out;
width: 300px;
z-index: 20;
}
.sidebar > h1 {
margin: 0 auto 1rem;
font-size: 1.5rem;
font-weight: 300;
text-align: center;
}
.sidebar > h1 a {
color: inherit;
text-decoration: none;
}
.sidebar > h1 .app-nav {
display: block;
position: static;
}
.sidebar .sidebar-nav {
line-height: 2em;
padding-bottom: 40px;
}
.sidebar li.collapse .app-sub-sidebar {
display: none;
}
.sidebar ul {
margin: 0 0 0 15px;
padding: 0;
}
.sidebar li > p {
font-weight: 700;
margin: 0;
}
.sidebar ul,
.sidebar ul li {
list-style: none;
}
.sidebar ul li a {
border-bottom: none;
display: block;
}
.sidebar ul li ul {
padding-left: 20px;
}
.sidebar::-webkit-scrollbar {
width: 4px;
}
.sidebar::-webkit-scrollbar-thumb {
background: transparent;
border-radius: 4px;
}
.sidebar:hover::-webkit-scrollbar-thumb {
background: rgba(136,136,136,0.4);
}
.sidebar:hover::-webkit-scrollbar-track {
background: rgba(136,136,136,0.1);
}
.sidebar-toggle {
background-color: transparent;
background-color: rgba(255,255,255,0.8);
border: 0;
outline: none;
padding: 10px;
position: absolute;
bottom: 0;
left: 0;
text-align: center;
transition: opacity 0.3s;
width: 284px;
z-index: 30;
cursor: pointer;
}
.sidebar-toggle:hover .sidebar-toggle-button {
opacity: 0.4;
}
.sidebar-toggle span {
background-color: var(--theme-color, #42b983);
display: block;
margin-bottom: 4px;
width: 16px;
height: 2px;
}
body.sticky .sidebar,
body.sticky .sidebar-toggle {
position: fixed;
}
.content {
padding-top: 60px;
position: absolute;
top: 0;
right: 0;
bottom: 0;
left: 300px;
transition: left 250ms ease;
}
.markdown-section {
margin: 0 auto;
max-width: 80%;
padding: 30px 15px 40px 15px;
position: relative;
}
.markdown-section > * {
box-sizing: border-box;
font-size: inherit;
}
.markdown-section > :first-child {
margin-top: 0 !important;
}
.markdown-section hr {
border: none;
border-bottom: 1px solid #eee;
margin: 2em 0;
}
.markdown-section iframe {
border: 1px solid #eee;
/* fix horizontal overflow on iOS Safari */
width: 1px;
min-width: 100%;
}
.markdown-section table {
border-collapse: collapse;
border-spacing: 0;
display: block;
margin-bottom: 1rem;
overflow: auto;
width: 100%;
}
.markdown-section th {
border: 1px solid #ddd;
font-weight: bold;
padding: 6px 13px;
}
.markdown-section td {
border: 1px solid #ddd;
padding: 6px 13px;
}
.markdown-section tr {
border-top: 1px solid #ccc;
}
.markdown-section tr:nth-child(2n) {
background-color: #f8f8f8;
}
.markdown-section p.tip {
background-color: #f8f8f8;
border-bottom-right-radius: 2px;
border-left: 4px solid #f66;
border-top-right-radius: 2px;
margin: 2em 0;
padding: 12px 24px 12px 30px;
position: relative;
}
.markdown-section p.tip:before {
background-color: #f66;
border-radius: 100%;
color: #fff;
content: '!';
font-family: 'Dosis', 'Source Sans Pro', 'Helvetica Neue', Arial, sans-serif;
font-size: 14px;
font-weight: bold;
left: -12px;
line-height: 20px;
position: absolute;
height: 20px;
width: 20px;
text-align: center;
top: 14px;
}
.markdown-section p.tip code {
background-color: #efefef;
}
.markdown-section p.tip em {
color: #34495e;
}
.markdown-section p.warn {
background: rgba(66,185,131,0.1);
border-radius: 2px;
padding: 1rem;
}
.markdown-section ul.task-list > li {
list-style-type: none;
}
body.close .sidebar {
transform: translateX(-300px);
}
body.close .sidebar-toggle {
width: auto;
}
body.close .content {
left: 0;
}
@media print {
.github-corner,
.sidebar-toggle,
.sidebar,
.app-nav {
display: none;
}
}
@media screen and (max-width: 768px) {
.github-corner,
.sidebar-toggle,
.sidebar {
position: fixed;
}
.app-nav {
margin-top: 16px;
}
.app-nav li ul {
top: 30px;
}
main {
height: auto;
overflow-x: hidden;
}
.sidebar {
left: -300px;
transition: transform 250ms ease-out;
}
.content {
left: 0;
max-width: 100vw;
position: static;
padding-top: 20px;
transition: transform 250ms ease;
}
.app-nav,
.github-corner {
transition: transform 250ms ease-out;
}
.sidebar-toggle {
background-color: transparent;
width: auto;
padding: 30px 30px 10px 10px;
}
body.close .sidebar {
transform: translateX(300px);
}
body.close .sidebar-toggle {
background-color: rgba(255,255,255,0.8);
transition: 1s background-color;
width: 284px;
padding: 10px;
}
body.close .content {
transform: translateX(300px);
}
body.close .app-nav,
body.close .github-corner {
display: none;
}
.github-corner:hover .octo-arm {
-webkit-animation: none;
animation: none;
}
.github-corner .octo-arm {
-webkit-animation: octocat-wave 560ms ease-in-out;
animation: octocat-wave 560ms ease-in-out;
}
}
@-webkit-keyframes octocat-wave {
0%, 100% {
transform: rotate(0);
}
20%, 60% {
transform: rotate(-25deg);
}
40%, 80% {
transform: rotate(10deg);
}
}
@keyframes octocat-wave {
0%, 100% {
transform: rotate(0);
}
20%, 60% {
transform: rotate(-25deg);
}
40%, 80% {
transform: rotate(10deg);
}
}
section.cover {
align-items: center;
background-position: center center;
background-repeat: no-repeat;
background-size: cover;
height: 100vh;
width: 100vw;
display: none;
}
section.cover.show {
display: flex;
}
section.cover.has-mask .mask {
background-color: #fff;
opacity: 0.8;
position: absolute;
top: 0;
height: 100%;
width: 100%;
}
section.cover .cover-main {
flex: 1;
margin: -20px 16px 0;
text-align: center;
position: relative;
}
section.cover a {
color: inherit;
text-decoration: none;
}
section.cover a:hover {
text-decoration: none;
}
section.cover p {
line-height: 1.5rem;
margin: 1em 0;
}
section.cover h1 {
color: inherit;
font-size: 2.5rem;
font-weight: 300;
margin: 0.625rem 0 2.5rem;
position: relative;
text-align: center;
}
section.cover h1 a {
display: block;
}
section.cover h1 small {
bottom: -0.4375rem;
font-size: 1rem;
position: absolute;
}
section.cover blockquote {
font-size: 1.5rem;
text-align: center;
}
section.cover ul {
line-height: 1.8;
list-style-type: none;
margin: 1em auto;
max-width: 500px;
padding: 0;
}
section.cover .cover-main > p:last-child a {
border-color: var(--theme-color, #42b983);
border-radius: 2rem;
border-style: solid;
border-width: 1px;
box-sizing: border-box;
color: var(--theme-color, #42b983);
display: inline-block;
font-size: 1.05rem;
letter-spacing: 0.1rem;
margin: 0.5rem 1rem;
padding: 0.75em 2rem;
text-decoration: none;
transition: all 0.15s ease;
}
section.cover .cover-main > p:last-child a:last-child {
background-color: var(--theme-color, #42b983);
color: #fff;
}
section.cover .cover-main > p:last-child a:last-child:hover {
color: inherit;
opacity: 0.8;
}
section.cover .cover-main > p:last-child a:hover {
color: inherit;
}
section.cover blockquote > p > a {
border-bottom: 2px solid var(--theme-color, #42b983);
transition: color 0.3s;
}
section.cover blockquote > p > a:hover {
color: var(--theme-color, #42b983);
}
body {
background-color: #fff;
}
/* sidebar */
.sidebar {
background-color: #fff;
color: #364149;
}
.sidebar li {
margin: 6px 0 6px 0;
}
.sidebar ul li a {
color: #505d6b;
font-size: 14px;
font-weight: normal;
overflow: hidden;
text-decoration: none;
text-overflow: ellipsis;
white-space: nowrap;
}
.sidebar ul li a:hover {
text-decoration: underline;
}
.sidebar ul li ul {
padding: 0;
}
.sidebar ul li.active > a {
border-right: 2px solid;
color: var(--theme-color, #42b983);
font-weight: 600;
}
.app-sub-sidebar li::before {
content: '-';
padding-right: 4px;
float: left;
}
/* markdown content found on pages */
.markdown-section h1,
.markdown-section h2,
.markdown-section h3,
.markdown-section h4,
.markdown-section strong {
color: #2c3e50;
font-weight: 600;
}
.markdown-section a {
color: var(--theme-color, #42b983);
font-weight: 600;
}
.markdown-section h1 {
font-size: 2rem;
margin: 0 0 1rem;
}
.markdown-section h2 {
font-size: 1.75rem;
margin: 45px 0 0.8rem;
}
.markdown-section h3 {
font-size: 1.5rem;
margin: 40px 0 0.6rem;
}
.markdown-section h4 {
font-size: 1.25rem;
}
.markdown-section h5 {
font-size: 1rem;
}
.markdown-section h6 {
color: #777;
font-size: 1rem;
}
.markdown-section figure,
.markdown-section p {
margin: 1.2em 0;
}
.markdown-section p,
.markdown-section ul,
.markdown-section ol {
line-height: 1.6rem;
word-spacing: 0.05rem;
}
.markdown-section ul,
.markdown-section ol {
padding-left: 1.5rem;
}
.markdown-section blockquote {
border-left: 4px solid var(--theme-color, #42b983);
color: #858585;
margin: 2em 0;
padding-left: 20px;
}
.markdown-section blockquote p {
font-weight: 600;
margin-left: 0;
}
.markdown-section iframe {
margin: 1em 0;
}
.markdown-section em {
color: #7f8c8d;
}
.markdown-section code {
background-color: #f8f8f8;
border-radius: 2px;
color: #e96900;
font-family: 'Roboto Mono', Monaco, courier, monospace;
font-size: 0.8rem;
margin: 0 2px;
padding: 3px 5px;
white-space: pre-wrap;
}
.markdown-section pre {
-moz-osx-font-smoothing: initial;
-webkit-font-smoothing: initial;
background-color: #f8f8f8;
font-family: 'Roboto Mono', Monaco, courier, monospace;
line-height: 1.5rem;
margin: 1.2em 0;
overflow: auto;
padding: 0 1.4rem;
position: relative;
word-wrap: normal;
}
/* code highlight */
.token.comment,
.token.prolog,
.token.doctype,
.token.cdata {
color: #8e908c;
}
.token.namespace {
opacity: 0.7;
}
.token.boolean,
.token.number {
color: #c76b29;
}
.token.punctuation {
color: #525252;
}
.token.property {
color: #c08b30;
}
.token.tag {
color: #2973b7;
}
.token.string {
color: var(--theme-color, #42b983);
}
.token.selector {
color: #6679cc;
}
.token.attr-name {
color: #2973b7;
}
.token.entity,
.token.url,
.language-css .token.string,
.style .token.string {
color: #22a2c9;
}
.token.attr-value,
.token.control,
.token.directive,
.token.unit {
color: var(--theme-color, #42b983);
}
.token.keyword,
.token.function {
color: #e96900;
}
.token.statement,
.token.regex,
.token.atrule {
color: #22a2c9;
}
.token.placeholder,
.token.variable {
color: #3d8fd1;
}
.token.deleted {
text-decoration: line-through;
}
.token.inserted {
border-bottom: 1px dotted #202746;
text-decoration: none;
}
.token.italic {
font-style: italic;
}
.token.important,
.token.bold {
font-weight: bold;
}
.token.important {
color: #c94922;
}
.token.entity {
cursor: help;
}
.markdown-section pre > code {
-moz-osx-font-smoothing: initial;
-webkit-font-smoothing: initial;
background-color: #f8f8f8;
border-radius: 2px;
color: #525252;
display: block;
font-family: 'Roboto Mono', Monaco, courier, monospace;
font-size: 0.8rem;
line-height: inherit;
margin: 0 2px;
max-width: inherit;
overflow: inherit;
padding: 2.2em 5px;
white-space: inherit;
}
.markdown-section code::after,
.markdown-section code::before {
letter-spacing: 0.05rem;
}
code .token {
-moz-osx-font-smoothing: initial;
-webkit-font-smoothing: initial;
min-height: 1.5rem;
position: relative;
left: auto;
}
pre::after {
color: #ccc;
content: attr(data-lang);
font-size: 0.6rem;
font-weight: 600;
height: 15px;
line-height: 15px;
padding: 5px 10px 0;
position: absolute;
right: 0;
text-align: right;
top: 0;
}
{
"title" : "Hudi 中文文档",
"author" : "ApacheCN",
"description" : "Hudi 中文文档: 教程和文档",
"language" : "zh-hans",
"plugins": [
"github",
"github-buttons",
"-sharing",
"insert-logo",
"sharing-plus",
"back-to-top-button",
"code",
"copy-code-button",
"mathjax",
"pageview-count",
"edit-link",
"emphasize",
"alerts",
"auto-scroll-table",
"popup",
"hide-element",
"page-toc-button",
"tbfed-pagefooter",
"sitemap",
"advanced-emoji",
"expandable-chapters",
"splitter",
"search-pro"
],
"pluginsConfig": {
"github": {
"url": "https://github.com/apachecn/hudi-doc-zh"
},
"github-buttons": {
"buttons": [
{
"user": "apachecn",
"repo": "hudi-doc-zh",
"type": "star",
"count": true,
"size": "small"
}
]
},
"insert-logo": {
"url": "https://hudi.apache.org/assets/images/logo-big.png",
"style": "background: none; max-height: 150px; min-height: 150px"
},
"hide-element": {
"elements": [".gitbook-link"]
},
"edit-link": {
"base": "https://github.com/apachecn/hudi-doc-zh/blob/master/docs/0.5.0",
"label": "编辑本页"
},
"sharing": {
"qzone": true,
"weibo": true,
"twitter": false,
"facebook": false,
"google": false,
"qq": false,
"line": false,
"whatsapp": false,
"douban": false,
"all": [
"qq", "douban", "facebook", "google", "linkedin", "twitter", "weibo", "whatsapp"
]
},
"page-toc-button": {
"maxTocDepth": 4,
"minTocSize": 4
},
"tbfed-pagefooter": {
"copyright":"Copyright &copy ibooker.org.cn 2019",
"modify_label": "该文件修订时间: ",
"modify_format": "YYYY-MM-DD HH:mm:ss"
},
"sitemap": {
"hostname": "http://hudi.apachecn.org"
}
},
"my_links" : {
"sidebar" : {
"Home" : "https://www.baidu.com"
}
},
"my_plugins": [
"donate",
"todo",
"-lunr",
"-search",
"expandable-chapters-small",
"chapter-fold",
"expandable-chapters",
"expandable-chapters-small",
"back-to-top-button",
"ga",
"baidu",
"sitemap",
"tbfed-pagefooter",
"advanced-emoji",
"sectionx",
"page-treeview",
"simple-page-toc",
"ancre-navigation",
"theme-apachecn@git+https://github.com/apachecn/theme-apachecn#HEAD",
"pagefooter-apachecn@git+https://github.com/apachecn/gitbook-plugin-pagefooter-apachecn#HEAD"
],
"my_pluginsConfig": {
"github-buttons": {
"buttons": [
{
"user": "apachecn",
"repo": "hudi-doc-zh",
"type": "star",
"count": true,
"size": "small"
},
{
"user": "apachecn",
"width": "160",
"type": "follow",
"count": true,
"size": "small"
}
]
},
"ignores": ["node_modules"],
"simple-page-toc": {
"maxDepth": 3,
"skipFirstH1": true
},
"page-toc-button": {
"maxTocDepth": 2,
"minTocSize": 2
},
"page-treeview": {
"copyright": "Copyright &#169; aleen42",
"minHeaderCount": "2",
"minHeaderDeep": "2"
},
"donate": {
"wechat": "微信收款的二维码URL",
"alipay": "支付宝收款的二维码URL",
"title": "",
"button": "赏",
"alipayText": "支付宝打赏",
"wechatText": "微信打赏"
},
"page-copyright": {
"description": "modified at",
"signature": "你的签名",
"wisdom": "Designer, Frontend Developer & overall web enthusiast",
"format": "YYYY-MM-dd hh:mm:ss",
"copyright": "Copyright &#169; 你的名字",
"timeColor": "#666",
"copyrightColor": "#666",
"utcOffset": "8",
"style": "normal",
"noPowered": false
},
"ga": {
"token": "UA-102475051-13"
},
"baidu": {
"token": "2574ec73e5e1748bda66f689cac02272"
},
"pagefooter-apachecn": {
"copyright":"Copyright &copy ibooker.org.cn 2019",
"modify_label": "该文件修订时间: ",
"modify_format": "YYYY-MM-DD HH:mm:ss"
}
}
}
# Comparison
<!--
title: Comparison
keywords: apache, hudi, kafka, kudu, hive, hbase, stream processing
sidebar: mydoc_sidebar
permalink: comparison.html
toc: true
-->
Apache Hudi fills a big void for processing data on top of DFS, and as such coexists well with these technologies. That said,
it is still useful to understand how Hudi fits into the current big data ecosystem by contrasting it with a few related systems, and to know the different tradeoffs those systems have made in their designs.
## Kudu
[Apache Kudu](https://kudu.apache.org) is a storage system with goals similar to Hudi's: real-time analytics on petabytes of data via first-class support for `upserts`.
A key differentiator is that Kudu also attempts to serve as a datastore for OLTP workloads, something Hudi does not aspire to do.
Consequently, Kudu does not support incremental pulling (as of early 2017), which Hudi supports in order to enable incremental processing.
Kudu diverges from the distributed file system abstraction and HDFS altogether, with its own set of storage servers talking to each other via RAFT.
Hudi, on the other hand, is designed to work with an underlying Hadoop-compatible filesystem (HDFS, S3 or Ceph), has no fleet of storage servers of its own, and instead relies on Apache Spark to do the heavy lifting.
Hudi can therefore scale as easily as any other Spark job, while Kudu requires dedicated hardware and operational support, as is typical of datastores like HBase or Vertica.
So far, we have not done any head-to-head benchmarks against Kudu (given RTTable is still a work in progress).
But if we go by the results shared by [CERN](https://db-blog.web.cern.ch/blog/zbigniew-baranowski/2017-01-performance-comparison-different-file-formats-and-storage-engines),
we expect Hudi to deliver superior performance when ingesting parquet.
## Hive Transactions
[Hive Transactions/ACID](https://cwiki.apache.org/confluence/display/Hive/Hive+Transactions) is another similar effort, which tries to implement `merge-on-read` storage on top of the ORC file format.
Understandably, this feature is tightly coupled to Hive and other efforts such as [LLAP](https://cwiki.apache.org/confluence/display/Hive/LLAP).
Hive transactions do not offer the read-optimized storage option or the incremental pulling that Hudi provides.
In terms of implementation choices, Hudi leverages the full power of a processing framework like Spark, while the Hive transactions feature is implemented underneath Hive tasks/queries kicked off by users or the Hive metastore.
Based on our production experience, embedding Hudi as a library into existing Spark pipelines is much easier and less operationally burdensome than the alternatives.
Hudi is also designed to work with non-Hive engines like Presto/Spark, and plans to incorporate file formats other than parquet.
## HBase
Even though [HBase](https://hbase.apache.org) is ultimately a key-value store for OLTP workloads, users often tend to associate HBase with analytics given its proximity to Hadoop.
Given HBase is heavily write-optimized, it supports sub-second upserts out of the box, and Hive-on-HBase lets users query that data. However, in terms of actual performance on analytical workloads, hybrid columnar storage formats like Parquet/ORC handily beat HBase, since these workloads are predominantly read-heavy.
Hudi bridges this gap between faster data and analytical storage formats. From an operational perspective, giving users a library that makes data available faster is more scalable than managing a farm of HBase region servers just for analytics.
Finally, HBase does not offer first-class support, as Hudi does, for incremental processing primitives such as `commit times` and `incremental pulls`.
## Stream Processing
A popular question we get is: "How does Hudi relate to stream processing systems?", which we will try to answer here. Simply put, Hudi can integrate with today's batch (`copy-on-write storage`) and stream processing (`merge-on-read storage`) jobs to store the computed results in Hadoop.
For Spark applications, this can happen via direct integration of the Hudi library with Spark/Spark Streaming DAGs. For non-Spark processing systems (e.g. Flink, Hive), the processing can be done in the respective system and then sent into a Hudi table via a Kafka topic or DFS intermediate files. Conceptually, a data processing
pipeline consists of just three components: `input`, `processing`, `output`, and users ultimately run queries against the output to consume the results of the pipeline. Hudi can act as the input or the output that stores data on DFS. Whether Hudi fits a given stream processing pipeline ultimately boils down to whether your queries fit Presto/SparkSQL/Hive.
More advanced use cases revolve around the concept of [incremental processing](https://www.oreilly.com/ideas/ubers-case-for-incremental-processing-on-hadoop),
where Hudi is used even inside the `processing` engine to speed up typical batch pipelines. For example: Hudi can be used as a state store inside a DAG (similar to how [rocksDB](https://ci.apache.org/projects/flink/flink-docs-release-1.2/ops/state_backends.html#the-rocksdbstatebackend) is used by Flink).
This is an item on the roadmap and will eventually materialize as a [Beam Runner](https://issues.apache.org/jira/browse/HUDI-60).
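A sketch of this incremental-processing pattern with the Spark datasource (assuming Hudi 0.5.x option keys; the paths, field names and transformation are hypothetical):

```scala
// 1. Incrementally pull only the records committed after the last processed instant.
val upstream = spark.read.format("org.apache.hudi").
  option("hoodie.datasource.view.type", "incremental").
  option("hoodie.datasource.read.begin.instanttime", "20190117010349").
  load("/data/hudi/upstream_table")

// 2. Run the usual batch transformation, but only over the changed rows.
val derived = upstream.groupBy("region").count()

// 3. Upsert the result into a derived Hudi dataset, closing the loop.
derived.write.format("org.apache.hudi").
  option("hoodie.datasource.write.recordkey.field", "region").
  option("hoodie.datasource.write.precombine.field", "count").
  option("hoodie.table.name", "derived_counts").
  mode("append").
  save("/data/hudi/derived_counts")
```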
\ No newline at end of file
# Concepts
<!--
title: Concepts
keywords: hudi, design, storage, views, timeline
sidebar: mydoc_sidebar
permalink: concepts.html
toc: true
summary: "Here we introduce some basic concepts of Hudi and give a broad technical overview"
-->
Apache Hudi (pronounced "Hoodie") provides the following streaming primitives over datasets on DFS:
* Upsert (how do I change the dataset?)
* Incremental pull (how do I fetch data that changed?)
In this section, we discuss the key concepts and terminology that help you understand and effectively use these primitives.
## Timeline
At its core, Hudi maintains a `timeline` of all actions performed on the dataset at different `instants` in time, which provides views of the dataset as of different points in time. A Hudi instant consists of the following components:
* `Action type` : the type of action performed on the dataset
* `Instant time` : typically a timestamp (e.g. 20190117010349) that monotonically increases in the order of the action's begin time.
* `State` : the current state of the instant
Hudi guarantees that actions performed on the timeline are atomic and that the timeline is consistent with respect to instant time.
Key actions performed include:
* `COMMITS` - A commit denotes an **atomic write** of a batch of records into the dataset.
* `CLEANS` - Background activity that removes older versions of files in the dataset that are no longer needed.
* `DELTA_COMMIT` - A delta commit refers to an **atomic write** of a batch of records into a MergeOnRead-type dataset, where some or all of the data can be written just to delta logs.
* `COMPACTION` - Background activity that reconciles differential data structures within Hudi, e.g. moving updates from row-based log files into a columnar format. Internally, compaction manifests as a special commit on the timeline.
* `ROLLBACK` - Indicates that a commit/delta commit was unsuccessful and was rolled back, removing any partial files produced during that write.
* `SAVEPOINT` - Marks certain file groups as "saved" so that the cleaner does not delete them. In case of disaster/data recovery, it helps restore the dataset to a point on the timeline.
Any given instant can be in one of the following states:
* `REQUESTED` - the action has been scheduled but not yet started.
* `INFLIGHT` - the action is currently being performed.
* `COMPLETED` - the action has completed on the timeline.
<figure>
<img class="docimage" src="../images/hudi_timeline.png" alt="hudi_timeline.png" />
</figure>
The example above shows upserts happening between 10:00 and 10:20 on a Hudi dataset, roughly every 5 minutes, leaving commit metadata on the Hudi timeline along with other background cleaning/compaction activity.
The key observation is that the commit time indicates the `arrival time` of the data (10:20 AM), while the actual data organization reflects the actual time, or `event time`, the data is about (hourly buckets from 07:00). These are two key concepts when trading off data latency against completeness.
When there is late-arriving data (data with an event time of 9:00 arriving at 10:20, more than an hour late), we can see the upsert producing new data into even older time buckets/folders.
With the help of the timeline, an incremental query can pull only the new data committed successfully since 10:00 and very efficiently consume just the changed files, without scanning a much larger range of files, such as all time buckets after 07:00.
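The timeline itself is persisted as small metadata files under the dataset's `.hoodie` directory, one per instant, named `<instant time>.<action>`. A quick way to inspect it from `spark-shell` (the base path below is hypothetical):

```scala
import org.apache.hadoop.fs.{FileSystem, Path}

val basePath = new Path("/data/hudi/events")
val fs = FileSystem.get(basePath.toUri, spark.sparkContext.hadoopConfiguration)

// Completed instants show up as files like 20190117010349.commit,
// 20190117011500.clean or 20190117012200.deltacommit.
fs.listStatus(new Path(basePath, ".hoodie"))
  .map(_.getPath.getName)
  .filter(n => n.endsWith(".commit") || n.endsWith(".deltacommit") || n.endsWith(".clean"))
  .sorted
  .foreach(println)
```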
## File Organization
Hudi organizes a dataset on DFS into a directory structure under a `base path`. The dataset is broken up into partitions, which are folders containing the data files for that partition, very similar to Hive tables.
Each partition is uniquely identified by its `partition path`, relative to the base path.
Within each partition, files are organized into `file groups`, uniquely identified by a `file id`.
Each file group contains several `file slices`, where each slice contains a base columnar file (`*.parquet`) produced at a certain commit/compaction instant time, along with a set of log files (`*.log*`) containing inserts/updates to the base file since it was produced.
Hudi adopts an MVCC design, in which compaction merges logs and base files to produce new file slices, while cleaning removes unused/older file slices to reclaim space on DFS.
Hudi provides efficient upserts by mapping a given hoodie key (record key + partition path) to a file group through an indexing mechanism.
This mapping between record key and file group/file id never changes once the first version of a record has been written to a file. In short, the mapped file group contains all versions of a group of records.
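Putting these terms together, a merge-on-read dataset might look roughly like the following on DFS (the file names are schematic; the exact naming convention differs between releases):

```
/data/hudi/events/                          <- base path
  .hoodie/                                  <- timeline metadata (one file per instant)
  2019/01/17/                               <- partition path
    fileId1_001_20190117010349.parquet      <- base file of a file slice in file group fileId1
    .fileId1_20190117010349.log.1           <- log file with inserts/updates since that base file
    fileId1_002_20190117020000.parquet      <- newer file slice produced by compaction
```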
## Storage Types & Views
Hudi storage types define how data is indexed and laid out on DFS, and how the primitives and timeline activities described above are implemented on top of that organization (i.e. how data is written).
In turn, `views` define how the underlying data is exposed to queries (i.e. how data is read).
| Storage Type | Supported Views |
|-------------- |------------------|
| Copy On Write | Read Optimized + Incremental |
| Merge On Read | Read Optimized + Incremental + Near Real-time |
### Storage Types
Hudi supports the following storage types.
- [Copy On Write](#copy-on-write-storage) : Stores data using exclusively columnar file formats (e.g. parquet). Updates simply version and rewrite the files by performing a synchronous merge during write.
- [Merge On Read](#merge-on-read-storage) : Stores data using a combination of columnar (e.g. parquet) and row-based (e.g. avro) file formats. Updates are logged to delta files and later compacted, synchronously or asynchronously, to produce new versions of the columnar files.
The table below summarizes the trade-offs between these two storage types.
| Trade-off | Copy On Write | Merge On Read |
|-------------- |------------------| ------------------|
| Data latency | Higher | Lower |
| Update cost (I/O) | Higher (rewrites entire parquet file) | Lower (appends to delta log) |
| Parquet file size | Smaller (high update (I/O) cost) | Larger (low update cost) |
| Write amplification | Higher | Lower (depends on compaction strategy) |
### Views
Hudi supports the following views of stored data, illustrated in the sketch after this list.
- **Read Optimized View** : Queries on this view see the latest snapshot of the dataset as of a given commit or compaction action.
  This view exposes only the base/columnar files in the latest file slices to queries, and guarantees the same columnar query performance as an equivalent non-Hudi columnar dataset.
- **Incremental View** : Queries on this view see only data written to the dataset since a given commit/compaction. This view effectively provides a change stream to power incremental data pipelines.
- **Realtime View** : Queries on this view see the latest snapshot of the dataset as of a given delta commit action. This view provides a near-real-time dataset (minutes of latency) by merging the latest base files (e.g. parquet) and delta files (e.g. avro) on the fly.
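With the Spark datasource, the choice of view is, roughly speaking, a read option (a sketch assuming Hudi 0.5.x option keys; the paths are hypothetical):

```scala
// Read optimized view (the default): only the latest columnar base files.
val ro = spark.read.format("org.apache.hudi").
  option("hoodie.datasource.view.type", "read_optimized").
  load("/data/hudi/events/*/*/*")
ro.createOrReplaceTempView("events_ro")
spark.sql("select count(*) from events_ro").show()

// The incremental view additionally takes
// "hoodie.datasource.read.begin.instanttime" to bound the change stream;
// the realtime view is typically queried through Hive as a separate "_rt" table.
```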
The table below summarizes the trade-offs between the different views.
| Trade-off | Read Optimized | Realtime |
|-------------- |------------------| ------------------|
| Data latency | Higher | Lower |
| Query latency | Lower (raw columnar performance) | Higher (merge of columnar + row-based delta) |
## Copy On Write Storage
File slices in copy-on-write storage contain only the base/columnar files, and each commit produces new versions of the base files.
In other words, every commit is implicitly compacted, so that all data is stored in columnar form. This makes writes expensive (an entire columnar data file must be rewritten even if only a single byte of new data is committed), while reads incur no added cost.
This view is well suited for read-heavy analytical workloads.
The following illustrates how copy-on-write storage works conceptually as data is written into it and two queries run on top of it.
<figure>
    <img class="docimage" src="../images/hudi_cow.png" alt="hudi_cow.png" />
</figure>
As data gets written, updates to existing file groups produce a new slice for that file group stamped with the commit instant time, while inserts allocate a new file group and write its first slice.
These file slices and their commit instant times are color-coded above.
A SQL query running against such a dataset (e.g. `select count(*)` counting the records in that partition) first checks the latest commit on the timeline and filters out all but the latest file slice in each file group.
As you can see, an old query does not see the files of the currently in-flight commit, marked in pink, but a new query started after that commit picks up the new data. Thus, queries are immune to any write failures or partial writes and run only on committed data.
The intention of copy-on-write storage is to fundamentally improve how datasets are managed today, through
- first-class support for atomically updating data at the file level, instead of rewriting whole tables/partitions
- the ability to read just what was updated, instead of wasteful scans or lookups
- tight control of file sizes to keep query performance excellent (small files hurt query performance considerably).
## 读时合并存储
读时合并存储是写时复制的升级版,从某种意义上说,它仍然可以通过读优化表提供数据集的读取优化视图(写时复制的功能)。
此外,它将每个文件组的更新插入存储到基于行的增量日志中,通过文件id,将增量日志和最新版本的基本文件进行合并,从而提供近实时的数据查询。因此,此存储类型智能地平衡了读和写的成本,以提供近乎实时的查询。
这里最重要的一点是压缩器,它现在可以仔细挑选需要压缩到其列式基础文件中的增量日志(根据增量日志的文件大小),以保持查询性能(较大的增量日志将会提升近实时的查询时间,并同时需要更长的合并时间)。
以下内容说明了存储的工作方式,并显示了对近实时表和读优化表的查询。
<figure>
<img class="docimage" src="../images/hudi_mor.png" alt="hudi_mor.png" style="max-width: 100%" />
</figure>
此示例中发生了很多有趣的事情,这些带出了该方法的微妙之处。
- 现在,我们每1分钟左右就有一次提交,这是其他存储类型无法做到的。
- 现在,在每个文件id组中,都有一个增量日志,其中包含对基础列文件中记录的更新。
在示例中,增量日志包含10:05至10:10的所有数据。与以前一样,基本列式文件仍使用提交进行版本控制。
因此,如果只看一眼基本文件,那么存储布局看起来就像是写时复制表的副本。
- 定期压缩过程会从增量日志中合并这些更改,并生成基础文件的新版本,就像示例中10:05发生的情况一样。
- 有两种查询同一存储的方式:读优化(RO)表和近实时(RT)表,具体取决于我们选择查询性能还是数据新鲜度。
- 对于RO表来说,提交的数据何时可用于查询会有些许不同。请注意,在10:10运行的(RO表上的)此类查询将不会看到10:05之后的数据,而RT表上的查询总是能看到最新的数据。
- 何时触发压缩以及压缩什么是解决这些难题的关键。
通过实施压缩策略,在该策略中,与较旧的分区相比,我们会积极地压缩最新的分区,从而确保RO表能够以一致的方式看到几分钟内发布的数据。
读时合并存储的目的是直接在DFS上实现近实时处理,而不是将数据复制到可能无法处理大数据量的专用系统。
该存储还有一些其他好处,例如通过避免数据的同步合并来减少写放大,即每摄取1字节数据实际需要写入的数据量。
# GCS Filesystem
<!--
title: GCS Filesystem
keywords: hudi, hive, google cloud, storage, spark, presto
sidebar: mydoc_sidebar
permalink: gcs_hoodie.html
toc: true
summary: In this page, we go over how to configure hudi with Google Cloud Storage.
-->
For Hudi storage on GCS, **regional** buckets provide a DFS API with strong consistency.
## GCS Configs
There are two configurations required for Hudi GCS compatibility:
- Adding GCS Credentials for Hudi
- Adding required jars to classpath
### GCS Credentials
Add the required configs in your core-site.xml from where Hudi can fetch them. Replace the `fs.defaultFS` with your GCS bucket name and Hudi should be able to read/write from the bucket.
```xml
<property>
<name>fs.defaultFS</name>
<value>gs://hudi-bucket</value>
</property>
<property>
<name>fs.gs.impl</name>
<value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem</value>
<description>The FileSystem for gs: (GCS) uris.</description>
</property>
<property>
<name>fs.AbstractFileSystem.gs.impl</name>
<value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS</value>
<description>The AbstractFileSystem for gs: (GCS) uris.</description>
</property>
<property>
<name>fs.gs.project.id</name>
<value>GCS_PROJECT_ID</value>
</property>
<property>
<name>google.cloud.auth.service.account.enable</name>
<value>true</value>
</property>
<property>
<name>google.cloud.auth.service.account.email</name>
<value>GCS_SERVICE_ACCOUNT_EMAIL</value>
</property>
<property>
<name>google.cloud.auth.service.account.keyfile</name>
<value>GCS_SERVICE_ACCOUNT_KEYFILE</value>
</property>
```
### GCS Libs
GCS Hadoop libraries to add to our classpath:
- com.google.cloud.bigdataoss:gcs-connector:1.6.0-hadoop2
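For example — a sketch, not a definitive setup; adjust versions and paths to your environment — the connector can be pulled in when launching a Spark shell:
```Java
spark-shell --packages com.google.cloud.bigdataoss:gcs-connector:1.6.0-hadoop2 \
  --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
```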
<!-- Apache 软件基金会徽标(SVG 资源文件,矢量路径数据从略) -->
<!-- index.html -->
<!DOCTYPE html>
<html>
<head>
<meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1">
<meta name="viewport" content="width=device-width,initial-scale=1">
<meta charset="UTF-8">
<link rel="stylesheet" href="asset/vue.css">
<link rel="stylesheet" href="asset/style.css">
<link rel="stylesheet" href="asset/prism-darcula.css">
<!-- google ads -->
<script async src="//pagead2.googlesyndication.com/pagead/js/adsbygoogle.js"></script>
<!-- google webmaster -->
<meta name="google-site-verification" content="pyo9N70ZWyh8JB43bIu633mhxesJ1IcwWCZlM3jUfFo" />
</head>
<body>
<div id="app">now loading...</div>
<script>
window.$docsify = {
loadSidebar: 'SUMMARY.md',
name: 'Hudi 0.5.0 中文文档',
auto2top: true,
themeColor: '#003767',
repo: 'apachecn/hudi-doc-zh',
plugins: [window.docsPlugin],
bdStatId: '38525fdac4b5d4403900b943d4e7dd91',
cnzzId: '1275211409',
search: {
paths: 'auto',
placeholder: '搜索',
noData: '没有结果',
},
copyCode: {
buttonText: '复制',
errorText: 'Error',
successText: 'OK!',
},
}
</script>
<script src="asset/docsify.min.js"></script>
<script src="asset/docsify-copy-code.min.js"></script>
<script src="asset/search.min.js"></script>
<script src="asset/docsify-baidu-push.js"></script>
<script src="asset/docsify-baidu-stat.js"></script>
<script src="asset/docsify-cnzz.js"></script>
<script src="asset/docsify-apachecn-footer.js"></script>
<script src="asset/docsify-clicker.js"></script>
</body>
</html>
\ No newline at end of file
# Migration Guide
<!--
title: Migration Guide
keywords: hudi, migration, use case
sidebar: mydoc_sidebar
permalink: migration_guide.html
toc: false
summary: In this page, we will discuss some available tools for migrating your existing dataset into a Hudi dataset
-->
Hudi maintains metadata such as commit timeline and indexes to manage a dataset. The commit timeline helps to understand the actions happening on a dataset as well as the current state of a dataset. Indexes are used by Hudi to maintain a record key to file id mapping to efficiently locate a record. At the moment, Hudi supports writing only parquet columnar formats.
To be able to start using Hudi for your existing dataset, you will need to migrate your existing dataset into a Hudi managed dataset. There are a couple of ways to achieve this.
## Approaches
#### Use Hudi for new partitions alone
Hudi can be used to manage an existing dataset without affecting/altering the historical data already present in the
dataset. Hudi has been implemented to be compatible with such a mixed dataset with a caveat that either the complete
Hive partition is Hudi managed or not. Thus the lowest granularity at which Hudi manages a dataset is a Hive
partition. Start using the datasource API or the WriteClient to write to the dataset and make sure you start writing
to a new partition or convert your last N partitions into Hudi instead of the entire table. Note, since the historical
partitions are not managed by HUDI, none of the primitives provided by HUDI work on the data in those partitions. More concretely, one cannot perform upserts or incremental pull on such older partitions not managed by the HUDI dataset.
Take this approach if your dataset is an append only type of dataset and you do not expect to perform any updates to existing (or non Hudi managed) partitions.
#### Convert existing dataset to Hudi
Import your existing dataset into a Hudi managed dataset. Since all the data is Hudi managed, none of the limitations
of Approach 1 apply here. Updates spanning any partitions can be applied to this dataset and Hudi will efficiently
make the update available to queries. Note that not only do you get to use all Hudi primitives on this dataset,
there are other additional advantages of doing this. Hudi automatically manages file sizes of a Hudi managed dataset
. You can define the desired file size when converting this dataset and Hudi will ensure it writes out files
adhering to the config. It will also ensure that smaller files later get corrected by routing some new inserts into
small files rather than writing new small ones thus maintaining the health of your cluster.
There are a few options when choosing this approach.
#### Option 1
Use the HDFSParquetImporter tool. As the name suggests, this only works if your existing dataset is in parquet file format.
This tool essentially starts a Spark Job to read the existing parquet dataset and converts it into a HUDI managed dataset by re-writing all the data.
#### Option 2
For huge datasets, this could be as simple as (pseudocode — read each source partition and write it back out through the Hudi datasource):
```java
// Pseudocode: convert the dataset one partition at a time
for (partition <- listOfPartitionsInSourceDataset) {
  val inputDF = spark.read.format("any_input_format").load(partitionPath)
  inputDF.write.format("org.apache.hudi").option(/* record key, partition path, ... */).save(basePath)
}
```
#### Option 3
Write your own custom logic of how to load an existing dataset into a Hudi managed one. Please read about the RDD API
[here](quickstart.html).
Using the HDFSParquetImporter tool from Option 1: once Hudi has been built via `mvn clean install -DskipTests`, the hudi-cli shell can be fired up via `cd hudi-cli && ./hudi-cli.sh`, and the import kicked off as follows.
```Java
hudi->hdfsparquetimport
--upsert false
--srcPath /user/parquet/dataset/basepath
--targetPath
/user/hoodie/dataset/basepath
--tableName hoodie_table
--tableType COPY_ON_WRITE
--rowKeyField _row_key
--partitionPathField partitionStr
--parallelism 1500
--schemaFilePath /user/table/schema
--format parquet
--sparkMemory 6g
--retry 2
```
# 性能
<!--
title: 性能
keywords: hudi, index, storage, compaction, cleaning, implementation
sidebar: mydoc_sidebar
toc: true
permalink: performance.html
-->
在本节中,我们将介绍一些有关Hudi插入更新、增量提取的实际性能数据,并将其与实现这些任务的其它传统工具进行比较。
## 插入更新
下面显示了相对于批量加载,在写时复制存储的Hudi数据集上通过插入更新摄取NoSQL数据库数据所获得的速度提升,
数据集包括5个由小到大的表。
<figure>
<img class="docimage" src="../images/hudi_upsert_perf1.png" alt="hudi_upsert_perf1.png" style="max-width: 1000px" />
</figure>
由于Hudi可以通过增量构建数据集,它也为更频繁地调度摄取提供了可能性,从而减少了延迟,并显著节省了总体计算成本。
<figure>
<img class="docimage" src="../images/hudi_upsert_perf2.png" alt="hudi_upsert_perf2.png" style="max-width: 1000px" />
</figure>
Hudi的插入更新已经在t1表的单次提交中完成了高达4TB数据量的压力测试。
有关一些调优技巧,请参见[这里](https://cwiki.apache.org/confluence/display/HUDI/Tuning+Guide)
## 索引
为了有效地插入更新数据,Hudi需要将要写入的批量数据中的记录分类为插入和更新(并标记它所属的文件组)。
为了加快此操作的速度,Hudi采用了可插拔索引机制,该机制存储了recordKey和它所属的文件组ID之间的映射。
默认情况下,Hudi使用内置索引,该索引使用文件范围和布隆过滤器来完成此任务,相比于Spark Join,其速度最高可提高10倍。
当您将recordKey建模为单调递增时(例如时间戳前缀),Hudi可以利用范围过滤来避免与许多文件逐一比较,从而提供最佳的索引性能。
即使对于基于UUID的键,也有[已知技术](https://www.percona.com/blog/2014/12/19/store-uuid-optimized-way/)来达到同样目的。
例如,在具有80B键、3个分区、11416个文件、10TB数据的事件表上使用100M个时间戳前缀的键(5%的更新,95%的插入)时,
相比于原始Spark Join,Hudi索引速度的提升**约为7倍(440秒,相比于2880秒)**。
即使对于具有挑战性的工作负载,例如使用300个核心对具有3.25B个UUID键、30个分区、6180个文件的"100%更新"数据库摄取工作负载,Hudi索引也能带来**80-100%的加速**。
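下面是一个 Scala 示意(假设 `df` 是包含 `ts` 与 `uuid` 列的待写入 DataFrame),演示如何构造带时间戳前缀的记录键,以便利用上述范围过滤:
```Scala
import org.apache.spark.sql.functions._

// 仅示意:以事件时间戳作为记录键前缀,使键大体单调递增,
// 布隆索引即可先按文件的键范围裁剪无关文件,再做布隆过滤器查找
val keyedDF = df.withColumn("record_key", concat_ws("_", col("ts"), col("uuid")))
// 随后将记录键字段指定为 "record_key" 即可(见"写入 Hudi 数据集"一节)
```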
## 读优化查询
读优化视图的主要设计目标是在不影响查询的情况下实现上一节中提到的延迟减少和效率提高。
下图比较了对Hudi和非Hudi数据集的Hive、Presto、Spark查询,并对此进行说明。
**Hive**
<figure>
<img class="docimage" src="../images/hudi_query_perf_hive.png" alt="hudi_query_perf_hive.png" style="max-width: 800px" />
</figure>
**Spark**
<figure>
<img class="docimage" src="../images/hudi_query_perf_spark.png" alt="hudi_query_perf_spark.png" style="max-width: 1000px" />
</figure>
**Presto**
<figure>
<img class="docimage" src="../images/hudi_query_perf_presto.png" alt="hudi_query_perf_presto.png" style="max-width: 1000px" />
</figure>
# 演讲 & Hudi 用户
<!--
title: 演讲 & Hudi 用户
keywords: hudi, talks, presentation
sidebar: mydoc_sidebar
permalink: powered_by.html
toc: false
-->
## 已使用
#### Uber
Hudi最初由[Uber](https://uber.com)开发,用于实现[低延迟、高效率的数据库摄取](http://www.slideshare.net/vinothchandar/hadoop-strata-talk-uber-your-hadoop-has-arrived/32)
Hudi自2016年8月开始在生产环境上线,在Hadoop上驱动约100个非常关键的业务表,支撑约几百TB的数据规模(前10名包括行程、乘客、司机)。
Hudi还支持几个增量的Hive ETL管道,并且目前已集成到Uber的数据分发系统中。
#### EMIS Health
[EMIS Health](https://www.emishealth.com/)是英国最大的初级保健IT软件提供商,其数据集包括超过5000亿的医疗保健记录。HUDI用于管理生产中的分析数据集,并使其与上游源保持同步。Presto用于查询以HUDI格式写入的数据。
#### Yields.io
Yields.io是第一个使用AI在企业范围内进行自动模型验证和实时监控的金融科技平台。他们的数据湖由Hudi管理,他们还积极使用Hudi为增量式、跨语言/平台机器学习构建基础架构。
#### Yotpo
Hudi在Yotpo有不少用途。首先,在他们的[开源ETL框架](https://github.com/YotpoLtd/metorikku)中集成了Hudi作为CDC管道的输出写入程序,即从数据库binlog生成的事件流到Kafka然后再写入S3。
## 演讲 & 报告
1. ["Hoodie: Incremental processing on Hadoop at Uber"](https://conferences.oreilly.com/strata/strata-ca/public/schedule/detail/56511) - By Vinoth Chandar & Prasanna Rajaperumal
Mar 2017, Strata + Hadoop World, San Jose, CA
2. ["Hoodie: An Open Source Incremental Processing Framework From Uber"](http://www.dataengconf.com/hoodie-an-open-source-incremental-processing-framework-from-uber) - By Vinoth Chandar.
Apr 2017, DataEngConf, San Francisco, CA [Slides](https://www.slideshare.net/vinothchandar/hoodie-dataengconf-2017) [Video](https://www.youtube.com/watch?v=7Wudjc-v7CA)
3. ["Incremental Processing on Large Analytical Datasets"](https://spark-summit.org/2017/events/incremental-processing-on-large-analytical-datasets/) - By Prasanna Rajaperumal
June 2017, Spark Summit 2017, San Francisco, CA. [Slides](https://www.slideshare.net/databricks/incremental-processing-on-large-analytical-datasets-with-prasanna-rajaperumal-and-vinoth-chandar) [Video](https://www.youtube.com/watch?v=3HS0lQX-cgo&feature=youtu.be)
4. ["Hudi: Unifying storage and serving for batch and near-real-time analytics"](https://conferences.oreilly.com/strata/strata-ny/public/schedule/detail/70937) - By Nishith Agarwal & Balaji Vardarajan
September 2018, Strata Data Conference, New York, NY
5. ["Hudi: Large-Scale, Near Real-Time Pipelines at Uber"](https://databricks.com/session/hudi-near-real-time-spark-pipelines-at-petabyte-scale) - By Vinoth Chandar & Nishith Agarwal
October 2018, Spark+AI Summit Europe, London, UK
6. ["Powering Uber's global network analytics pipelines in real-time with Apache Hudi"](https://www.youtube.com/watch?v=1w3IpavhSWA) - By Ethan Guo & Nishith Agarwal, April 2019, Data Council SF19, San Francisco, CA.
7. ["Building highly efficient data lakes using Apache Hudi (Incubating)"](https://www.slideshare.net/ChesterChen/sf-big-analytics-20190612-building-highly-efficient-data-lakes-using-apache-hudi) - By Vinoth Chandar
June 2019, SF Big Analytics Meetup, San Mateo, CA
8. ["Apache Hudi (Incubating) - The Past, Present and Future Of Efficient Data Lake Architectures"](https://docs.google.com/presentation/d/1FHhsvh70ZP6xXlHdVsAI0g__B_6Mpto5KQFlZ0b8-mM) - By Vinoth Chandar & Balaji Varadarajan
September 2019, ApacheCon NA 19, Las Vegas, NV, USA
## 文章
1. ["The Case for incremental processing on Hadoop"](https://www.oreilly.com/ideas/ubers-case-for-incremental-processing-on-hadoop) - O'reilly Ideas article by Vinoth Chandar
2. ["Hoodie: Uber Engineering's Incremental Processing Framework on Hadoop"](https://eng.uber.com/hoodie/) - Engineering Blog By Prasanna Rajaperumal
# 查询 Hudi 数据集
<!--
title: 查询 Hudi 数据集
keywords: hudi, hive, spark, sql, presto
sidebar: mydoc_sidebar
permalink: querying_data.html
toc: true
summary: 在这一页里,我们介绍了如何在Hudi构建的表上启用SQL查询。
-->
从概念上讲,Hudi物理存储一次数据到DFS上,同时在其上提供三个逻辑视图,如[之前](concepts.html#views)所述。
数据集同步到Hive Metastore后,它将提供由Hudi的自定义输入格式支持的Hive外部表。一旦提供了适当的Hudi捆绑包,
就可以通过Hive、Spark和Presto之类的常用查询引擎来查询数据集。
具体来说,写入过程中会基于传入的[table name](configurations.html#TABLE_NAME_OPT_KEY)注册两个Hive表。
例如,如果`table name = hudi_tbl`,我们得到
- `hudi_tbl` 实现了由 `HoodieParquetInputFormat` 支持的数据集的读优化视图,从而提供了纯列式数据。
- `hudi_tbl_rt` 实现了由 `HoodieParquetRealtimeInputFormat` 支持的数据集的实时视图,从而提供了基础数据和日志数据的合并视图。
如概念部分所述,[增量处理](https://www.oreilly.com/ideas/ubers-case-for-incremental-processing-on-hadoop)所需要的
一个关键原语是`增量拉取`(以从数据集中获取更改流/日志)。您可以对Hudi数据集进行增量拉取,这意味着自指定的即时时间起,
您可以只获取全部新增和更新的行。这与插入更新一起使用,对于构建增量数据管道尤其有用:从一个或多个源Hudi表(数据流/事实表)增量拉取变更,
与其他表(数据集/维度表)连接后,再将[增量写出](writing_data.html)到目标Hudi数据集。增量视图是通过查询上述表之一并带上特殊配置来实现的,
该特殊配置指示查询计划仅需要从数据集中读取增量数据。
接下来,我们将详细讨论在每个查询引擎上如何访问所有三个视图。
## Hive
为了使Hive能够识别Hudi数据集并正确查询,
HiveServer2需要在其[辅助jars路径](https://www.cloudera.com/documentation/enterprise/5-6-x/topics/cm_mc_hive_udf.html#concept_nc3_mms_lr)中提供`hudi-hadoop-mr-bundle-x.y.z-SNAPSHOT.jar`
这将确保输入格式类及其依赖项可用于查询计划和执行。
### 读优化表 {#hive-ro-view}
除了上述设置之外,对于beeline cli访问,还需要将`hive.input.format`变量设置为`org.apache.hudi.hadoop.HoodieParquetInputFormat`输入格式的完全限定路径名。
对于Tez,还需要将`hive.tez.input.format`设置为`org.apache.hadoop.hive.ql.io.HiveInputFormat`
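即,在 beeline 会话中可以这样设置(示意):
```Java
set hive.input.format=org.apache.hudi.hadoop.HoodieParquetInputFormat;
set hive.tez.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
```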
### 实时表 {#hive-rt-view}
除了在HiveServer2上安装Hive捆绑jars之外,还需要将其放在整个集群的hadoop/hive安装中,这样查询也可以使用自定义RecordReader。
### 增量拉取 {#hive-incr-pull}
`HiveIncrementalPuller`允许通过HiveQL从大型事实/维表中增量提取更改,
结合了Hive(可靠地处理复杂的SQL查询)和增量原语的好处(通过增量拉取而不是完全扫描来加快查询速度)。
该工具使用Hive JDBC运行hive查询并将其结果保存在临时表中,这个表可以被插入更新。
Upsert实用程序(`HoodieDeltaStreamer`)具有目录结构所需的所有状态,以了解目标表上的提交时间应为多少。
例如:`/app/incremental-hql/intermediate/{source_table_name}_temp/{last_commit_included}`
已注册的Delta Hive表的格式为`{tmpdb}.{source_table}_{last_commit_included}`
以下是HiveIncrementalPuller的配置选项:
| **配置** | **描述** | **默认值** |
|---|---|---|
|hiveUrl| 要连接的Hive Server 2的URL | |
|hiveUser| Hive Server 2 用户名 | |
|hivePass| Hive Server 2 密码 | |
|queue| YARN 队列名称 | |
|tmp| DFS中存储临时增量数据的目录。目录结构将遵循约定。请参阅以下部分。 | |
|extractSQLFile| 在源表上要执行的提取数据的SQL。提取的数据将是自特定时间点以来已更改的所有行。 | |
|sourceTable| 源表名称。在Hive环境属性中需要设置。 | |
|targetTable| 目标表名称。中间存储目录结构需要。 | |
|sourceDataPath| 源DFS基本路径。这是读取Hudi元数据的地方。 | |
|targetDataPath| 目标DFS基本路径。 这是计算fromCommitTime所必需的。 如果显式指定了fromCommitTime,则不需要设置这个参数。 | |
|tmpdb| 用来创建中间临时增量表的数据库 | hoodie_temp |
|fromCommitTime| 这是最重要的参数。 这是从中提取更改的记录的时间点。 | |
|maxCommits| 要包含在拉取中的提交数。将此设置为-1将包括从fromCommitTime开始的所有提交。将此设置为大于0的值,将包括在fromCommitTime之后仅更改指定提交次数的记录。如果您需要一次赶上两次提交,则可能需要这样做。| 3 |
|help| 实用程序帮助 | |
设置fromCommitTime=0和maxCommits=-1将提取整个源数据集,可用于启动Backfill。
如果目标数据集是Hudi数据集,则该实用程序可以确定目标数据集是否没有提交或延迟超过24小时(这是可配置的),
它将自动使用Backfill配置,因为增量应用最近24小时的更改会比Backfill花费更多的时间。
该工具当前的局限性在于缺乏在混合模式(正常模式和增量模式)下自联接同一表的支持。
**关于使用Fetch任务执行的Hive查询的说明:**
由于Fetch任务为每个分区调用InputFormat.listStatus(),每个listStatus()调用都会列出Hoodie元数据。
为了避免这种情况,如下操作可能是有用的,即使用Hive session属性对增量查询禁用Fetch任务:
`set hive.fetch.task.conversion = none;`。这将确保Hive查询使用Map Reduce执行,
合并分区(用逗号分隔),并且对所有这些分区仅调用一次InputFormat.listStatus()。
## Spark
Spark可将Hudi jars和捆绑包轻松部署和管理到作业/笔记本中。简而言之,通过Spark有两种方法可以访问Hudi数据集。
- **Hudi DataSource**:支持读取优化和增量拉取,类似于标准数据源(例如:`spark.read.parquet`)的工作方式。
- **以Hive表读取**:支持所有三个视图,包括实时视图,依赖于自定义的Hudi输入格式(再次类似Hive)。
通常,您的spark作业需要依赖`hudi-spark``hudi-spark-bundle-x.y.z.jar`
它们必须位于驱动程序和执行程序的类路径上(提示:使用`--jars`参数)。
### 读优化表 {#spark-ro-view}
要使用SparkSQL将RO表读取为Hive表,只需按如下所示将路径过滤器推入sparkContext。
对于Hudi表,该方法保留了Spark内置的读取Parquet文件的优化功能,例如进行矢量化读取。
```Scala
spark.sparkContext.hadoopConfiguration.setClass("mapreduce.input.pathFilter.class", classOf[org.apache.hudi.hadoop.HoodieROTablePathFilter], classOf[org.apache.hadoop.fs.PathFilter]);
```
如果您希望通过数据源在DFS上使用全局路径,则只需执行以下类似操作即可得到Spark DataFrame。
```Scala
Dataset<Row> hoodieROViewDF = spark.read().format("org.apache.hudi")
// pass any path glob, can include hudi & non-hudi datasets
.load("/glob/path/pattern");
```
### 实时表 {#spark-rt-view}
当前,实时表只能在Spark中作为Hive表进行查询。为了做到这一点,设置`spark.sql.hive.convertMetastoreParquet = false`
迫使Spark回退到使用Hive Serde读取数据(计划/执行仍然是Spark)。
```Scala
$ spark-shell --jars hudi-spark-bundle-x.y.z-SNAPSHOT.jar --driver-class-path /etc/hive/conf --packages com.databricks:spark-avro_2.11:4.0.0 --conf spark.sql.hive.convertMetastoreParquet=false --num-executors 10 --driver-memory 7g --executor-memory 2g --master yarn-client
scala> sqlContext.sql("select count(*) from hudi_rt where datestr = '2016-10-02'").show()
```
### 增量拉取 {#spark-incr-pull}
`hudi-spark`模块提供了DataSource API,这是一种从Hudi数据集中提取数据并通过Spark处理数据的更优雅的方法。
如下所示是一个示例增量拉取,它将获取自`beginInstantTime`以来写入的所有记录。
```Java
Dataset<Row> hoodieIncViewDF = spark.read()
.format("org.apache.hudi")
.option(DataSourceReadOptions.VIEW_TYPE_OPT_KEY(),
DataSourceReadOptions.VIEW_TYPE_INCREMENTAL_OPT_VAL())
.option(DataSourceReadOptions.BEGIN_INSTANTTIME_OPT_KEY(),
<beginInstantTime>)
.load(tablePath); // For incremental view, pass in the root/base path of dataset
```
请参阅[设置](configurations.html#spark-datasource)部分,以查看所有数据源选项。
另外,`HoodieReadClient`通过Hudi的隐式索引提供了以下功能。
| **API** | **描述** |
|---|---|
| read(keys) | 使用Hudi自带的索引,通过快速查找将与键对应的数据作为DataFrame读出 |
| filterExists() | 从提供的RDD[HoodieRecord]中过滤出已经存在的记录。对删除重复数据有用 |
| checkExists(keys) | 检查提供的键是否存在于Hudi数据集中 |
## Presto
Presto是一种常用的查询引擎,可提供交互式查询性能。 Hudi RO表可以在Presto中无缝查询。
这需要在所有Presto节点的安装中,将`hudi-presto-bundle` jar放入`<presto_install>/plugin/hive-hadoop2/`中。
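放好该 jar 之后,RO 表即可像普通 Hive 表一样查询,例如(仅示意,表名为假设):
```Java
presto:default> select count(*) from hudi_tbl;
```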
# 快速开始
<!--
title: 快速开始
keywords: hudi, quickstart
sidebar: mydoc_sidebar
toc: true
permalink: quickstart.html
-->
<br/>
本指南通过使用spark-shell简要介绍了Hudi的功能。借助Spark数据源,我们将通过代码片段展示如何插入和更新Hudi默认存储类型的数据集:
[写时复制](https://hudi.apache.org/concepts.html#copy-on-write-storage)。每次写操作之后,我们还将展示如何读取快照数据和增量数据。
## 设置spark-shell
Hudi适用于Spark-2.x版本。您可以按照[此处](https://spark.apache.org/downloads.html)的说明设置spark。
在解压后的目录中,使用spark-shell运行Hudi:
```Scala
bin/spark-shell --packages org.apache.hudi:hudi-spark-bundle:0.5.0-incubating --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
```
设置表名、基本路径和数据生成器来为本指南生成记录。
```Scala
import org.apache.hudi.QuickstartUtils._
import scala.collection.JavaConversions._
import org.apache.spark.sql.SaveMode._
import org.apache.hudi.DataSourceReadOptions._
import org.apache.hudi.DataSourceWriteOptions._
import org.apache.hudi.config.HoodieWriteConfig._
val tableName = "hudi_cow_table"
val basePath = "file:///tmp/hudi_cow_table"
val dataGen = new DataGenerator
```
[数据生成器](https://github.com/apache/incubator-hudi/blob/master/hudi-spark/src/main/java/org/apache/hudi/QuickstartUtils.java)
可以基于[行程样本模式](https://github.com/apache/incubator-hudi/blob/master/hudi-spark/src/main/java/org/apache/hudi/QuickstartUtils.java#L57)
生成插入和更新的样本。
## 插入数据 {#inserts}
生成一些新的行程样本,将其加载到DataFrame中,然后将DataFrame写入Hudi数据集中,如下所示。
```Scala
val inserts = convertToStringList(dataGen.generateInserts(10))
val df = spark.read.json(spark.sparkContext.parallelize(inserts, 2))
df.write.format("org.apache.hudi").
options(getQuickstartWriteConfigs).
option(PRECOMBINE_FIELD_OPT_KEY, "ts").
option(RECORDKEY_FIELD_OPT_KEY, "uuid").
option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
option(TABLE_NAME, tableName).
mode(Overwrite).
save(basePath);
```
`mode(Overwrite)`覆盖并重新创建数据集(如果已经存在)。
您可以检查在`/tmp/hudi_cow_table/<region>/<country>/<city>/`下生成的数据。我们提供了一个记录键
([schema](#sample-schema)中的`uuid`),分区字段(`region/country/city`)和组合逻辑([schema](#sample-schema)中的`ts`)
以确保行程记录在每个分区中都是唯一的。更多信息请参阅
[对Hudi中的数据进行建模](https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=113709185#Frequentlyaskedquestions(FAQ)-HowdoImodelthedatastoredinHudi?),
有关将数据提取到Hudi中的方法的信息,请参阅[写入Hudi数据集](https://hudi.apache.org/writing_data.html)
这里我们使用默认的写操作:`插入更新`。 如果您的工作负载没有`更新`,也可以使用更快的`插入``批量插入`操作。
想了解更多信息,请参阅[写操作](https://hudi.apache.org/writing_data.html#write-operations)
## 查询数据 {#query}
将数据文件加载到DataFrame中。
```Scala
val roViewDF = spark.
read.
format("org.apache.hudi").
load(basePath + "/*/*/*/*")
roViewDF.registerTempTable("hudi_ro_table")
spark.sql("select fare, begin_lon, begin_lat, ts from hudi_ro_table where fare > 20.0").show()
spark.sql("select _hoodie_commit_time, _hoodie_record_key, _hoodie_partition_path, rider, driver, fare from hudi_ro_table").show()
```
该查询提供已摄取数据的读优化视图。由于我们的分区路径(`region/country/city`)从基本路径开始有3级嵌套,
因此我们使用了`load(basePath + "/*/*/*/*")`。
有关支持的所有存储类型和视图的更多信息,请参考[存储类型和视图](https://hudi.apache.org/concepts.html#storage-types--views)
## 更新数据 {#updates}
这类似于插入新数据。使用数据生成器生成对现有行程的更新,加载到DataFrame中并将DataFrame写入hudi数据集。
```Scala
val updates = convertToStringList(dataGen.generateUpdates(10))
val df = spark.read.json(spark.sparkContext.parallelize(updates, 2));
df.write.format("org.apache.hudi").
options(getQuickstartWriteConfigs).
option(PRECOMBINE_FIELD_OPT_KEY, "ts").
option(RECORDKEY_FIELD_OPT_KEY, "uuid").
option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
option(TABLE_NAME, tableName).
mode(Append).
save(basePath);
```
注意,保存模式现在为`追加`。通常,除非您是第一次尝试创建数据集,否则请始终使用追加模式。
现在[再次查询](#query)数据将显示更新后的行程。每个写操作都会生成一个新的、以时间戳表示的[提交](http://hudi.incubator.apache.org/concepts.html)。
请在之前提交中相同的`_hoodie_record_key`上查看`_hoodie_commit_time`、`rider`、`driver`字段的变更。
## 增量查询
Hudi还提供了获取给定提交时间戳以来已更改的记录流的功能。
这可以通过使用Hudi的增量视图并提供所需更改的开始时间来实现。
如果我们需要给定提交之后的所有更改(这是常见的情况),则无需指定结束时间。
```Scala
// reload data
spark.
read.
format("org.apache.hudi").
load(basePath + "/*/*/*/*").
createOrReplaceTempView("hudi_ro_table")
val commits = spark.sql("select distinct(_hoodie_commit_time) as commitTime from hudi_ro_table order by commitTime").map(k => k.getString(0)).take(50)
val beginTime = commits(commits.length - 2) // commit time we are interested in
// 增量查询数据
val incViewDF = spark.
read.
format("org.apache.hudi").
option(VIEW_TYPE_OPT_KEY, VIEW_TYPE_INCREMENTAL_OPT_VAL).
option(BEGIN_INSTANTTIME_OPT_KEY, beginTime).
load(basePath);
incViewDF.registerTempTable("hudi_incr_table")
spark.sql("select `_hoodie_commit_time`, fare, begin_lon, begin_lat, ts from hudi_incr_table where fare > 20.0").show()
```
这将提供在开始时间提交之后发生的所有更改,其中包含票价大于20.0的过滤器。关于此功能的独特之处在于,它现在使您可以在批量数据上创作流式管道。
## 特定时间点查询
让我们看一下如何查询特定时间的数据。可以通过将结束时间指向特定的提交时间,将开始时间指向"000"(表示最早的提交时间)来表示特定时间。
```Scala
val beginTime = "000" // Represents all commits > this time.
val endTime = commits(commits.length - 2) // commit time we are interested in
// 增量查询数据
val incViewDF = spark.read.format("org.apache.hudi").
option(VIEW_TYPE_OPT_KEY, VIEW_TYPE_INCREMENTAL_OPT_VAL).
option(BEGIN_INSTANTTIME_OPT_KEY, beginTime).
option(END_INSTANTTIME_OPT_KEY, endTime).
load(basePath);
incViewDF.registerTempTable("hudi_incr_table")
spark.sql("select `_hoodie_commit_time`, fare, begin_lon, begin_lat, ts from hudi_incr_table where fare > 20.0").show()
```
## 下一步?
您也可以通过[自己构建hudi](https://github.com/apache/incubator-hudi#building-apache-hudi-from-source)来快速开始,
并在spark-shell命令中使用`--jars <path to hudi_code>/packaging/hudi-spark-bundle/target/hudi-spark-bundle-*.*.*-SNAPSHOT.jar`
而不是`--packages org.apache.hudi:hudi-spark-bundle:0.5.0-incubating`
这里我们使用Spark演示了Hudi的功能。此外,Hudi支持多种存储类型/视图,并且可以从Hive、Spark、Presto等查询引擎查询Hudi数据集。
我们制作了一个基于Docker设置、所有依赖系统都在本地运行的[演示视频](https://www.youtube.com/watch?v=VhNgUsxdrD0)
我们建议您复制相同的设置然后按照[这里](docker_demo.html)的步骤自己运行这个演示。
另外,如果您正在寻找将现有数据迁移到Hudi的方法,请参考[迁移指南](migration_guide.html)
# S3 Filesystem
<!--
title: S3 Filesystem
keywords: hudi, hive, aws, s3, spark, presto
sidebar: mydoc_sidebar
permalink: s3_hoodie.html
toc: true
summary: In this page, we go over how to configure Hudi with S3 filesystem.
-->
In this page, we explain how to get your Hudi spark job to store into AWS S3.
## AWS configs
There are two configurations required for Hudi-S3 compatibility:
- Adding AWS Credentials for Hudi
- Adding required Jars to classpath
### AWS Credentials
Simplest way to use Hudi with S3, is to configure your `SparkSession` or `SparkContext` with S3 credentials. Hudi will automatically pick this up and talk to S3.
Alternatively, add the required configs in your core-site.xml from where Hudi can fetch them. Replace the `fs.defaultFS` with your S3 bucket name and Hudi should be able to read/write from the bucket.
```xml
<property>
<name>fs.defaultFS</name>
<value>s3://ysharma</value>
</property>
<property>
<name>fs.s3.impl</name>
<value>org.apache.hadoop.fs.s3native.NativeS3FileSystem</value>
</property>
<property>
<name>fs.s3.awsAccessKeyId</name>
<value>AWS_KEY</value>
</property>
<property>
<name>fs.s3.awsSecretAccessKey</name>
<value>AWS_SECRET</value>
</property>
<property>
<name>fs.s3n.awsAccessKeyId</name>
<value>AWS_KEY</value>
</property>
<property>
<name>fs.s3n.awsSecretAccessKey</name>
<value>AWS_SECRET</value>
</property>
```
Utilities such as hudi-cli or the deltastreamer tool can pick up S3 credentials via environment variables prefixed with `HOODIE_ENV_`. For example, below is a bash snippet that sets up
such variables and then lets the cli work on datasets stored in s3.
```Java
export HOODIE_ENV_fs_DOT_s3a_DOT_access_DOT_key=$accessKey
export HOODIE_ENV_fs_DOT_s3a_DOT_secret_DOT_key=$secretKey
export HOODIE_ENV_fs_DOT_s3_DOT_awsAccessKeyId=$accessKey
export HOODIE_ENV_fs_DOT_s3_DOT_awsSecretAccessKey=$secretKey
export HOODIE_ENV_fs_DOT_s3n_DOT_awsAccessKeyId=$accessKey
export HOODIE_ENV_fs_DOT_s3n_DOT_awsSecretAccessKey=$secretKey
export HOODIE_ENV_fs_DOT_s3n_DOT_impl=org.apache.hadoop.fs.s3a.S3AFileSystem
```
### AWS Libs
AWS Hadoop libraries to add to our classpath:
- com.amazonaws:aws-java-sdk:1.10.34
- org.apache.hadoop:hadoop-aws:2.7.3
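For example — a sketch using the versions listed above; adjust to your Hadoop/Spark distribution — these can be passed to spark-shell via `--packages`:
```Java
spark-shell --packages org.apache.hadoop:hadoop-aws:2.7.3,com.amazonaws:aws-java-sdk:1.10.34 \
  --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
```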
# 使用案例
<!--
title: 使用案例
keywords: hudi, data ingestion, etl, real time, use cases
sidebar: mydoc_sidebar
permalink: use_cases.html
toc: true
summary: "以下是一些使用Hudi的示例,说明了加快处理速度和提高效率的好处"
-->
## 近实时摄取
将外部源(如事件日志、数据库、外部源)的数据摄取到[Hadoop数据湖](http://martinfowler.com/bliki/DataLake.html)是一个众所周知的问题。
尽管这些数据对整个组织来说是最有价值的,但不幸的是,在大多数(如果不是全部)Hadoop部署中都使用零散的方式解决,即使用多个不同的摄取工具。
对于RDBMS摄取,Hudi提供 __通过更新插入达到更快加载__,而不是昂贵且低效的批量加载。例如,您可以读取MySQL BIN日志或[Sqoop增量导入](https://sqoop.apache.org/docs/1.4.2/SqoopUserGuide.html#_incremental_imports)并将其应用于
DFS上的等效Hudi表。这比[批量合并任务](https://sqoop.apache.org/docs/1.4.0-incubating/SqoopUserGuide.html#id1770457)[复杂的手工合并工作流](http://hortonworks.com/blog/four-step-strategy-incremental-updates-hive/)更快/更有效率。
对于NoSQL数据存储,如[Cassandra](http://cassandra.apache.org/) / [Voldemort](http://www.project-voldemort.com/voldemort/) / [HBase](https://hbase.apache.org/),即使是中等规模大小也会存储数十亿行。
毫无疑问, __全量加载不可行__,如果摄取需要跟上较高的更新量,那么则需要更有效的方法。
即使对于像[Kafka](http://kafka.apache.org)这样的不可变数据源,Hudi也可以 __强制在HDFS上使用最小文件大小__,这以综合的方式解决[HDFS小文件问题](https://blog.cloudera.com/blog/2009/02/the-small-files-problem/)来改善NameNode的健康状况。这对事件流来说尤为重要,因为事件流通常具有较高的容量(例如:点击流),如果管理不当,可能会对Hadoop集群造成严重损害。
在所有源中,通过`commits`这一概念,Hudi增加了以原子方式向消费者发布新数据的功能,这种功能十分必要。
## 近实时分析
通常,实时[数据集市](https://en.wikipedia.org/wiki/Data_mart)由专业(实时)数据分析存储提供支持,例如[Druid](http://druid.io/)[Memsql](http://www.memsql.com/)[OpenTSDB](http://opentsdb.net/)
这对于较小规模的数据量来说绝对是完美的([相比于这样安装Hadoop](https://blog.twitter.com/2015/hadoop-filesystem-at-twitter)),这种情况需要在亚秒级响应查询,例如系统监控或交互式实时分析。
但是,由于Hadoop上的数据太陈旧了,通常这些系统会被滥用于非交互式查询,这导致利用率不足和硬件/许可证成本的浪费。
另一方面,Hadoop上的交互式SQL解决方案(如Presto和SparkSQL)表现出色,在 __几秒钟内完成查询__。
通过将 __数据新鲜度提高到几分钟__,Hudi可以提供一个更有效的替代方案,并支持存储在DFS中的 __数量级更大的数据集__ 的实时分析。
此外,Hudi没有外部依赖(如专用于实时分析的HBase集群),因此可以在更新鲜的数据上实现更快的分析,而不会增加运维开销。
## 增量处理管道
Hadoop提供的一个基本能力是构建一系列数据集,这些数据集通过表示为工作流的DAG相互派生。
工作流通常取决于多个上游工作流输出的新数据,新数据的可用性传统上由新的DFS文件夹/Hive分区指示。
让我们举一个具体的例子来说明这点。上游工作流`U`可以每小时创建一个Hive分区,在每小时结束时(processing_time)使用该小时的数据(event_time),提供1小时的有效新鲜度。
然后,下游工作流`D``U`结束后立即启动,并在下一个小时内自行处理,将有效延迟时间增加到2小时。
上面的示例忽略了迟到的数据,即`processing_time``event_time`分开时。
不幸的是,在今天的后移动和前物联网世界中,__来自间歇性连接的移动设备和传感器的延迟数据是常态,而不是异常__。
在这种情况下,保证正确性的唯一补救措施是[重新处理最后几个小时](https://falcon.apache.org/FalconDocumentation.html#Handling_late_input_data)的数据,
每小时一遍又一遍,这可能会严重影响整个生态系统的效率。例如:试想一下,在数百个工作流中每小时重新处理TB级的数据。
Hudi通过以单个记录为粒度的方式(而不是文件夹/分区)从上游 Hudi数据集`HU`消费新数据(包括迟到数据),来解决上面的问题。
应用处理逻辑,并使用下游Hudi数据集`HD`高效地更新/协调迟到的数据。在这里,`HU`和`HD`可以以更高的频率(比如15分钟)被连续调度,
从而使`HD`提供端到端30分钟的延迟。
为实现这一目标,Hudi采用了类似于[Spark Streaming](https://spark.apache.org/docs/latest/streaming-programming-guide.html#join-operations)、发布/订阅系统等流处理框架,以及像[Kafka](http://kafka.apache.org/documentation/#theconsumer)
[Oracle XStream](https://docs.oracle.com/cd/E11882_01/server.112/e16545/xstrm_cncpt.htm#XSTRM187)等数据库复制技术的类似概念。
如果感兴趣,可以在[这里](https://www.oreilly.com/ideas/ubers-case-for-incremental-processing-on-hadoop)找到有关增量处理(相比于流处理和批处理)好处的更详细解释。
## DFS的数据分发
一个常用场景是先在Hadoop上处理数据,然后将其分发回在线服务存储层,以供应用程序使用。
例如,一个Spark管道可以[确定Hadoop上的紧急制动事件](https://eng.uber.com/telematics/)并将它们加载到服务存储层(如ElasticSearch)中,供Uber应用程序使用以增加安全驾驶。这种用例中,通常架构会在Hadoop和服务存储之间引入`队列`,以防止目标服务存储被压垮。
对于队列的选择,一种流行的选择是Kafka,这个模型经常导致 __在DFS上存储相同数据的冗余(用于计算结果的离线分析)和Kafka(用于分发)__
通过将每次运行的Spark管道更新插入的输出转换为Hudi数据集,Hudi可以再次有效地解决这个问题,然后可以以增量方式获取尾部数据(就像Kafka topic一样)然后写入服务存储层。
# 写入 Hudi 数据集
<!--
title: 写入 Hudi 数据集
keywords: hudi, incremental, batch, stream, processing, Hive, ETL, Spark SQL
sidebar: mydoc_sidebar
permalink: writing_data.html
toc: true
summary: 这一页里,我们将讨论一些可用的工具,这些工具可用于增量摄取和存储数据。
-->
这一节我们将介绍使用[DeltaStreamer](#deltastreamer)工具从外部源甚至其他Hudi数据集摄取新更改的方法,
以及通过使用[Hudi数据源](#datasource-writer)的upserts加快大型Spark作业的方法。
对于此类数据集,我们可以使用各种查询引擎[查询](querying_data.html)它们。
## 写操作
在此之前,了解Hudi数据源及delta streamer工具提供的三种不同的写操作以及如何最佳利用它们可能会有所帮助。
这些操作可以在针对数据集发出的每个提交/增量提交中进行选择/更改。
- **UPSERT(插入更新)** :这是默认操作,在该操作中,通过查找索引,首先将输入记录标记为插入或更新。
在运行启发式方法确定如何最好地将这些记录布局到存储上(如优化文件大小)之后,这些记录最终会被写入。
对于诸如数据库更改捕获之类的用例,建议该操作,因为输入几乎肯定包含更新。
- **INSERT(插入)** :就使用启发式方法确定文件大小而言,此操作与插入更新(UPSERT)非常相似,但此操作完全跳过了索引查找步骤。
因此,对于日志重复数据删除等用例(结合下面提到的过滤重复项的选项),它可以比插入更新快得多。
插入也适用于这种用例,这种情况数据集可以允许重复项,但只需要Hudi的事务写/增量提取/存储管理功能。
- **BULK_INSERT(批插入)** :插入更新和插入操作都将输入记录保存在内存中,以加快存储优化启发式计算的速度(以及其它未提及的方面)。
所以对Hudi数据集进行初始加载/引导时这两种操作会很低效。批量插入提供与插入相同的语义,但同时实现了基于排序的数据写入算法,
该算法可以很好地扩展到数百TB的初始加载。但是,相比于插入和插入更新能保证文件大小,批插入在调整文件大小上只能尽力而为。三种操作均可在写入时通过数据源选项指定,见下面的示意代码。
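下面是通过数据源 API 选择写操作的 Scala 示意(选项常量见后文 Datasource Writer 一节;`df`、`tableName`、`basePath` 为假设已定义的变量):
```Scala
import org.apache.hudi.DataSourceWriteOptions._
import org.apache.hudi.config.HoodieWriteConfig._
import org.apache.spark.sql.SaveMode._

df.write.format("org.apache.hudi").
  // 在 UPSERT / INSERT / BULK_INSERT 三种操作间切换
  option(OPERATION_OPT_KEY, BULK_INSERT_OPERATION_OPT_VAL).
  option(RECORDKEY_FIELD_OPT_KEY, "uuid").
  option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
  option(PRECOMBINE_FIELD_OPT_KEY, "ts").
  option(TABLE_NAME, tableName).
  mode(Append).
  save(basePath)
```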
## DeltaStreamer
`HoodieDeltaStreamer`实用工具 (hudi-utilities-bundle中的一部分) 提供了从DFS或Kafka等不同来源进行摄取的方式,并具有以下功能。
- 精确一次地从Kafka摄取新事件,以及从Sqoop[增量导入](https://sqoop.apache.org/docs/1.4.2/SqoopUserGuide.html#_incremental_imports)、HiveIncrementalPuller输出或DFS文件夹中的多个文件摄取数据
- 支持json、avro或自定义记录类型的传入数据
- 管理检查点,回滚和恢复
- 利用DFS或Confluent [schema注册表](https://github.com/confluentinc/schema-registry)的Avro模式。
- 支持自定义转换操作
命令行选项更详细地描述了这些功能:
```Java
[hoodie]$ spark-submit --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer `ls packaging/hudi-utilities-bundle/target/hudi-utilities-bundle-*.jar` --help
Usage: <main class> [options]
Options:
--commit-on-errors
Commit even when some records failed to be written
Default: false
--enable-hive-sync
Enable syncing to hive
Default: false
--filter-dupes
Should duplicate records from source be dropped/filtered outbefore
insert/bulk-insert
Default: false
--help, -h
--hudi-conf
Any configuration that can be set in the properties file (using the CLI
parameter "--propsFilePath") can also be passed command line using this
parameter
Default: []
--op
Takes one of these values : UPSERT (default), INSERT (use when input is
purely new data/inserts to gain speed)
Default: UPSERT
Possible Values: [UPSERT, INSERT, BULK_INSERT]
--payload-class
subclass of HoodieRecordPayload, that works off a GenericRecord.
Implement your own, if you want to do something other than overwriting
existing value
Default: org.apache.hudi.OverwriteWithLatestAvroPayload
--props
path to properties file on localfs or dfs, with configurations for
Hudi client, schema provider, key generator and data source. For
Hudi client props, sane defaults are used, but recommend use to
provide basic things like metrics endpoints, hive configs etc. For
sources, referto individual classes, for supported properties.
Default: file:///Users/vinoth/bin/hoodie/src/test/resources/delta-streamer-config/dfs-source.properties
--schemaprovider-class
subclass of org.apache.hudi.utilities.schema.SchemaProvider to attach
schemas to input & target table data, built in options:
FilebasedSchemaProvider
Default: org.apache.hudi.utilities.schema.FilebasedSchemaProvider
--source-class
Subclass of org.apache.hudi.utilities.sources to read data. Built-in
options: org.apache.hudi.utilities.sources.{JsonDFSSource (default),
AvroDFSSource, JsonKafkaSource, AvroKafkaSource, HiveIncrPullSource}
Default: org.apache.hudi.utilities.sources.JsonDFSSource
--source-limit
Maximum amount of data to read from source. Default: No limit For e.g:
DFSSource => max bytes to read, KafkaSource => max events to read
Default: 9223372036854775807
--source-ordering-field
Field within source record to decide how to break ties between records
with same key in input data. Default: 'ts' holding unix timestamp of
record
Default: ts
--spark-master
spark master to use.
Default: local[2]
* --target-base-path
base path for the target Hudi dataset. (Will be created if did not
exist first time around. If exists, expected to be a Hudi dataset)
* --target-table
name of the target table in Hive
--transformer-class
subclass of org.apache.hudi.utilities.transform.Transformer. UDF to
transform raw source dataset to a target dataset (conforming to target
schema) before writing. Default : Not set. E:g -
org.apache.hudi.utilities.transform.SqlQueryBasedTransformer (which
allows a SQL query template to be passed as a transformation function)
```
该工具采用由层次结构组成的属性文件,并具有可插拔的接口,用于提取数据、生成键和提供模式。
从Kafka和DFS摄取数据的示例配置在这里:`hudi-utilities/src/test/resources/delta-streamer-config`
例如:当您让Confluent Kafka、Schema注册表启动并运行后,可以用这个命令产生一些测试数据
[impressions.avro](https://docs.confluent.io/current/ksql/docs/tutorials/generate-custom-test-data.html)
由schema-registry代码库提供)
```Java
[confluent-5.0.0]$ bin/ksql-datagen schema=../impressions.avro format=avro topic=impressions key=impressionid
```
然后用如下命令摄取这些数据。
```Java
[hoodie]$ spark-submit --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer `ls packaging/hudi-utilities-bundle/target/hudi-utilities-bundle-*.jar` \
--props file://${PWD}/hudi-utilities/src/test/resources/delta-streamer-config/kafka-source.properties \
--schemaprovider-class org.apache.hudi.utilities.schema.SchemaRegistryProvider \
--source-class org.apache.hudi.utilities.sources.AvroKafkaSource \
--source-ordering-field impresssiontime \
--target-base-path file:///tmp/hudi-deltastreamer-op --target-table uber.impressions \
--op BULK_INSERT
```
在某些情况下,您可能需要预先将现有数据集迁移到Hudi。 请参考[迁移指南](migration_guide.html)
## Datasource Writer
`hudi-spark`模块提供了DataSource API,可以将任何DataFrame写入(也可以读取)到Hudi数据集中。
以下展示了在指定所需字段名称之后,如何插入更新DataFrame的方法,这些字段包括:
`recordKey => _row_key`、`partitionPath => partition`、`precombineKey => timestamp`。
```Java
inputDF.write()
.format("org.apache.hudi")
.options(clientOpts) // 可以传入任何Hudi客户端参数
.option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY(), "_row_key")
.option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY(), "partition")
.option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY(), "timestamp")
.option(HoodieWriteConfig.TABLE_NAME, tableName)
.mode(SaveMode.Append)
.save(basePath);
```
## 与Hive同步
上面的两个工具都支持将数据集的最新模式同步到Hive Metastore,以便查询新的列和分区。
如果需要从命令行或在独立的JVM中运行它,Hudi提供了一个`HiveSyncTool`
在构建了hudi-hive模块之后,可以按以下方式调用它。
```Java
cd hudi-hive
./run_sync_tool.sh
[hudi-hive]$ ./run_sync_tool.sh --help
Usage: <main class> [options]
Options:
* --base-path
Basepath of Hudi dataset to sync
* --database
name of the target database in Hive
--help, -h
Default: false
* --jdbc-url
Hive jdbc connect url
* --pass
Hive password
* --table
name of the target table in Hive
* --user
Hive username
```
## 删除数据
通过允许用户指定不同的数据记录负载实现,Hudi支持对存储在Hudi数据集中的数据执行两种类型的删除。
- **Soft Deletes(软删除)** :使用软删除时,用户希望保留键,但仅使所有其他字段的值都为空。
通过确保适当的字段在数据集模式中可以为空,并在将这些字段设置为null之后直接向数据集插入更新这些记录,即可轻松实现这一点。
- **Hard Deletes(硬删除)** :这种更强形式的删除是从数据集中彻底删除记录在存储上的任何痕迹。
这可以通过DataSource或DeltaStreamer触发一次插入更新来实现,所用的自定义负载实现总是返回Optional.Empty作为组合值。
Hudi附带了一个内置的`org.apache.hudi.EmptyHoodieRecordPayload`类,它就是实现了这一功能。
```Java
deleteDF // 仅包含要删除的记录的DataFrame
.write().format("org.apache.hudi")
.option(...) // 根据设置需要添加HUDI参数,例如记录键、分区路径和其他参数
// 指定record_key,partition_key,precombine_fieldkey和常规参数
 .option(DataSourceWriteOptions.PAYLOAD_CLASS_OPT_KEY, "org.apache.hudi.EmptyHoodieRecordPayload")
 .mode(SaveMode.Append) // 补全写出调用(示意)
 .save(basePath);
```
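作为对照,下面是软删除的一个 Scala 示意(`toDeleteDF`、业务字段名与写入参数均为假设):
```Scala
import org.apache.spark.sql.functions._

// 仅示意:保留键字段,将其余业务字段置空后按普通插入更新写回
val softDeleteDF = toDeleteDF.
  withColumn("rider", lit(null).cast("string")).
  withColumn("driver", lit(null).cast("string")).
  withColumn("fare", lit(null).cast("double"))
softDeleteDF.write.format("org.apache.hudi").
  options(hudiWriteOpts). // 记录键、分区路径等常规写参数(假设已定义)
  mode("append").
  save(basePath)
```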
## 存储管理
Hudi还对存储在Hudi数据集中的数据执行几个关键的存储管理功能。在DFS上存储数据的关键方面是管理文件大小和数量以及回收存储空间。
例如,HDFS在处理小文件上性能很差,这会对Name Node的内存及RPC施加很大的压力,并可能破坏整个集群的稳定性。
通常,查询引擎可在较大的列文件上提供更好的性能,因为它们可以有效地摊销获得列统计信息等的成本。
即使在某些云数据存储上,列出具有大量小文件的目录也常常比较慢。
以下是一些有效管理Hudi数据集存储的方法(列表后附有一个相关写入配置的示意代码)。
- Hudi中的[小文件处理功能](configurations.html#compactionSmallFileSize),可以分析传入的工作负载并将插入内容分配到现有文件组中,
而不是创建新文件组。新文件组会生成小文件。
- 可以[配置](configurations.html#retainCommits)Cleaner来清理较旧的文件片,清理的程度可以调整,
具体取决于查询所需的最长时间和增量拉取所需的回溯。
- 用户还可以调整[基础/parquet文件](configurations.html#limitFileSize)[日志文件](configurations.html#logFileMaxSize)的大小
和预期的[压缩率](configurations.html#parquetCompressionRatio),使足够数量的插入被分到同一个文件组中,最终产生大小合适的基础文件。
- 智能调整[批插入并行度](configurations.html#withBulkInsertParallelism),可以产生大小合适的初始文件组。
实际上,正确执行此操作非常关键,因为文件组一旦创建后就不能删除,只能如前所述对其进行扩展。
- 对于具有大量更新的工作负载,[读取时合并存储](concepts.html#merge-on-read-storage)提供了一种很好的机制,
可以快速将其摄取到较小的文件中,之后通过压缩将它们合并为较大的基础文件。
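下面是一个相关写入配置的 Scala 示意(键名可在上面链接的配置页中查到;数值仅为举例,`df`、`basePath` 为假设已定义的变量):
```Scala
df.write.format("org.apache.hudi").
  option("hoodie.parquet.max.file.size", (120 * 1024 * 1024).toString).    // 基础文件目标大小
  option("hoodie.parquet.small.file.limit", (100 * 1024 * 1024).toString). // 小于该值视为小文件
  option("hoodie.cleaner.commits.retained", "10").                         // Cleaner 保留最近10次提交
  // ... 记录键、分区路径等其余常规写参数
  mode("append").
  save(basePath)
```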
......@@ -61,6 +61,7 @@
<li><a href='doc/spark-220-doc-zh' target='_blank'>Spark 2.2.0 中文文档</a></li>
<li><a href='doc/storm-110-doc-zh' target='_blank'>Storm 1.1.0 中文文档</a></li>
<li><a href='doc/zeppelin-072-doc-zh' target='_blank'>Zeppelin 0.7.2 中文文档</a></li>
<li><a href='doc/hudi-050-doc-zh' target='_blank'>Hudi 0.5.0 中文文档</a></li>
<!-- toc end -->
</ul>
<h2>贡献指南</h2>
......