jopen
Published 9 years ago

Crunch: a Hadoop-based ETL extraction tool written in Go

Crunch is a Go-based toolkit that is fast to develop with and fast to run. It implements ETL and feature extraction on Hadoop.

Quick start

Crunch is optimized to be a big-bang-for-the-buck library, yet almost every aspect of it is extensible.

Let's say you have a log of semi-structured and deeply nested JSON. Each line contains a record.

You would like to:

  1. Parse JSON records
  2. Extract fields
  3. Clean up/process fields
  4. Extract features - run custom code on field values and output the result as new field(s)
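To make the four steps concrete before looking at Crunch's own API, here is a minimal sketch that uses only the Go standard library and none of Crunch. The helper names `lookup` and `processLine`, the choice of fields, and the "private IP" feature are assumptions made for illustration; they are not part of Crunch.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"os"
	"strings"
)

// lookup (hypothetical helper) walks a dotted path such as "action.source"
// through nested JSON objects. It returns def when a segment is missing or
// the leaf value is not a string.
func lookup(rec map[string]interface{}, path, def string) string {
	var cur interface{} = rec
	for _, key := range strings.Split(path, ".") {
		obj, ok := cur.(map[string]interface{})
		if !ok {
			return def
		}
		cur, ok = obj[key]
		if !ok {
			return def
		}
	}
	if s, ok := cur.(string); ok {
		return s
	}
	return def
}

// processLine (hypothetical helper) runs the four steps on one JSON record
// and returns a tab-separated row.
func processLine(line string) (string, error) {
	// 1. Parse the JSON record.
	var rec map[string]interface{}
	if err := json.Unmarshal([]byte(line), &rec); err != nil {
		return "", err
	}

	// 2. Extract fields, falling back to defaults for missing values.
	ip := lookup(rec, "head.x-forwarded-for", "0.0.0.0")
	source := lookup(rec, "action.source", "")

	// 3. Clean up fields.
	source = strings.ToLower(strings.TrimSpace(source))

	// 4. Extract a feature: a new field computed from an existing one.
	// The prefix check is a deliberately simplified stand-in for real logic.
	isPrivate := "false"
	if strings.HasPrefix(ip, "10.") || strings.HasPrefix(ip, "192.168.") {
		isPrivate = "true"
	}

	return strings.Join([]string{ip, source, isPrivate}, "\t"), nil
}

func main() {
	// Read one JSON record per line from stdin and write TSV to stdout.
	scanner := bufio.NewScanner(os.Stdin)
	for scanner.Scan() {
		row, err := processLine(scanner.Text())
		if err != nil {
			fmt.Fprintln(os.Stderr, "skipping bad record:", err)
			continue
		}
		fmt.Println(row)
	}
	if err := scanner.Err(); err != nil {
		fmt.Fprintln(os.Stderr, "read error:", err)
		os.Exit(1)
	}
}
```

Reading stdin line by line and writing TSV to stdout is the shape a Hadoop streaming job expects, which is why the sketch is organized this way.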


So here is a detailed view:

    // Describe your row
    transform := crunch.NewTransformer()
    row := crunch.NewRow()

    // Use "field_name type". Types are Hive types.
    row.FieldWithValue("ev_smp int", "1.0")

    // If no type given, assume 'string'
    row.FieldWithDefault("ip", "0.0.0.0", makeQuery("head.x-forwarded-for"), transform.AsIs)
    row.FieldWithDefault("ev_ts", "", makeQuery("action.timestamp"), transform.AsIs)
    row.FieldWithDefault("ev_source", "", makeQuery("action.source"), transform.AsIs)

    row.Feature("doing ip to location", []string{"country", "city"},
      func(r crunch.DataReader, row *crunch.Row)[]string{
        // call your "standard" Go code for doing ip2location
        return ip2location(row["ip"])
      })

    // By default, will build a hadoop-compatible streamer process that understands json: (stdin[JSON] to stdout[TSV])
    // Also will plug-in Crunch's CLI utility functions (use -help)
    crunch.ProcessJson(row)
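The snippet above calls `ip2location` without defining it. The following is a hedged sketch of what such a feature function might look like as plain Go, independent of Crunch. The lookup table, the `geo` type, and the sample entry are made-up placeholders (the address comes from a range reserved for documentation). The sketch also assumes that the returned slice should carry one value per declared feature field, in the same order ("country", then "city"), which the source implies but does not state.

```go
package main

import "fmt"

// geo is a hypothetical record type holding one lookup result.
type geo struct {
	country string
	city    string
}

// table is made-up sample data standing in for a real IP geolocation database.
var table = map[string]geo{
	"203.0.113.7": {country: "AU", city: "Sydney"},
}

// ip2location returns {country, city} for a known IP address.
// For an unknown address it returns two empty strings, so the output
// always has the same number of columns.
func ip2location(ip string) []string {
	if g, ok := table[ip]; ok {
		return []string{g.country, g.city}
	}
	return []string{"", ""}
}

func main() {
	for _, ip := range []string{"203.0.113.7", "0.0.0.0"} {
		loc := ip2location(ip)
		fmt.Printf("%s\t%s\t%s\n", ip, loc[0], loc[1])
	}
}
```

Returning a fixed-length slice even on a miss keeps every TSV row the same width, which matters when the output is loaded into a table with a fixed schema.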


 

Project homepage: http://www.open-open.com/lib/view/home/1416465525055
