Two methods for importing nginx logs into Hive

Method 1: create the table in Hive
- CREATE TABLE apachelog (ipaddress STRING, identd STRING, user STRING, finishtime STRING, requestline STRING, returncode INT, size INT, referer STRING, agent STRING)
  ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.dynamic_type.DynamicSerDe'
  WITH SERDEPROPERTIES (
    'serialization.format' = 'org.apache.hadoop.hive.serde2.thrift.TCTLSeparatedProtocol',
    'quote.delim' = '("|\\[|\\])',
    'field.delim' = ' ',
    'serialization.null.format' = '-')
  STORED AS TEXTFILE;
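The quote.delim/field.delim pair tells the SerDe to split on spaces while treating `"`, `[` and `]` as quoting characters that get stripped, which is why the imported record below has no brackets or quotes. A rough stand-alone model of that tokenization in Python (an illustration of the idea, not the SerDe's actual code):

```python
import re

raw = ('203.208.60.91 - - [05/May/2011:01:18:47 +0800] '
       '"GET /robots.txt HTTP/1.1" 404 1238')

# Treat [..] and ".." as quoted runs (quote.delim) whose delimiters are
# stripped, and split the rest on spaces (field.delim).
tokens = re.findall(r'\[([^\]]*)\]|"([^"]*)"|(\S+)', raw)
fields = [a or b or c for a, b, c in tokens]
print(fields)
```

Running this yields the same field layout the text shows for the imported data: the timestamp and request line survive as single fields, with their `[ ]` and `" "` delimiters removed.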
After import, a log record looks like:

203.208.60.91 - - 05/May/2011:01:18:47 +0800 GET /robots.txt HTTP/1.1 404 1238 Mozilla/5.0

This method supports Hive's parse_url(referer, "HOST") function.

Method 2: import with the contrib RegexSerDe

Note: with this method, after creating the table and before running any queries, first execute

hive> add jar /home/hjl/hive/lib/hive_contrib.jar;

or edit hive/conf/hive-default.conf and add the following property:

<property>
<name>hive.aux.jars.path</name>
<value>file:///usr/local/hadoop/hive/lib/hive-contrib-0.7.0-cdh3u0.jar</value>
</property>

and save the configuration.
- CREATE TABLE apilog20110505 (ipaddress STRING, identity STRING, user STRING, time STRING, request STRING, protocol STRING, status STRING, size STRING, referer STRING, agent STRING)
  ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
  WITH SERDEPROPERTIES (
    "input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*) ([^ ]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\".*\") ([^ \"]*|\".*\"))?",
    "output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s %10$s")
  STORED AS TEXTFILE;
A raw log line for this table looks like:

203.208.60.91 - - [05/May/2011:01:18:47 +0800] "GET /robots.txt HTTP/1.1" 404 1238 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

With this method every column has type string (shown as "string from deserializer"). In testing, parse_url(referer, "HOST") does not work on this table; you can extract the domain instead with:

select split(referer, "/")[2] from apilog;

If the data files are plain text, use STORED AS TEXTFILE; if the data should be compressed, use STORED AS SEQUENCEFILE.

Import the logs with:

hive> load data local inpath "/home/log/map.gz" overwrite into table log;

Compressed formats such as .gz are supported. Once the logs are imported you can analyze them. Example queries:

Count rows:
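The input.regex above can be sanity-checked outside Hive. A minimal Python sketch, where the pattern is the un-escaped form of the DDL string and fullmatch mirrors RegexSerDe's whole-line matching; it also shows why element 2 of split(referer, "/") is the domain, matching what parse_url(referer, "HOST") returns:

```python
import re
from urllib.parse import urlparse

# Un-escaped form of the input.regex from the CREATE TABLE statement.
LOG_RE = re.compile(
    r'([^ ]*) ([^ ]*) ([^ ]*) (-|\[[^\]]*\]) ([^ "]*|"[^"]*) ([^ ]*") '
    r'(-|[0-9]*) (-|[0-9]*)(?: ([^ "]*|".*") ([^ "]*|".*"))?'
)

line = ('203.208.60.91 - - [05/May/2011:01:18:47 +0800] '
        '"GET /robots.txt HTTP/1.1" 404 1238 "-" '
        '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"')

m = LOG_RE.fullmatch(line)  # RegexSerDe requires the whole line to match
print(m.group(1))   # ipaddress -> 203.208.60.91
print(m.group(7))   # status    -> 404

# The split(referer, "/")[2] trick: "http:", "", "host", "path", ...
referer = 'http://www.google.com/bot.html'
print(referer.split('/')[2])     # www.google.com
print(urlparse(referer).netloc)  # same host, as parse_url(referer, "HOST") would give
```

If a line fails to match (fullmatch returns None), RegexSerDe emits NULL columns for it, so checking sample lines against the pattern this way is a quick way to debug the regex before loading data.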
select count(*) from nginxlog;
Count distinct IPs:
select count(DISTINCT ip) from nginxlog;
Rank IPs by request count:
select t2.ip, t2.xx from (SELECT ip, COUNT(*) AS xx FROM nginxlog GROUP BY ip) t2 sort by t2.xx desc;
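The three statistics above have a simple in-memory analogue. A Python sketch over a made-up list of client IPs (hypothetical data, just to show the COUNT / DISTINCT / group-and-order logic):

```python
from collections import Counter

# Hypothetical client IPs, one entry per request line.
ips = ['10.0.0.1', '10.0.0.2', '10.0.0.1', '10.0.0.3', '10.0.0.1', '10.0.0.2']

print(len(ips))       # count(*)            -> 6
print(len(set(ips)))  # count(DISTINCT ip)  -> 3

# GROUP BY ip, COUNT(*), ordered by count descending:
ranking = Counter(ips).most_common()
print(ranking)        # [('10.0.0.1', 3), ('10.0.0.2', 2), ('10.0.0.3', 1)]
```

Note that Hive's sort by only orders within each reducer; for a single global ranking like most_common() gives here, use order by or a single reducer.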
hive> SELECT * FROM apachelog WHERE ipaddress = "216.211.123.184";

hive> SELECT ipaddress, COUNT(1) AS numrequest FROM apachelog GROUP BY ipaddress SORT BY numrequest DESC LIMIT 1;

hive> set mapred.reduce.tasks=2;
hive> SELECT ipaddress, COUNT(1) AS numrequest FROM apachelog GROUP BY ipaddress SORT BY numrequest DESC LIMIT 1;

hive> CREATE TABLE ipsummary (ipaddress STRING, numrequest INT);
hive> INSERT OVERWRITE TABLE ipsummary SELECT ipaddress, COUNT(1) FROM apachelog GROUP BY ipaddress;
hive> SELECT ipsummary.ipaddress, ipsummary.numrequest FROM (SELECT MAX(numrequest) AS themax FROM ipsummary) ipsummarymax JOIN ipsummary ON ipsummarymax.themax = ipsummary.numrequest;

Exporting Hive query results as CSV (untested):

hive> set hive.io.output.fileformat=CSVTextFile;
hive> insert overwrite local directory "/tmp/CSVrepos/" select * from S where ... ;