Parsing nginx logs with Hive using a SerDe

Posted by zsohkypk on 2021-06-02 in Hadoop

I am currently parsing a custom nginx log, using the following Hive script:

add jar s3://my-bucket-foo/hive-serde-0.13.1.jar;
SET hive.mapred.supports.subdirectories=true;
SET mapred.input.dir.recursive=true;
set hive.exec.compress.intermediate=true;
set mapred.compress.map.output=true;
set hive.exec.parallel=true;
set mapred.output.compression.codec=org.apache.hadoop.io.compress.BZip2Codec;

DROP TABLE nginx_logs ;
CREATE EXTERNAL TABLE nginx_logs (
IP STRING,
`Timestamp` STRING,
Verb STRING,
URL STRING,
HTTPVersion STRING,
RequestProcessingTime STRING,
ReceivedBytes STRING,
URLReferer STRING,
UserAgent STRING,
MSISDN STRING,
XCALL STRING,
ResponseCode STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (

"input.regex" = "(\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3})\\s+-\\s+-\\s+\\[(\\d{2}\\/[a-z]{3}\\/\\d{4}:\\d{2}:\\d{2}:\\d{2}\\s+)-\\d{4}\\]\\s+\"(GET)(.+)(http\\/1\\.1\")\\s+(\\d{1,}\\.\\d{3})\\s+(\\d+)\\s+\"([^\"]+)\"\\s+agent\\[\"([^\"]+)\"\\]\\s+-\\s+\\.\\s+msisdn\\[([^\\]]+)\\]\\s+xcall\\[([^\\]]+)\\]\\s+(\\d{1,}).*"
    )

LOCATION 's3n://my-bucket/EMRInput/';

Here is an example with some log lines and the regex applied to them: http://regex101.com/r/tw8yt5/1

Sample line:

192.168.0.143 - - [25/Sep/2014:19:17:40 -0300]  "GET /adserver/www/delivery/lg.php?bannerid=4512&campaignid=374&zoneid=40&loc=1&cb=2b674aefb7 HTTP/1.1" 0.000  43 "http://wap.tim.com.br/html5/" Agent["Mozilla/5.0 (Linux; U; Android 4.1.2; pt-br; LG-E467f Build/JZO54K) AppleWebKit/534.30 (KHTML, like Gecko) Version/4.0 Mobile Safari/534.30"] - . msisdn[-] xcall[552199999955] 200

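According to regex101, there are 12 match groups.

But whenever I run the query:
select * from nginx_logs limit 10;
I get an error telling me that the number of matching groups does not match the number of columns.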

hive> select * from nginx_logs limit 10;
OK
Failed with exception java.io.IOException:org.apache.hadoop.hive.serde2.SerDeException: Number of matching groups doesn't match the number of columns
Time taken: 0.036 seconds

I have just double-escaped the \ (backslashes), and now instead of the error I get:

hive> select * from nginx_logs limit 1;
OK
NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL
Time taken: 0.037 seconds, Fetched: 1 row(s)

Any ideas?

w41d8nur1#

After learning more about how the SerDe and Hive handle regexes, I realized that my first match group only accounted for a single IP:

(\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3})

This works in most cases, except for lines that carry two or more IPs (proxies and so on), so a line like

200.222.108.241, 200.222.108.241, 200.222.108.241 - - [04/Oct/2014:06:30:48 -0300]  "GET /wml/redirect/jogos.wml HTTP/1.1" 0.000  154 "-" Agent["SAMSUNG-GT-E2222L/1.0 NetFront/4.1 Profile/MIDP-2.0 Configuration/CLDC-1.1"] - . msisdn[-] xcall[-] 302

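does not match, and it gave us headaches. The fix came quickly: use a single group that matches everything from the start of the line up to the dash (-):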

([^-]*)\\s+-\\s+-\\s+\\[(\\d{2}\\/[a-zA-Z]{3}\\/\\d{4}:\\d{2}:\\d{2}:\\d{2}\\s+)-\\d{4}\\]\\s+\"(GET)(.+)(HTTP\\/1\\.1\")\\s+(\\d{1,}\\.\\d{3})\\s+(\\d+)\\s+\"([^\"]+)\"\\s+Agent\\[\"([^\"]+)\"\\]\\s+-\\s+\\.\\s+msisdn\\[([^\\]]+)\\]\\s+xcall\\[([^\\]]+)\\]\\s+(\\d{1,}).*
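To illustrate why the first group matters, here is a minimal sketch that compares the two leading groups on the start of both sample lines. The class name `PrefixGroupDemo` and the helper `firstGroup` are hypothetical, and only the beginning of the regex is used, so this is not the full pattern.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PrefixGroupDemo {
    // Line prefixes copied from the two sample log lines in this thread.
    static final String SINGLE_IP = "192.168.0.143 - - [25/Sep/2014:19:17:40 -0300]";
    static final String MULTI_IP =
        "200.222.108.241, 200.222.108.241, 200.222.108.241 - - [04/Oct/2014:06:30:48 -0300]";

    // Original first group: exactly one dotted-quad address.
    static final Pattern ONE_IP =
        Pattern.compile("(\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3})\\s+-\\s+-\\s+\\[");

    // Revised first group: everything before the first dash.
    static final Pattern UP_TO_DASH = Pattern.compile("([^-]*)\\s+-\\s+-\\s+\\[");

    // Returns group 1 if the pattern matches at the start of the line, else null.
    static String firstGroup(Pattern p, String line) {
        Matcher m = p.matcher(line);
        return m.lookingAt() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println("one IP group, single-IP line: " + firstGroup(ONE_IP, SINGLE_IP));
        System.out.println("one IP group, multi-IP line:  " + firstGroup(ONE_IP, MULTI_IP));
        System.out.println("up-to-dash,   single-IP line: " + firstGroup(UP_TO_DASH, SINGLE_IP));
        System.out.println("up-to-dash,   multi-IP line:  " + firstGroup(UP_TO_DASH, MULTI_IP));
    }
}
```

The single-address group returns null on the multi-IP line because a comma, not whitespace, follows the first address. The `[^-]*` group captures the whole comma-separated list, with the trailing space given back to `\s+` by backtracking.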

Tested in a Java main program, this regex works like a charm:
Matches? Yes

Group 1: 200.222.108.241, 200.222.108.241, 200.222.108.241
Group 2: 04/Oct/2014:06:30:48 
Group 3: GET
Group 4:  /wml/redirect/jogos.wml 
Group 5: HTTP/1.1"
Group 6: 0.000
Group 7: 154
Group 8: -
Group 9: SAMSUNG-GT-E2222L/1.0 NetFront/4.1 Profile/MIDP-2.0 Configuration/CLDC-1.1
Group 10: -
Group 11: -
Group 12: 302
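The test program itself is not shown. A minimal sketch of such a check might look like the following, where the class name `NginxRegexCheck` and the helper `parse` are hypothetical. It uses `Matcher.matches()`, on the assumption that the SerDe needs the pattern to cover the whole line.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NginxRegexCheck {
    // The revised regex from this answer, as a Java string literal.
    static final Pattern LINE = Pattern.compile(
        "([^-]*)\\s+-\\s+-\\s+\\[(\\d{2}\\/[a-zA-Z]{3}\\/\\d{4}:\\d{2}:\\d{2}:\\d{2}\\s+)-\\d{4}\\]"
        + "\\s+\"(GET)(.+)(HTTP\\/1\\.1\")\\s+(\\d{1,}\\.\\d{3})\\s+(\\d+)\\s+\"([^\"]+)\""
        + "\\s+Agent\\[\"([^\"]+)\"\\]\\s+-\\s+\\.\\s+msisdn\\[([^\\]]+)\\]"
        + "\\s+xcall\\[([^\\]]+)\\]\\s+(\\d{1,}).*");

    // The multi-IP sample line from this answer.
    static final String SAMPLE =
        "200.222.108.241, 200.222.108.241, 200.222.108.241 - - [04/Oct/2014:06:30:48 -0300]  "
        + "\"GET /wml/redirect/jogos.wml HTTP/1.1\" 0.000  154 \"-\" "
        + "Agent[\"SAMSUNG-GT-E2222L/1.0 NetFront/4.1 Profile/MIDP-2.0 Configuration/CLDC-1.1\"] "
        + "- . msisdn[-] xcall[-] 302";

    // Returns the captured groups if the whole line matches, otherwise null.
    static String[] parse(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            return null;
        }
        String[] groups = new String[m.groupCount()];
        for (int i = 1; i <= m.groupCount(); i++) {
            groups[i - 1] = m.group(i);
        }
        return groups;
    }

    public static void main(String[] args) {
        String[] groups = parse(SAMPLE);
        System.out.println("Matches? " + (groups != null ? "Yes" : "No"));
        if (groups != null) {
            for (int i = 0; i < groups.length; i++) {
                System.out.println("Group " + (i + 1) + ": " + groups[i]);
            }
        }
    }
}
```

Checking the group count as well as the match is useful here, since the table declares 12 columns and the regex has to produce exactly 12 capturing groups.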

Voilà, I can now run my queries.
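One more difference between the two regexes is worth checking. The version in the question uses lowercase `[a-z]{3}`, `http` and `agent`, while the revised one uses `[a-zA-Z]{3}`, `HTTP` and `Agent`. `java.util.regex` is case-sensitive by default, so if the original pattern was developed with a case-insensitive flag in an online tester (an assumption), it would match there but not in Java. The sketch below, with the hypothetical class name `CaseSensitivityCheck`, runs the question's regex against the question's own sample line with and without `Pattern.CASE_INSENSITIVE`.

```java
import java.util.regex.Pattern;

class CaseSensitivityCheck {
    // The regex from the question (lowercase month, "http" and "agent").
    static final String QUESTION_REGEX =
        "(\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3})\\s+-\\s+-\\s+"
        + "\\[(\\d{2}\\/[a-z]{3}\\/\\d{4}:\\d{2}:\\d{2}:\\d{2}\\s+)-\\d{4}\\]"
        + "\\s+\"(GET)(.+)(http\\/1\\.1\")\\s+(\\d{1,}\\.\\d{3})\\s+(\\d+)\\s+\"([^\"]+)\""
        + "\\s+agent\\[\"([^\"]+)\"\\]\\s+-\\s+\\.\\s+msisdn\\[([^\\]]+)\\]"
        + "\\s+xcall\\[([^\\]]+)\\]\\s+(\\d{1,}).*";

    // The single-IP sample line from the question.
    static final String SAMPLE =
        "192.168.0.143 - - [25/Sep/2014:19:17:40 -0300]  "
        + "\"GET /adserver/www/delivery/lg.php?bannerid=4512&campaignid=374&zoneid=40&loc=1&cb=2b674aefb7 HTTP/1.1\" "
        + "0.000  43 \"http://wap.tim.com.br/html5/\" "
        + "Agent[\"Mozilla/5.0 (Linux; U; Android 4.1.2; pt-br; LG-E467f Build/JZO54K) "
        + "AppleWebKit/534.30 (KHTML, like Gecko) Version/4.0 Mobile Safari/534.30\"] "
        + "- . msisdn[-] xcall[552199999955] 200";

    // Whole-line match of the question's regex under the given Pattern flags.
    static boolean matches(int flags) {
        return Pattern.compile(QUESTION_REGEX, flags).matcher(SAMPLE).matches();
    }

    public static void main(String[] args) {
        System.out.println("default flags:    " + matches(0));
        System.out.println("CASE_INSENSITIVE: " + matches(Pattern.CASE_INSENSITIVE));
    }
}
```

With default flags the match fails at the month name (`Sep` is not `[a-z]{3}`), which would fit the all-NULL row even for single-IP lines.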
