需要帮助为以下日志格式创建架构。我使用athena来查询一堆具有这种格式的日志,但是我在为这种特定的日志格式创建表时遇到了一个问题。我尝试了几次不同的尝试,但都失败了。下面也是create table语句的最新尝试,但也不断失败。
CREATE TABLE "fastly_logs" (
"some_num" int,
"timestamp1" timestamp,
"cache_value" string,
"some_fastly" string,
"log_info" struct <
request_ipv4: string,
request_https: int,
request: string,
request_host: string,
request_uri: string,
http_code: int,
response: int,
req_header_size: int,
req_body_size: int,
resp_header_size: int,
resp_body_size: int,
content_type: string,
pop: string,
device_type: string,
referrer: string,
request_user_agent: string,
info_state: string,
hits: int,
tls_version: string,
>
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = '1',
'input.regex' = '^\D(\d{3})\D (\d{4}-\d{2}-\d{2}T.*Z) (cache-\D+{3}+\d{4,10}) (Prod-fastly\D+\d{6}\D) {(.*)}'
)
LOCATION 's3://prod-fastly/'
TBLPROPERTIES ('has_encrypted_data'='true')
日志格式:
<134>2020-02-10T16:04:22Z cache-dca17767 Prod-fastly[476113]: {"timestamp":"Mon, 10 Feb 2020 16:04:22 GMT", "request_ipv4":"111.111.111.111", "request_https":"true", "tls_version":"TLSv1.2", "request":"POST", "request_host":"test.example.com", "vp_stem":"www", "request_uri":"/this/ajax-test", "http_code":200, "response":"OK", "req_header_size":1215, "req_body_size":54, "resp_header_size":916, "resp_body_size":16, "content_type":"application/json", "pop":"DCA", "device_type":"", "referrer":"https://test.example.copm/some/path", "request_user_agent":"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.87 Safari/537.36", "info_state":"PASS", "hits":0 }
1条答案
按热度按时间64jmpszr1#
如果您知道哪些字符是可能的,我建议您不要使用\d。首先,在日期之前没有空格,所以这马上就不匹配了。应该是
<(\d+)>
^\D(\d{3})\D
你的约会看起来不错,但我不知道是什么形式timestamp
期望(\d{4}-\d{2}-\d{2}T.*Z)
缓存看起来很好,但是值似乎是十六进制的,所以应该使用cache-[0-9a-fA-F]{1,13}
,例如(cache-\D+{3}+\d{4,10})
同样,不要使用\d。最好是去Prod-fastly[\d+]
(Prod-fastly\D+\d{6}\D)
其余的都是json对象,是的,但是您不应该尝试解析它,所以它的类型必须是字符串。只是(.*)$
会做的;不要用花括号{(.*)}
我建议在字段之间留出一个以上的空间总共
^<(\d+)>(\d{4}-\d{2}-\d{2}T.*Z)\s+(cache-[0-9a-fA-F]{1,13})\s+(Prod-fastly\[\d+\])\s+(.*)$
您需要创建这个原始的纯文本表,然后可以使用struct创建另一个表,并在insert/select语句期间解析其中的json对象那就试试
SELECT json extract(remainder, "$.request_host") FROM fastly_logs;