python—是否可以选择特定的regex捕获组来在pandas中创建列?

h5qlskok  于 2021-05-27  发布在  Spark
关注(0)|答案(1)|浏览(369)

我目前正在尝试用python编写一个程序,它读取syslog并运行regex负载,以便将数据提取到dataframe中的列中,然后可以对其执行分析。
这是syslog文件的示例:

Feb  1 00:00:02 bridge kernel: INBOUND TCP: IN=br0 PHYSIN=eth0 OUT=br0 PHYSOUT=eth1 SRC=192.150.249.87 DST=11.11.11.84 LEN=40 TOS=0x00 PREC=0x00 TTL=110 ID=12973 PROTO=TCP SPT=220 DPT=6129 WINDOW=16384 RES=0x00 SYN URGP=0  
Feb  1 00:00:02 bridge kernel: INBOUND TCP: IN=br0 PHYSIN=eth0 OUT=br0 PHYSOUT=eth1 SRC=24.17.237.70 DST=11.11.11.95 LEN=40 TOS=0x00 PREC=0x00 TTL=113 ID=27095 PROTO=TCP SPT=220 DPT=6129 WINDOW=16384 RES=0x00 SYN URGP=0  
Feb  1 00:00:07 bridge kernel: INBOUND TCP: IN=br0 PHYSIN=eth0 OUT=br0 PHYSOUT=eth1 SRC=192.150.249.87 DST=11.11.11.85 LEN=40 TOS=0x00 PREC=0x00 TTL=110 ID=13801 PROTO=TCP SPT=220 DPT=6129 WINDOW=16384 RES=0x00 SYN URGP=0  
Feb  1 00:00:17 bridge kernel: INBOUND TCP: IN=br0 PHYSIN=eth0 OUT=br0 PHYSOUT=eth1 SRC=192.150.249.87 DST=11.11.11.87 LEN=40 TOS=0x00 PREC=0x00 TTL=110 ID=15432 PROTO=TCP SPT=220 DPT=6129 WINDOW=16384 RES=0x00 SYN URGP=0  
Feb  1 00:00:24 bridge kernel: INBOUND TCP: IN=br0 PHYSIN=eth0 OUT=br0 PHYSOUT=eth1 SRC=24.17.237.70 DST=11.11.11.100 LEN=40 TOS=0x00 PREC=0x00 TTL=113 ID=31168 PROTO=TCP SPT=220 DPT=6129 WINDOW=16384 RES=0x00 SYN URGP=0  
Feb  1 00:00:27 bridge kernel: INBOUND TCP: IN=br0 PHYSIN=eth0 OUT=br0 PHYSOUT=eth1 SRC=192.150.249.87 DST=11.11.11.89 LEN=40 TOS=0x00 PREC=0x00 TTL=110 ID=17292 PROTO=TCP SPT=220 DPT=6129 WINDOW=16384 RES=0x00 SYN URGP=0  
Feb  1 00:00:31 bridge kernel: INBOUND TCP: IN=br0 PHYSIN=eth0 OUT=br0 PHYSOUT=eth1 SRC=211.168.230.94 DST=11.11.11.70 LEN=48 TOS=0x00 PREC=0x00 TTL=110 ID=7204 DF PROTO=TCP SPT=1208 DPT=135 WINDOW=16384 RES=0x00 SYN URGP=0  
Feb  1 00:00:31 bridge kernel: INBOUND TCP: IN=br0 PHYSIN=eth0 OUT=br0 PHYSOUT=eth1 SRC=211.168.230.94 DST=11.11.11.72 LEN=48 TOS=0x00 PREC=0x00 TTL=110 ID=7206 DF PROTO=TCP SPT=1210 DPT=135 WINDOW=16384 RES=0x00 SYN URGP=0  
Feb  1 00:00:31 bridge kernel: INBOUND TCP: IN=br0 PHYSIN=eth0 OUT=br0 PHYSOUT=eth1 SRC=211.168.230.94 DST=11.11.11.64 LEN=48 TOS=0x00 PREC=0x00 TTL=110 ID=7198 DF PROTO=TCP SPT=1202 DPT=135 WINDOW=16384 RES=0x00 SYN URGP=0

到目前为止,我已经在pyspark中使用以下方法来加载数据文件:

raw_data_files = glob.glob('*.log')
base_df = spark.read.text(raw_data_files)

将数据提取到如下列中:

ts_pattern =r'((Jan|Feb|Mar|Apr|May|Jun|Jul|Apr|Sep|Oct|Nov|Dec)\s+(\d+)\s+((0[0-9]|1[0-9]|2[0-3])(\:)(0[0-9]|1[0-9]|2[0-9]|3[0-9]|4[0-9]|5[0-9])(\:)(0[0-9]|1[0-9]|2[0-9]|3[0-9]|4[0-9]|5[0-9])))'
mth_pattern = r'(Jan|Feb|Mar|Apr|May|Jun|Jul|Apr|Sep|Oct|Nov|Dec\s+\d+\s+)'
dy_pattern = r'\s+(\d+)\s+'
tme_pattern = r'(([01]?\d|2[0-3]|24(?=:00?:00?$)):([0-5]\d):([0-5]\d))'
src_pattern = r'((\sSRC.*?=)([0-9a-f\.]*))'
dst_pattern = r'((\sDST.*?=)([0-9a-f\.]*))'
dvc_pattern = r'((([01]?\d|2[0-3]|24(?=:00?:00?$)):([0-5]\d):([0-5]\d))\s)(\S+)'
svc_pattern = r'((([01]?\d|2[0-3]|24(?=:00?:00?$)):([0-5]\d):([0-5]\d))\s)(\S+)\s((?:(?!:).)*)'
ufw_pattern = r'(\bUFW\s\b)([^]\s]+)'
sts_pattern = r'((?<=stats:\s).*)'
in_pattern = r'((\bIN=\b)([^\s]+))'
physin_pattern = r'((\bPHYSIN=\b)([^\s]+))'
out_pattern = r'((\bOUT=\b)([^\s]+))'
physout_pattern = r'((\bPHYSOUT=\b)([^\s]+))'
mac_pattern = r'((\bMAC=\b)([^\s]+))'
len_pattern = r'((\bLEN=\b)([^\s]+))'
tos_pattern = r'((\bTOS=\b)([^\s]+))'
prec_pattern = r'((\bPREC=\b)([^\s]+))'
ttl_pattern = r'((\bTTL=\b)([^\s]+))'
id_pattern = r'((\bID=\b)([^\s]+))'
flag_pattern = r'(\b\sCE|DF|MF\s\b)'
prtc_pattern = r'((\bPROTO=\b)([^\s]+))'
spt_pattern = r'((\bSPT=\b)([^\s]+))'
dpt_pattern = r'((\bDPT=\b)([^\s]+))'
recseq_pattern = r'((\bSEQ=\b)([^\s]+))'
ackseq_pattern = r'((\bSEQ=\b)([^\s]+))'
win_pattern = r'((\bWINDOW=\b)([^\s]+))'
res_pattern = r'((\bRES=\b)([^\s]+))'
synurgp_pattern = r'((\bSYN\sURGP=\b)([^\s]+))'
status_pattern = r'((\bkernel:\s\b)(((eth|br|dev|Ker).*)))'

    # Combined regex
logs_df = base_df.select(regexp_extract('value', mth_pattern, 1).alias('month'),
                         regexp_extract('value', dy_pattern, 1).alias('day'),
                         regexp_extract('value', tme_pattern, 1).alias('time'),
                         regexp_extract('value', src_pattern, 3).alias('source'),
                         regexp_extract('value', dst_pattern, 3).alias('destination'),
                         regexp_extract('value', dvc_pattern, 6).alias('device'),
                         regexp_extract('value', svc_pattern, 7).alias('service'),
                         regexp_extract('value', ufw_pattern, 2).alias('ufw rule'),
                         regexp_extract('value', sts_pattern, 1).alias('stats event'),
                         regexp_extract('value', status_pattern, 3).alias('status event'),
                         regexp_extract('value', in_pattern, 3).alias('input'),
                         regexp_extract('value', physin_pattern, 3).alias('physinput'),
                         regexp_extract('value', out_pattern, 3).alias('output'),
                         regexp_extract('value', physout_pattern, 3).alias('physout'),
                         regexp_extract('value', mac_pattern, 3).alias('mac address'),
                         regexp_extract('value', len_pattern, 3).cast('integer').alias('length'),
                         regexp_extract('value', tos_pattern, 3).alias('typeofservice'),
                         regexp_extract('value', prec_pattern, 3).alias('precedence'),
                         regexp_extract('value', ttl_pattern, 3).cast('integer').alias('timetolive'),
                         regexp_extract('value', id_pattern, 3).cast('integer').alias('id'),
                         regexp_extract('value', flag_pattern, 1).alias('flag'),
                         regexp_extract('value', prtc_pattern, 3).alias('protocol'),
                         regexp_extract('value', spt_pattern, 3).cast('integer').alias('sourceport'),
                         regexp_extract('value', dpt_pattern, 3).cast('integer').alias('destinationport'),
                         regexp_extract('value', recseq_pattern, 3).cast('integer').alias('recseqnumber'),
                         regexp_extract('value', ackseq_pattern, 3).cast('integer').alias('ackseqnumber'),
                         regexp_extract('value', win_pattern, 3).cast('integer').alias('window'),
                         regexp_extract('value', res_pattern, 3).alias('reserved'),
                         regexp_extract('value', synurgp_pattern, 3).cast('integer').alias('synurgp'))

打印Dataframe:

+-----+---+--------+--------------+------------+------+-------+--------+-----------+------------+-----+---------+------+-------+-----------+------+-------------+----------+----------+-----+----+--------+----------+---------------+------------+------------+------+--------+-------+
|month|day|    time|        source| destination|device|service|ufw rule|stats event|status event|input|physinput|output|physout|mac address|length|typeofservice|precedence|timetolive|   id|flag|protocol|sourceport|destinationport|recseqnumber|ackseqnumber|window|reserved|synurgp|
+-----+---+--------+--------------+------------+------+-------+--------+-----------+------------+-----+---------+------+-------+-----------+------+-------------+----------+----------+-----+----+--------+----------+---------------+------------+------------+------+--------+-------+
|  Feb|  1|00:00:02|192.150.249.87| 11.11.11.84|bridge| kernel|        |           |            |  br0|     eth0|   br0|   eth1|           |    40|         0x00|      0x00|       110|12973|    |     TCP|       220|           6129|        null|        null| 16384|    0x00|      0|
|  Feb|  1|00:00:02|  24.17.237.70| 11.11.11.95|bridge| kernel|        |           |            |  br0|     eth0|   br0|   eth1|           |    40|         0x00|      0x00|       113|27095|    |     TCP|       220|           6129|        null|        null| 16384|    0x00|      0|
|  Feb|  1|00:00:07|192.150.249.87| 11.11.11.85|bridge| kernel|        |           |            |  br0|     eth0|   br0|   eth1|           |    40|         0x00|      0x00|       110|13801|    |     TCP|       220|           6129|        null|        null| 16384|    0x00|      0|
|  Feb|  1|00:00:17|192.150.249.87| 11.11.11.87|bridge| kernel|        |           |            |  br0|     eth0|   br0|   eth1|           |    40|         0x00|      0x00|       110|15432|    |     TCP|       220|           6129|        null|        null| 16384|    0x00|      0|
|  Feb|  1|00:00:24|  24.17.237.70|11.11.11.100|bridge| kernel|        |           |            |  br0|     eth0|   br0|   eth1|           |    40|         0x00|      0x00|       113|31168|    |     TCP|       220|           6129|        null|        null| 16384|    0x00|      0|
|  Feb|  1|00:00:27|192.150.249.87| 11.11.11.89|bridge| kernel|        |           |            |  br0|     eth0|   br0|   eth1|           |    40|         0x00|      0x00|       110|17292|    |     TCP|       220|           6129|        null|        null| 16384|    0x00|      0|
|  Feb|  1|00:00:31|211.168.230.94| 11.11.11.70|bridge| kernel|        |           |            |  br0|     eth0|   br0|   eth1|           |    48|         0x00|      0x00|       110| 7204|  DF|     TCP|      1208|            135|        null|        null| 16384|    0x00|      0|
|  Feb|  1|00:00:31|211.168.230.94| 11.11.11.72|bridge| kernel|        |           |            |  br0|     eth0|   br0|   eth1|           |    48|         0x00|      0x00|       110| 7206|  DF|     TCP|      1210|            135|        null|        null| 16384|    0x00|      0|
|  Feb|  1|00:00:31|211.168.230.94| 11.11.11.64|bridge| kernel|        |           |            |  br0|     eth0|   br0|   eth1|           |    48|         0x00|      0x00|       110| 7198|  DF|     TCP|      1202|            135|        null|        null| 16384|    0x00|      0|
|  Feb|  1|00:00:32|211.168.230.94| 11.11.11.69|bridge| kernel|        |           |            |  br0|     eth0|   br0|   eth1|           |    48|         0x00|      0x00|       110| 7203|  DF|     TCP|      1207|            135|        null|        null| 16384|    0x00|      0|
|  Feb|  1|00:00:32|211.168.230.94| 11.11.11.73|bridge| kernel|        |           |            |  br0|     eth0|   br0|   eth1|           |    48|         0x00|      0x00|       110| 7207|  DF|     TCP|      1211|            135|        null|        null| 16384|    0x00|      0|
|  Feb|  1|00:00:32|211.168.230.94| 11.11.11.75|bridge| kernel|        |           |            |  br0|     eth0|   br0|   eth1|           |    48|         0x00|      0x00|       110| 7209|  DF|     TCP|      1213|            135|        null|        null| 16384|    0x00|      0|
|  Feb|  1|00:00:32|211.168.230.94| 11.11.11.80|bridge| kernel|        |           |            |  br0|     eth0|   br0|   eth1|           |    48|         0x00|      0x00|       110| 7214|  DF|     TCP|      1218|            135|        null|        null| 16384|    0x00|      0|
|  Feb|  1|00:00:32|211.168.230.94| 11.11.11.67|bridge| kernel|        |           |            |  br0|     eth0|   br0|   eth1|           |    48|         0x00|      0x00|       110| 7201|  DF|     TCP|      1205|            135|        null|        null| 16384|    0x00|      0|
|  Feb|  1|00:00:32|211.168.230.94| 11.11.11.71|bridge| kernel|        |           |            |  br0|     eth0|   br0|   eth1|           |    48|         0x00|      0x00|       110| 7205|  DF|     TCP|      1209|            135|        null|        null| 16384|    0x00|      0|
|  Feb|  1|00:00:32|192.150.249.87| 11.11.11.90|bridge| kernel|        |           |            |  br0|     eth0|   br0|   eth1|           |    40|         0x00|      0x00|       110|18107|    |     TCP|       220|           6129|        null|        null| 16384|    0x00|      0|
|  Feb|  1|00:00:32|211.168.230.94| 11.11.11.70|bridge| kernel|        |           |            |  br0|     eth0|   br0|   eth1|           |    48|         0x00|      0x00|       110| 7235|  DF|     TCP|      1208|            135|        null|        null| 16384|    0x00|      0|
|  Feb|  1|00:00:32|211.168.230.94| 11.11.11.72|bridge| kernel|        |           |            |  br0|     eth0|   br0|   eth1|           |    48|         0x00|      0x00|       110| 7237|  DF|     TCP|      1210|            135|        null|        null| 16384|    0x00|      0|
|  Feb|  1|00:00:32|211.168.230.94| 11.11.11.69|bridge| kernel|        |           |            |  br0|     eth0|   br0|   eth1|           |    48|         0x00|      0x00|       110| 7234|  DF|     TCP|      1207|            135|        null|        null| 16384|    0x00|      0|
|  Feb|  1|00:00:32|211.168.230.94| 11.11.11.80|bridge| kernel|        |           |            |  br0|     eth0|   br0|   eth1|           |    48|         0x00|      0x00|       110| 7233|  DF|     TCP|      1218|            135|        null|        null| 16384|    0x00|      0|
+-----+---+--------+--------------+------------+------+-------+--------+-----------+------------+-----+---------+------+-------+-----------+------+-------------+----------+----------+-----+----+--------+----------+---------------+------------+------------+------+--------+-------+

这个方法工作得很好,我对结果很满意,但是,如果可能的话,我希望不使用apachespark就可以工作。这是一个很好的工具,但我认为对于这个用例来说可能太多了。
我曾尝试使用pandas通过将文件读取为fwf来获得类似的结果,但由于日志格式的原因,fwf并不理想,因为它将文件拆分为奇数列。如果我可以使用特定的正则表达式组(如pyspark中的正则表达式组)从Dataframe中提取数据,这不会是一个问题。我在Pandas身上找不到这种方法。或者无论如何,将所有regex编译成一个命令,并使用指定的组以这种方式创建列。
我尝试过将pandas数据框中的所有列合并为1,然后使用str.extract将数据提取到新的列中。但是这种方法只在正则表达式有1个捕获组时有效,这是不可能实现的,并且仍然可以得到相关的结果。
我还尝试将我所有的正则表达式模式组合成一个,并以这种方式提取数据,如下所示:

import re
import pandas as pd
import glob
import numpy as np

# define regex patterns

mth = r'(?P<month>Jan|Feb|Mar|Apr|May|Jun|Jul|Apr|Sep|Oct|Nov|Dec\s+\d+\s+)'
dy = r'\s+(?P<day>\d+)\s+'
tme = r'(?P<time>([01]?\d|2[0-3]|24(?=:00?:00?$)):([0-5]\d):([0-5]\d))'
src = r'\sSRC.*?=?(?P<source>[0-9a-f\.]*)'
dst = r'\sDST.*?=(?P<destination>[0-9a-f\.]*)'

# import all files

data = pd.read_fwf('/home/fred/Documents/PythonProjects/LogFileReader/SotM30-anton.log', header=None)
all_re = re.compile('({0}{1}{2}{3}{4})'.format(mth, dy, tme, src, dst))

# parse to dataframe

df = pd.DataFrame(data)

# format to single column

df = df.rename(columns={0 : 'c1', 1 : 'c2', 2 : 'c3', 3 : 'c4', 4 : 'c5'})
cols = ['c1', 'c2', 'c3', 'c4', 'c5']
df['combined'] = df[cols].apply(lambda row: ' '.join(row.values.astype(str)), axis=1)
print(df)

打印结果(df):

c1  c2        c3      c4       c5                                                  5                       combined
0       Feb   1  00:00:02  bridge  kernel:  INBOUND TCP: IN=br0 PHYSIN=eth0 OUT=br0 PHYSOU...  Feb 1 00:00:02 bridge kernel:
1       Feb   1  00:00:02  bridge  kernel:  INBOUND TCP: IN=br0 PHYSIN=eth0 OUT=br0 PHYSOU...  Feb 1 00:00:02 bridge kernel:
2       Feb   1  00:00:07  bridge  kernel:  INBOUND TCP: IN=br0 PHYSIN=eth0 OUT=br0 PHYSOU...  Feb 1 00:00:07 bridge kernel:
3       Feb   1  00:00:17  bridge  kernel:  INBOUND TCP: IN=br0 PHYSIN=eth0 OUT=br0 PHYSOU...  Feb 1 00:00:17 bridge kernel:
4       Feb   1  00:00:24  bridge  kernel:  INBOUND TCP: IN=br0 PHYSIN=eth0 OUT=br0 PHYSOU...  Feb 1 00:00:24 bridge kernel:
...     ...  ..       ...     ...      ...                                                ...                            ...
307519  Feb   7  14:35:06  bridge  kernel:  OUTG CONN TCP: IN=br0 PHYSIN=eth1 OUT=br0 PHYS...  Feb 7 14:35:06 bridge kernel:
307520  Feb   7  14:36:12  bridge  kernel:  INBOUND UDP: IN=br0 PHYSIN=eth0 OUT=br0 PHYSOU...  Feb 7 14:36:12 bridge kernel:
307521  Feb   7  14:39:04  bridge  kernel:  INBOUND UDP: IN=br0 PHYSIN=eth0 OUT=br0 PHYSOU...  Feb 7 14:39:04 bridge kernel:
307522  Feb   7  14:39:11  bridge  kernel:  INBOUND TCP: IN=br0 PHYSIN=eth0 OUT=br0 PHYSOU...  Feb 7 14:39:11 bridge kernel:
307523  Feb   7  14:40:06  bridge  kernel:  OUTG CONN TCP: IN=br0 PHYSIN=eth1 OUT=br0 PHYS...  Feb 7 14:40:06 bridge kernel:
new_df = df[['combined']].copy()
x = new_df['combined'].str.extract(all_re, expand=True)
print(x)

但这只会导致Dataframe中填充nan(打印结果(x)):

0 month  day time    4    5    6 source destination
0       NaN   NaN  NaN  NaN  NaN  NaN  NaN    NaN         NaN
1       NaN   NaN  NaN  NaN  NaN  NaN  NaN    NaN         NaN
2       NaN   NaN  NaN  NaN  NaN  NaN  NaN    NaN         NaN
3       NaN   NaN  NaN  NaN  NaN  NaN  NaN    NaN         NaN
4       NaN   NaN  NaN  NaN  NaN  NaN  NaN    NaN         NaN
...     ...   ...  ...  ...  ...  ...  ...    ...         ...
307519  NaN   NaN  NaN  NaN  NaN  NaN  NaN    NaN         NaN
307520  NaN   NaN  NaN  NaN  NaN  NaN  NaN    NaN         NaN
307521  NaN   NaN  NaN  NaN  NaN  NaN  NaN    NaN         NaN
307522  NaN   NaN  NaN  NaN  NaN  NaN  NaN    NaN         NaN
307523  NaN   NaN  NaN  NaN  NaN  NaN  NaN    NaN         NaN

它似乎得到了捕获组名并将其用作列标题,但由于没有得到任何值,所以出现了一些问题。
总之,我想使用正则表达式从syslog文件中提取数据,并将结果解析为dataframe中的列,以便对数据进行分析。我可以使用apachespark来实现这一点,但是我不希望这样。我想使用pandas,但不知道如何指定regex捕获组并像apachespark那样将它们一起编译。任何帮助都将不胜感激。

qcuzuvrc

qcuzuvrc1#

标准re库中的组词典可能对您有用:groupdict
一旦你有了匹配的字典,你应该能够遍历它们来建立一个表。
另请参阅re docs:writing a tokenizer中的tokenizer示例
希望这有帮助!

相关问题