Python: Why are the following objects numpy strings instead of datetime.datetime objects?

Posted by ymzxtsji on 2023-01-01 in Python


I have a CSV file laid out as follows:

Person,Date1,Date2,Status
Person1,12/10/11,17/10/11,Done
...

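I want to perform various operations on it. I start by pulling it into Python and converting the date strings to datetime.datetime objects, using the following code: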

import re
import numpy as np
from datetime import datetime, timedelta
from dateutil import rrule

def get_data(csv_file = '/home/garry/Desktop/complaints/input.csv'):
    inp = np.genfromtxt(csv_file,
        delimiter=',',
        filling_values = None,
        dtype = None)

    date = re.compile(r'\d+/\d+/\d+')
    count = 0
    item_count = 0

    for line in inp:
        for item in line:
            if re.match(date, item):
                item = datetime.strptime(item, '%d/%m/%y')
                inp[count][item_count] = item
                item_count += 1
            else:
                item_count += 1
        item_count = 0
        count += 1

    return inp

def get_teams(data):
    team_list = []
    for line in data:
        if line[0] not in team_list:
            team_list.append(line[0])
        else:
            pass
    del team_list[0]
    return team_list

def get_months():
    month_list = []
    months = [1,2,3,4,5,6,7,8,9,10,11,12]
    now = datetime.now()
    start_month = now.month - 7
    for count in range(0,7):
        if months[start_month] > now.month:
            year = now.year - 1
        else:
            year = now.year
        month_list.append([months[start_month], year])
        start_month += 1
    return month_list

if __name__ == "__main__":
    inp = get_data()
    for item in inp[2]:
        print type(item)
    team_list = get_teams(inp)
    month_list = get_months()

The print statement in the main block (inserted for debugging) prints:

<type 'numpy.string_'>
<type 'numpy.string_'>
<type 'numpy.string_'>
<type 'numpy.string_'>

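This is obviously not what I was hoping for, since the loop in the get_data() function should change the date strings into datetime.datetime objects. When I run the same code from the loop on a single date string as a test, the type converts fine. In the code above the conversion also works in a sense, since the strings do change to the datetime.datetime format; they just aren't the right type. Can anyone see what I'm doing wrong?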

python

Source: https://stackoverflow.com/questions/12028638/why-are-the-following-objects-numpy-strings-instead-of-datetime-datetime-objects

2 Answers


2jcobegt1#

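The problem is that numpy arrays have a fixed type. Numpy stores data in fixed-size contiguous blocks of memory, so when you assign a value to an index in a numpy array, numpy converts it before storing it in the array. It does this even with string arrays. For example: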

>>> a = numpy.array(['xxxxxxxxxx'] * 10)
>>> for index, datum in enumerate(a):
...     print datum, a[index], type(a[index])
...     a[index] = 5
...     print datum, a[index], type(a[index])
... 
xxxxxxxxxx xxxxxxxxxx <type 'numpy.string_'>
xxxxxxxxxx 5 <type 'numpy.string_'>
xxxxxxxxxx xxxxxxxxxx <type 'numpy.string_'>
xxxxxxxxxx 5 <type 'numpy.string_'>
xxxxxxxxxx xxxxxxxxxx <type 'numpy.string_'>
xxxxxxxxxx 5 <type 'numpy.string_'>
xxxxxxxxxx xxxxxxxxxx <type 'numpy.string_'>
xxxxxxxxxx 5 <type 'numpy.string_'>
xxxxxxxxxx xxxxxxxxxx <type 'numpy.string_'>
xxxxxxxxxx 5 <type 'numpy.string_'>
xxxxxxxxxx xxxxxxxxxx <type 'numpy.string_'>
xxxxxxxxxx 5 <type 'numpy.string_'>
xxxxxxxxxx xxxxxxxxxx <type 'numpy.string_'>
xxxxxxxxxx 5 <type 'numpy.string_'>
xxxxxxxxxx xxxxxxxxxx <type 'numpy.string_'>
xxxxxxxxxx 5 <type 'numpy.string_'>
xxxxxxxxxx xxxxxxxxxx <type 'numpy.string_'>
xxxxxxxxxx 5 <type 'numpy.string_'>
xxxxxxxxxx xxxxxxxxxx <type 'numpy.string_'>
xxxxxxxxxx 5 <type 'numpy.string_'>

Conveniently (or inconveniently!), datetime.datetime objects can be converted using str, so in this line...

inp[count][item_count] = item

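... numpy simply converts the item to a string and inserts it into the array.
Now, you can get around this behavior by using dtype=object, but doing so negates much of numpy's speed, because you're forcing numpy to call a bunch of slow Python code.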

>>> a = numpy.array(['xxxxxxxxxx'] * 10, dtype=object)
>>> for index, datum in enumerate(a):
...     print datum, a[index], type(a[index])
...     a[index] = 5
...     print datum, a[index], type(a[index])
... 
xxxxxxxxxx xxxxxxxxxx <type 'str'>
xxxxxxxxxx 5 <type 'int'>
xxxxxxxxxx xxxxxxxxxx <type 'str'>
xxxxxxxxxx 5 <type 'int'>
xxxxxxxxxx xxxxxxxxxx <type 'str'>
xxxxxxxxxx 5 <type 'int'>
xxxxxxxxxx xxxxxxxxxx <type 'str'>
xxxxxxxxxx 5 <type 'int'>
xxxxxxxxxx xxxxxxxxxx <type 'str'>
xxxxxxxxxx 5 <type 'int'>
xxxxxxxxxx xxxxxxxxxx <type 'str'>
xxxxxxxxxx 5 <type 'int'>
xxxxxxxxxx xxxxxxxxxx <type 'str'>
xxxxxxxxxx 5 <type 'int'>
xxxxxxxxxx xxxxxxxxxx <type 'str'>
xxxxxxxxxx 5 <type 'int'>
xxxxxxxxxx xxxxxxxxxx <type 'str'>
xxxxxxxxxx 5 <type 'int'>
xxxxxxxxxx xxxxxxxxxx <type 'str'>
xxxxxxxxxx 5 <type 'int'>
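
The same effect can be reproduced with an actual datetime value. The following is a minimal sketch, assuming Python 3 and a recent numpy (the session above is Python 2); the names parsed, fixed and flexible are made up for illustration and do not appear in the question.

```python
from datetime import datetime

import numpy as np

# A datetime parsed the same way as in the question's get_data().
parsed = datetime.strptime('12/10/11', '%d/%m/%y')

# Fixed-width string array: every element has the same character width,
# so anything assigned to it is first converted to text.
fixed = np.array(['12/10/11', '17/10/11'])
fixed[0] = parsed
print(repr(fixed[0]), type(fixed[0]))

# Object array: elements are references to arbitrary Python objects,
# so the datetime is stored unchanged.
flexible = np.array(['12/10/11', '17/10/11'], dtype=object)
flexible[0] = parsed
print(repr(flexible[0]), type(flexible[0]))
```

In the first array the element stays a numpy string, and the stored text is cut to the array's element width; in the second the element is the datetime itself.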

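I would add that you aren't using numpy to its full potential here. Numpy is designed to work on arrays in a vectorized way, without explicit for loops. (See the tutorial for details.) So whenever you process numpy data with a for loop, it's natural to ask how you could avoid doing so. Rather than pointing out the problems in your code, I'll show you something interesting you can do: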

>>> numpy.genfromtxt('input.csv', delimiter=',', dtype=None, names=True)
array([('Person1', '12/10/11', '17/10/11', 'Done'),
       ('Person1', '12/10/11', '17/10/11', 'Done'),
       ('Person1', '12/10/11', '17/10/11', 'Done'),
       ('Person1', '12/10/11', '17/10/11', 'Done'),
       ('Person1', '12/10/11', '17/10/11', 'Done'),
       ('Person1', '12/10/11', '17/10/11', 'Done')], 
      dtype=[('Person', '|S7'), ('Date1', '|S8'), 
             ('Date2', '|S8'), ('Status', '|S4')])
>>> a = numpy.genfromtxt('input.csv', delimiter=',', dtype=None, names=True)
>>> a['Status']
array(['Done', 'Done', 'Done', 'Done', 'Done', 'Done'], 
      dtype='|S4')
>>> a['Date1']
array(['12/10/11', '12/10/11', '12/10/11', '12/10/11', '12/10/11',
       '12/10/11'], 
      dtype='|S8')

Now you can access the dates directly instead of looping over the table with a regex.
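
As one possible follow-up, the date columns can be turned into numpy datetime64 arrays and subtracted as whole columns. This is only a sketch under assumptions: it targets Python 3, it reads a hypothetical two-row CSV from memory instead of input.csv, and the helper to_days is invented here. The parsing step is still a Python-level loop; only the subtraction is vectorized.

```python
import io
from datetime import datetime

import numpy as np

# Hypothetical in-memory stand-in for input.csv (the second row is made up).
CSV_TEXT = """Person,Date1,Date2,Status
Person1,12/10/11,17/10/11,Done
Person2,01/11/11,03/11/11,Open
"""

table = np.genfromtxt(io.StringIO(CSV_TEXT), delimiter=',', dtype=None,
                      names=True, encoding='utf-8')


def to_days(column):
    """Parse a column of dd/mm/yy strings into a datetime64[D] array."""
    parsed = [datetime.strptime(text, '%d/%m/%y') for text in column]
    # Build at microsecond precision first, then reduce to whole days.
    return np.array(parsed, dtype='datetime64[us]').astype('datetime64[D]')


start = to_days(table['Date1'])
end = to_days(table['Date2'])

# Column-wise subtraction gives a timedelta64[D] array; cast to plain ints.
durations = (end - start).astype(int)
print(durations)
```

Here durations holds the number of days between Date1 and Date2 for each row, computed without indexing individual cells.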


f0brbegy2#

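The problem is that the inp array defined in get_data gets a "|S8" dtype from np.genfromtxt. If you try to replace one of its elements with another object, that object is converted to a string.
A first idea would be to turn inp into a list with inp.tolist(); that way you could change the type of each field as needed. But there is a better way (I think):
Judging from your example, the second and third columns are always dates, right? Then you can convert the strings to datetime objects directly with np.genfromtxt: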

np.genfromtxt(csv_file,
              delimiter=",",
              dtype=None,
              names=True,
              converters={1:lambda d:datetime.strptime(d,"%d/%m/%y"),
                          2:lambda d:datetime.strptime(d,"%d/%m/%y")})

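names=True means you get a structured ndarray as output, with field names taken from the first non-commented line (here Person,Date1,Date2,Status). As you guessed, the converters keyword converts the strings in the second and third columns (indices 1 and 2) to datetime objects.
Note that if you already know your first and last columns are strings, you may want to use a dtype other than None: np.genfromtxt works faster if it doesn't have to guess the type of each column.
Now, another remark: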
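
For reference, a Python 3 variant of the converters call might look like the sketch below. It rests on a few assumptions: encoding='utf-8' is passed so the converter receives str rather than bytes, the CSV is a hypothetical two-row sample read from memory, and parse_date is a helper name introduced here in place of the lambdas.

```python
import io
from datetime import datetime, timedelta

import numpy as np

# Hypothetical in-memory stand-in for the CSV file (the second row is made up).
CSV_TEXT = """Person,Date1,Date2,Status
Person1,12/10/11,17/10/11,Done
Person2,01/11/11,03/11/11,Open
"""


def parse_date(text):
    """Convert a dd/mm/yy string into a datetime.datetime."""
    return datetime.strptime(text, '%d/%m/%y')


data = np.genfromtxt(io.StringIO(CSV_TEXT),
                     delimiter=',',
                     dtype=None,
                     names=True,
                     encoding='utf-8',
                     converters={1: parse_date, 2: parse_date})

# The date fields now hold datetime objects, so date arithmetic works per row.
for row in data:
    print(row['Person'], type(row['Date1']).__name__,
          row['Date2'] - row['Date1'])
```

The two date fields end up with an object dtype, which is what lets them hold datetime values while the other fields remain strings.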

Rather than keeping counters in your for loops, use something like for (i, item) in enumerate(whatever); it's simpler.
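
Combined with the inp.tolist() idea mentioned earlier, the question's loop could be rewritten roughly as follows. This is a sketch, not the asker's code: it assumes Python 3 and works on plain nested lists (such as inp.tolist() might return), and the names DATE_PATTERN, convert_dates and rows are made up for illustration.

```python
import re
from datetime import datetime

# Same date pattern as in the question, compiled once.
DATE_PATTERN = re.compile(r'\d+/\d+/\d+')


def convert_dates(rows):
    """Replace dd/mm/yy strings with datetime objects, in place."""
    for i, row in enumerate(rows):
        for j, item in enumerate(row):
            # enumerate supplies the indices, so no manual counters are needed.
            if DATE_PATTERN.fullmatch(item):
                rows[i][j] = datetime.strptime(item, '%d/%m/%y')
    return rows


# Hypothetical rows in the shape of the question's CSV data.
rows = [
    ['Person1', '12/10/11', '17/10/11', 'Done'],
    ['Person2', '01/11/11', '03/11/11', 'Open'],
]
convert_dates(rows)
print(type(rows[0][1]), rows[0][1])
```

Because Python lists hold arbitrary objects, the datetime values keep their type after assignment, unlike in the fixed-dtype string array.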

