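When I run this code and look at the output of info(), the DataFrame that uses the category dtype appears to take more space (932 bytes) than the DataFrame that uses the object dtype (624 bytes).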
import pandas as pd

def initData():
    # Build the sample frame, indexed by animal name.
    myPets = {"animal": ["cat", "alligator", "snake", "dog", "gerbil", "lion", "gecko", "hippopotamus", "parrot", "crocodile", "falcon", "hamster", "guinea pig"],
              "feel": ["furry", "rough", "scaly", "furry", "furry", "furry", "rough", "rough", "feathery", "rough", "feathery", "furry", "furry"],
              "where lives": ["indoor", "outdoor", "indoor", "indoor", "indoor", "outdoor", "indoor", "outdoor", "indoor", "outdoor", "outdoor", "indoor", "indoor"],
              "risk": ["safe", "dangerous", "dangerous", "safe", "safe", "dangerous", "safe", "dangerous", "safe", "dangerous", "safe", "safe", "safe"],
              "favorite food": ["treats", "fish", "bugs", "treats", "grain", "antelope", "bugs", "antelope", "grain", "fish", "rabbit", "grain", "grain"],
              "want to own": [1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1]}
    petDF = pd.DataFrame(myPets)
    petDF = petDF.set_index("animal")
    #print(petDF.info())
    #petDF.head(100)
    return petDF

def addCategoryColumns(myDF):
    # Add a category-typed copy of each string column.
    myDF["cat_feel"] = myDF["feel"].astype("category")
    myDF["cat_where_lives"] = myDF["where lives"].astype("category")
    myDF["cat_risk"] = myDF["risk"].astype("category")
    myDF["cat_favorite_food"] = myDF["favorite food"].astype("category")
    return myDF

objectsDF = initData()
categoriesDF = initData()
categoriesDF = addCategoryColumns(categoriesDF)
# Drop the original object columns so only the category versions remain.
categoriesDF = categoriesDF.drop(["feel", "where lives", "risk", "favorite food"], axis=1)
print(objectsDF.info())
print(categoriesDF.info())
categoriesDF.head()
<class 'pandas.core.frame.DataFrame'>
Index: 13 entries, cat to guinea pig
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   feel           13 non-null     object
 1   where lives    13 non-null     object
 2   risk           13 non-null     object
 3   favorite food  13 non-null     object
 4   want to own    13 non-null     int64
dtypes: int64(1), object(4)
memory usage: 624.0+ bytes
None
<class 'pandas.core.frame.DataFrame'>
Index: 13 entries, cat to guinea pig
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   want to own        13 non-null     int64
 1   cat_feel           13 non-null     category
 2   cat_where_lives    13 non-null     category
 3   cat_risk           13 non-null     category
 4   cat_favorite_food  13 non-null     category
dtypes: category(4), int64(1)
memory usage: 932.0+ bytes
None
1 Answer
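Numeric data such as int, float, and category (whose values are stored as integer codes) lives in NumPy arrays. Once you put 20,000 or a million rows in one, the bookkeeping overhead becomes negligible, and you will find that memory usage is exactly 8 × num_elements, or a fraction of that for dtypes narrower than 64 bits. With only 13 rows, the fixed cost of each category column, which also has to store its own list of categories, dominates the total.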
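The scaling claim can be checked with a rough sketch like the one below. It assumes pandas and NumPy are installed and a 64-bit build; the column contents and the names `n`, `int_col`, `feel_obj`, and `feel_cat` are made up for illustration, and the exact size of the category overhead may differ between pandas versions.

```python
import numpy as np
import pandas as pd

n = 1_000_000  # large enough that fixed per-column overhead stops mattering

# One int64 column: 8 bytes per element.
int_col = pd.Series(np.zeros(n, dtype=np.int64))
int_bytes = int_col.memory_usage(index=False)

# Four labels repeated n times, held as Python object pointers.
labels = np.array(["furry", "rough", "scaly", "feathery"], dtype=object)
feel_obj = pd.Series(labels[np.arange(n) % 4], dtype=object)
obj_bytes = feel_obj.memory_usage(index=False)  # pointers only, strings not counted

# The same data as a category: small integer codes plus one copy of the categories.
feel_cat = feel_obj.astype("category")
codes_bytes = feel_cat.cat.codes.to_numpy().nbytes
cat_bytes = feel_cat.memory_usage(index=False)

print("int64 column:      ", int_bytes)
print("object column:     ", obj_bytes)
print("category codes:    ", codes_bytes)
print("category column:   ", cat_bytes)
```

With only four distinct labels the codes fit in one byte each, so at this size the category column should come out far smaller than the object column, the reverse of what the 13-row frame shows.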
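In contrast, the "object" dtype holds pointers to externally allocated memory regions, typically str objects. NumPy and pandas therefore report only the size of the pointer array, which is 8 × num_elements with 64-bit addresses, and adding up all those external allocations is left to you. Calling sys.getsizeof recursively, or using pympler, gives a better picture of total memory consumption. Alternatively, use psutil to ask the operating system about memory usage before and after a large allocation.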
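As a minimal sketch of "adding up the external allocations", the snippet below compares the shallow figure with pandas' own deep accounting and with a hand-rolled sys.getsizeof sum. It reuses the `feel` values from the question; the variable names are illustrative, it assumes a 64-bit build, and the byte counts depend on the Python and pandas versions, so none are shown.

```python
import sys
import pandas as pd

feel = ["furry", "rough", "scaly", "furry", "furry", "furry", "rough",
        "rough", "feathery", "rough", "feathery", "furry", "furry"]
obj_col = pd.Series(feel, dtype=object)

# Shallow: only the array of 8-byte pointers is counted.
shallow = obj_col.memory_usage(index=False)

# Deep: pandas also inspects each referenced Python object.
deep = obj_col.memory_usage(index=False, deep=True)

# Manual estimate: pointer array plus the size of every referenced string.
# A string shared by several rows is counted once per row here.
manual = shallow + sum(sys.getsizeof(s) for s in obj_col)

print("shallow:", shallow)
print("deep:   ", deep)
print("manual: ", manual)
```

On a whole DataFrame, `info(memory_usage="deep")` reports the deep figure and drops the trailing `+` that marks the default number as a lower bound.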