Is there an easier way to work with SQL data in a Pandas DataFrame?

Posted by 92dk7w1h on 2023-06-28 in Other


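Suppose I am working with a large database that holds the details of every book in a library. I want to get various statistics about the collection. For example, in this code I define a method that returns the top 10 most represented authors.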

import sqlite3
import pandas as pd

def most_owned_authors():
    db = 'database.db'
    conn = sqlite3.connect(db)
    cursor = conn.cursor()

    # Fetch one row per owned book; each row holds only the author name
    cursor.execute('''SELECT AUTHOR AS author FROM MAINTABLE WHERE OWNEDCHECKBOX = TRUE;''')

    authors_df = pd.DataFrame(cursor.fetchall())
    conn.close()

    authors = []
    author_dict = {}

    # Flatten each row to a plain string, then count occurrences per author
    for x in authors_df.iloc:
        authors.append(x.to_string(index=False))
    for x in authors:
        amount = authors.count(x)
        author_dict[x] = amount

    # Sort authors by count, highest first
    author_dict = dict(sorted(author_dict.items(), key=lambda item: item[1], reverse=True))

    top_10_owned_authors = {}

    # Keep only the first 10 entries
    for x, k in enumerate(author_dict):
        if x == 10: break
        top_10_owned_authors[k] = author_dict[k]

    return top_10_owned_authors

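Is there an easier way to generate statistics from a SQL query using SQLite3 and Pandas, or do I have to build the logic by hand as shown above? Can a SQL SUM/COUNT aggregate over multiple instances of the same entry, so that I only need to pull the index out of the DataFrame afterwards? Another example of the same idea would be using SQL and Pandas to produce a DataFrame of the books marked "read" together with the "year" in which they were read.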

pandas

Source: https://stackoverflow.com/questions/76548838/is-there-an-easier-way-to-work-with-sql-data-in-a-pandas-dataframe

1 Answer


46scxncf #1

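I think you can do this in two different ways:
1- If your database has many records and you don't want to fetch all of them before computing the statistics, it is better to push the load onto the database by writing a SQL query that does the work and computes the statistics. Example: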

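import sqlite3
import pandas as pd

conn = sqlite3.connect('database.db')
# Count owned books per author in SQL and return only the 10 largest groups
top_10_df = pd.read_sql('''SELECT AUTHOR AS author, COUNT(*) AS counts FROM MAINTABLE WHERE OWNEDCHECKBOX = TRUE GROUP BY AUTHOR ORDER BY counts DESC LIMIT 10;''', conn)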

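If the SQL is readable and not convoluted, this is a good solution! But if the logic is more complex and the table is not large, it is better to write the logic in a readable way in code (since that is easier to maintain).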
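To try the SQL-side approach without touching a real database, here is a minimal sketch that uses an in-memory SQLite database. The sample rows and the TITLE column are made up for illustration, and it assumes OWNEDCHECKBOX is stored as 0/1 (hence `= 1` instead of `= TRUE`, which older SQLite versions do not accept). The extra `author ASC` in the ORDER BY is only there to make ties come out in a stable order.

```python
import sqlite3
import pandas as pd

# In-memory stand-in for MAINTABLE; the rows below are invented sample data
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE MAINTABLE (TITLE TEXT, AUTHOR TEXT, OWNEDCHECKBOX INTEGER)")
rows = [
    ("Book 1", "Author A", 1),
    ("Book 2", "Author A", 1),
    ("Book 3", "Author B", 1),
    ("Book 4", "Author B", 0),  # not owned, so the WHERE clause filters it out
    ("Book 5", "Author C", 1),
    ("Book 6", "Author A", 1),
]
conn.executemany("INSERT INTO MAINTABLE VALUES (?, ?, ?)", rows)

# Let SQLite do the counting, sorting, and limiting
query = """
    SELECT AUTHOR AS author, COUNT(*) AS counts
    FROM MAINTABLE
    WHERE OWNEDCHECKBOX = 1
    GROUP BY AUTHOR
    ORDER BY counts DESC, author ASC
    LIMIT 10;
"""
top_10_df = pd.read_sql_query(query, conn)
conn.close()

print(top_10_df)
```

The resulting DataFrame already has one row per author with its count, so no further Python-side logic is needed.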
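2- If you can load all the data in your code (when the table is small and the query is not run frequently):
Here you can use `groupby`, `agg`, and `describe` to compute statistics on the loaded DataFrame (more information can be found here: Get statistics for each group (such as count, mean, etc) using pandas GroupBy?). See the following example: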

# setup
import pandas as pd
import random  # just to generate some data
# generating a sample author list 
authors = ['A{}'.format(random.randint(1,idx)) for idx in range(1,100)]
authors_df = pd.DataFrame({'authors': authors}) # a dataframe of list of authors with duplicates - so, the task is to select top 10 

# group by - count how many records refer to an author
agg_author_df = authors_df.groupby('authors').size().reset_index(name='counts')

# sort and limit to 10
agg_author_df.sort_values(by='counts', ascending=False, inplace=True)
top_10_df = agg_author_df.head(10)
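The same idea can be applied to data loaded from SQLite, including the "books read per year" case from the question. The following is only a sketch: the READCHECKBOX and YEARREAD column names are hypothetical (the question does not say how read status and year are stored), the sample rows are invented, and `value_counts` is used as a shorthand for the group-and-count step above.

```python
import sqlite3
import pandas as pd

# In-memory stand-in for MAINTABLE; READCHECKBOX and YEARREAD are assumed column names
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE MAINTABLE (TITLE TEXT, AUTHOR TEXT, OWNEDCHECKBOX INTEGER, "
    "READCHECKBOX INTEGER, YEARREAD INTEGER)"
)
rows = [
    ("Book 1", "Author A", 1, 1, 2021),
    ("Book 2", "Author A", 1, 1, 2022),
    ("Book 3", "Author B", 1, 0, None),  # owned but not read yet
    ("Book 4", "Author B", 0, 1, 2022),  # read but not owned
    ("Book 5", "Author C", 1, 1, 2022),
]
conn.executemany("INSERT INTO MAINTABLE VALUES (?, ?, ?, ?, ?)", rows)

# Load the whole (small) table once, then compute every statistic in pandas
books_df = pd.read_sql_query("SELECT * FROM MAINTABLE", conn)
conn.close()

# Top 10 authors among owned books; value_counts sorts by count, highest first
owned_df = books_df[books_df["OWNEDCHECKBOX"] == 1]
top_10_owned = owned_df["AUTHOR"].value_counts().head(10)

# Number of books read per year; the NULL year makes the column float, so cast back to int
read_df = books_df[books_df["READCHECKBOX"] == 1]
read_per_year = read_df["YEARREAD"].astype(int).value_counts().sort_index()

print(top_10_owned)
print(read_per_year)
```

Loading the table once and deriving several statistics from it fits the small-table case described above; for a large table the equivalent GROUP BY queries would keep the work in the database.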

Answered 2023-06-28
