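Background:

I am building a platform in Python where users (laypeople) can upload their own data to a database. The user selects an Excel file, and Python creates multiple DataFrames, each of which is stored in its corresponding table in an MS SQL Server database.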

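Situation:

I create 12 different DataFrames from the Excel file and store them in the MS SQL database. The file has about 50,000 rows and 150 columns (16 MB in total). The code works fine but is not time-efficient: uploading just these 12 frames to the database takes about 2-3 minutes. A test run on a larger file (about 50 MB) took 7 minutes to upload the 12 frames.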

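Where I need help:

Is there any way to speed up storing the data in the database? Ideally this should take a few seconds, not minutes. I have tried the following libraries, with the results shown below.

**Connection string and loading the data into DataFrames:**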

import urllib.parse

import pandas as pd
import sqlalchemy as sa

#Connection String
connection_string = f"""
    DRIVER={{{DRIVER_NAME}}};
    SERVER={{{SERVER_NAME}}};
    DATABASE={{{DATABASE_NAME}}};
    uid=XYZ;
    pwd=XYZ;
    Trust_Connection=yes;
    ColumnEncryption=Enabled;
    """

#Connection to Database
params=urllib.parse.quote_plus(connection_string)
engine = sa.create_engine("mssql+pyodbc:///?odbc_connect={}".format(params), fast_executemany=True)
con=engine.connect()

#DataFrame 1 to be stored in DB table_1 of DB
df_Addr = pd.read_excel(excel_file, sheet_name = "Address_Details")

#DataFrame 2 to be stored in DB table_2 of DB
df_Bank = pd.read_excel(excel_file, sheet_name = "Bank_Details")

.
.
.

#DataFrame 12 to be stored in DB table_12 of DB
df_N = pd.read_excel(excel_file, sheet_name = "N_Details")

***Option 1:*** Using SQLAlchemy

#Saving Frame 1 in Table 1
saving_query_Address='DQ_Raw_Address'
df_Addr.to_sql(saving_query_Address,engine,schema="dbo",if_exists='append',index=False, chunksize = 5000, dtype={'NAME1': sa.types.NVARCHAR(length=100), 'CITY1': sa.types.NVARCHAR(length=100), 'STREET': sa.types.NVARCHAR(length=100)})

#Saving Frame 2 in Table 2
saving_query_Bank='DQ_Raw_Bank'
df_Bank.to_sql(saving_query_Bank,engine,schema="dbo",if_exists='append',index=False, chunksize = 5000, dtype={'_COMMENT':sa.types.VARCHAR(length=100),'_ACTION_CODE':sa.types.VARCHAR(length=100),'SOURCE_ID':sa.types.VARCHAR(length=100),'BKVID':sa.types.VARCHAR(length=100),'PARTNER':sa.types.VARCHAR(length=100),'BANKS':sa.types.VARCHAR(length=100),'IBAN':sa.types.VARCHAR(length=100),'ACCOUNT_ID':sa.types.VARCHAR(length=50),'CHECK_DIGIT':sa.types.VARCHAR(length=50),'ACCOUNT_TYPE':sa.types.VARCHAR(length=50),'BP_EEW_BUT0BK':sa.types.VARCHAR(length=50)})

#The logic follows for the remaining 10 Tables as well with the same settings.

#Total Time Taken: 130 seconds
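
The total above does not show how the time splits between reading the Excel sheets and writing to SQL Server. A minimal sketch for measuring each step separately, using only the standard library, might look like this. The `timed` helper and the labels are hypothetical names, not part of the original code, and the two timed bodies are stand-ins for the real `pd.read_excel` and `to_sql` calls:

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Record the wall-clock duration of the enclosed block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


timings = {}

with timed("read", timings):
    # Stand-in for: df_Addr = pd.read_excel(excel_file, sheet_name="Address_Details")
    rows = [(i, f"name {i}") for i in range(1000)]

with timed("write", timings):
    # Stand-in for: df_Addr.to_sql("DQ_Raw_Address", engine, ...)
    total = sum(i for i, _ in rows)

for label, seconds in timings.items():
    print(f"{label}: {seconds:.4f} s")
```

Wrapping the real read and write calls this way shows which of the two dominates before any driver is swapped out.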

***Option 2:*** Using PyODBC

#Saving Frame 1 in Table 1
saving_query_Address='DQ_Raw_Address'
insert_to_tbl = f"INSERT INTO {saving_query_Address} VALUES (?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)"
cursor = conn.cursor()  # conn: a pyodbc connection (its creation is not shown in the post)
cursor.fast_executemany = True
cursor.executemany(insert_to_tbl, df_Addr.values.tolist())
cursor.commit()
cursor.close()
                
#Saving Frame 2 in Table 2
saving_query_Bank='DQ_Raw_Bank'
insert_to_tmp_tbl_stmt = f"INSERT INTO {saving_query_Bank} VALUES (?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)"
cursor = conn.cursor()
cursor.fast_executemany = True
cursor.executemany(insert_to_tmp_tbl_stmt, df_Bank.values.tolist())
cursor.commit()
cursor.close()
                
#The logic follows for the remaining 10 Tables as well with the same settings.

#Total Time Taken: 200 seconds
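
Typing one `?` per column by hand is easy to get wrong on wide tables. A sketch that derives the statement from the column names, and converts NaN to None so that missing values arrive as NULL, could look like the following. `build_insert` and `to_rows` are hypothetical helpers, the column names are taken from the Bank example only for illustration, and the bracket quoting assumes that table and column names contain no `]` character:

```python
import math


def build_insert(table, columns):
    """Build a parameterised INSERT with one '?' per column."""
    column_list = ", ".join(f"[{name}]" for name in columns)
    placeholders = ", ".join("?" for _ in columns)
    return f"INSERT INTO [{table}] ({column_list}) VALUES ({placeholders})"


def to_rows(records):
    """Return rows as tuples, replacing float NaN with None."""
    def clean(value):
        if isinstance(value, float) and math.isnan(value):
            return None
        return value

    return [tuple(clean(value) for value in row) for row in records]


sql = build_insert("DQ_Raw_Bank", ["SOURCE_ID", "IBAN", "ACCOUNT_ID"])
rows = to_rows([["S1", "DE00", float("nan")], ["S2", float("nan"), "42"]])
print(sql)
print(rows)
```

With the code above this would be used as `cursor.executemany(build_insert(table, list(df.columns)), to_rows(df.values.tolist()))`. Naming the columns explicitly also keeps the insert working if the table's column order differs from the DataFrame's.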

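***Note:*** I have tried loading the data into the DataFrames from CSV, but it has made no difference so far. I cannot run a BULK INSERT query because I do not have bulk admin rights on the SQL Server. I also need to connect to the server through a VPN.
Versions used:

Pandas: 1.5.0, PyODBC: 4.0.34, SQLAlchemy: 1.4.42
I hope I have described the problem clearly.
Thank you very much!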

1 Answer


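It turned out that the problem was solved with the following two changes.

**1.** Reading a DataFrame with pd.read_excel() took about 10 seconds per frame. Using pd.read_csv() instead cut this from 10 seconds to just half a second.
**2.** For storing the data, TurbODBC worked best for me: it loaded all 12 frames in 20 seconds. Below is the link to the TurbODBC article that helped me store the data in the database in good time.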

https://erickfis.medium.com/etl-process-with-turbodbc-1d19ed71510e
I hope this helps anyone facing a similar problem.
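
For readers who do not want to follow the link, here is a rough sketch of the columnar insert style that turbodbc offers. It is not copied from the linked article. `insert_dataframe` and `open_connection` are hypothetical helpers, and the sketch assumes that the DataFrame's column names match the target table and that string columns hold strings or None (how missing values and object columns are handled may depend on the turbodbc version):

```python
def insert_dataframe(connection, table, df):
    """Insert all rows of `df` into `table` with one columnar batch call.

    `connection` is assumed to be a turbodbc connection; the column names of
    `df` are assumed to match the table's columns.
    """
    names = list(df.columns)
    column_list = ", ".join(f"[{name}]" for name in names)
    placeholders = ", ".join("?" for _ in names)
    sql = f"INSERT INTO [{table}] ({column_list}) VALUES ({placeholders})"

    # turbodbc takes one NumPy array per column instead of a list of rows.
    columns = [df[name].to_numpy() for name in names]

    cursor = connection.cursor()
    try:
        cursor.executemanycolumns(sql, columns)
        connection.commit()
    except Exception:
        connection.rollback()
        raise
    finally:
        cursor.close()
    return len(df)


def open_connection(connection_string):
    """Open a turbodbc connection (imported lazily; needs turbodbc installed)."""
    from turbodbc import connect, make_options

    # prefer_unicode is the setting the turbodbc docs suggest for MS SQL Server.
    options = make_options(prefer_unicode=True, autocommit=False)
    return connect(connection_string=connection_string, turbodbc_options=options)
```

A call would then look like `insert_dataframe(open_connection(connection_string), "DQ_Raw_Address", df_Addr)`, repeated for each of the 12 frames over a single shared connection.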

2023-01-15

pandas: Inserting DataFrames into an MS SQL Server database using to_sql() with SQLAlchemy takes too much time
