Problem description
I have a directory full of CSVs that need to be imported into different tables of a SQL Server database. Fortunately, the filenames of the attached CSVs start with the string "Concat_AAAAA_XX...", where the AAAAA part is an alphanumeric string and XX is a two-digit integer. Together they act as the key for a specific table in SQL.
My question is: what is the most elegant way to write a Python script that takes the AAAAA and XX values from each filename and knows which table to import the data into? For example:
CSV1 named: Concat_T101_14_20072021.csv
would need to be imported into Table A
CSV2 named: Concat_RB728_06_25072021.csv
would need to be imported into Table B
CSV3 named: Concat_T144_21_27072021.csv
would need to be imported into Table C
and so on...
I have read that the ConfigParser package might help, but I can't work out how to apply it here. The reason ConfigParser was suggested is that I want the flexibility to edit a config file (e.g. "config.ini") rather than having to hard-code new entries into the Python script.
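To make the ConfigParser idea concrete, here is a minimal sketch: the section name `[tables]` and the table names are assumptions, and the mapping can be edited in the INI file without touching the script.

```python
import configparser

# Example "config.ini" contents; in practice this lives on disk and is
# loaded with parser.read("config.ini"), so new entries need no code changes.
CONFIG_TEXT = """
[tables]
14 = TableA
06 = TableB
21 = TableC
"""

parser = configparser.ConfigParser()
parser.read_string(CONFIG_TEXT)

# plain dict: two-digit file number -> target table name
mapping = dict(parser["tables"])
print(mapping["14"])  # TableA
```

The script then only needs to extract the XX part from each filename and look it up in `mapping`.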
So far my code only works with a single standalone dataset, which can be found here.
Here is the code I am using:
import pypyodbc as odbc
import pandas as pd
import os

os.chdir('sql Loader')
df = pd.read_csv('Real-Time_Traffic_Incident_Reports.csv')
# %S (uppercase) is the seconds directive; %s is not valid here
df['Published Date'] = pd.to_datetime(df['Published Date']).dt.strftime('%Y-%m-%d %H:%M:%S')
df['Status Date'] = pd.to_datetime(df['Status Date']).dt.strftime('%Y-%m-%d %H:%M:%S')
df.drop(df.query('Location.isnull() | Status.isnull()').index, inplace=True)

columns = ['Traffic Report ID', 'Published Date', 'Issue Reported', 'Location',
           'Address', 'Status', 'Status Date']
df_data = df[columns]
records = df_data.values.tolist()

DRIVER = 'SQL Server'
SERVER_NAME = 'MY SERVER'
DATABASE_NAME = 'MYDATABASE'

def connection_string(driver, server_name, database_name):
    conn_string = f"""
        DRIVER={{{driver}}};
        SERVER={server_name};
        DATABASE={database_name};
        Trusted_Connection=yes;
    """
    return conn_string

try:
    conn = odbc.connect(connection_string(DRIVER, SERVER_NAME, DATABASE_NAME))
except odbc.DatabaseError as e:
    print('Database Error:')
    print(str(e.value[1]))
except odbc.Error as e:
    print('Connection Error:')
    print(str(e.value[1]))

# one placeholder per column in `records`, plus a GETDATE() timestamp
sql_insert = '''
    INSERT INTO Austin_Traffic_Incident
    VALUES (?, ?, ?, ?, ?, ?, ?, GETDATE())
'''

try:
    cursor = conn.cursor()
    cursor.executemany(sql_insert, records)
    conn.commit()
except Exception as e:
    conn.rollback()
    print(str(e))
finally:
    print('Task is complete.')
    cursor.close()
    conn.close()
Solution
You can use a dict, something like:
import re
from glob import glob

# mapping from the two-digit file number to the target table
translation_table = {
    '14': 'A', '06': 'B', '21': 'C'
}

# get all csv files from the current directory
for filename in glob("*.csv"):
    # extract the two-digit file number with a regular expression
    # (can also be done easily with the split function)
    filenum = re.match(r"^Concat_[A-Za-z0-9]+_([0-9]{2})_[0-9]{8}\.csv$", filename).group(1)
    # use the translation table to get the table name
    tablename = translation_table[filenum]
    print(f"Data from file '{filename}' goes to table '{tablename}'")
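Since the question says both the AAAAA part and the XX number act as the key, a variant can capture both groups and use a tuple as the dict key. The composite keys below are taken from the example filenames; the table names are assumptions.

```python
import re

# capture both the alphanumeric code (AAAAA) and the two-digit number (XX)
pattern = re.compile(r"^Concat_([A-Za-z0-9]+)_([0-9]{2})_[0-9]{8}\.csv$")

# hypothetical composite-key mapping, editable as needed
translation_table = {
    ("T101", "14"): "TableA",
    ("RB728", "06"): "TableB",
    ("T144", "21"): "TableC",
}

m = pattern.match("Concat_T101_14_20072021.csv")
if m:
    tablename = translation_table[(m.group(1), m.group(2))]
    print(tablename)  # TableA
```

Note that `re.match` returns `None` for non-matching filenames, so checking the result before calling `.group()` avoids an `AttributeError` on stray files in the directory.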
I would say there are several ways to do this kind of thing. You can use pure SQL, which I illustrate below, or you can use Python. If you want a Python solution, just reply and I will provide the code. Some people don't like it when others recommend solutions outside the specific technology listed in the original post, so here is the SQL solution.
DECLARE @intFlag INT
SET @intFlag = 1
WHILE (@intFlag <= 48)
BEGIN
    PRINT @intFlag

    declare @fullpath1 varchar(1000)
    select @fullpath1 = '''\\source\FTP1\' + convert(varchar, getdate() - @intFlag, 112) + '_SPGT.SPL'''
    declare @cmd1 nvarchar(1000)
    select @cmd1 = 'bulk insert [dbo].[table1] from ' + @fullpath1 + ' with (FIELDTERMINATOR = ''\t'', FIRSTROW = 5, ROWTERMINATOR = ''0x0a'')'
    exec (@cmd1)
    -------------------------------------------
    declare @fullpath2 varchar(1000)
    select @fullpath2 = '''\\source\FTP2\' + convert(varchar, getdate() - @intFlag, 112) + '_SPBMI_GL_PROP_USD_C.SPL'''
    declare @cmd2 nvarchar(1000)
    select @cmd2 = 'bulk insert [dbo].[table2] from ' + @fullpath2 + ' with (FIELDTERMINATOR = ''\t'', ROWTERMINATOR = ''0x0a'')'
    exec (@cmd2)
    -------------------------------------------
    declare @fullpath3 varchar(1000)
    select @fullpath3 = '''\\source\FTP3\' + convert(varchar, getdate() - @intFlag, 112) + '_SPBMI_GL_PROP_USD_C_ADJ.SPC'''
    declare @cmd3 nvarchar(1000)
    select @cmd3 = 'bulk insert [dbo].[table3] from ' + @fullpath3 + ' with (FIELDTERMINATOR = ''\t'', FIRSTROW = 7, ROWTERMINATOR = ''0x0a'')'
    exec (@cmd3)
    -------------------------------------------
    declare @fullpath4 varchar(1000)
    select @fullpath4 = '''\\source\FTP4\' + convert(varchar, getdate() - @intFlag, 112) + '_SPGTINFRA_ADJ.SPC'''
    declare @cmd4 nvarchar(1000)
    select @cmd4 = 'bulk insert [dbo].[table4] from ' + @fullpath4 + ' with (FIELDTERMINATOR = ''\t'', ROWTERMINATOR = ''0x0a'')'
    exec (@cmd4)

    SET @intFlag = @intFlag + 1
END
GO
Here is the Python solution you asked for.
Of course, the Python solution is much simpler.
from glob import glob

import pandas as pd
from sqlalchemy import create_engine

# note: extra URL parameters are joined with '&', not a second '?'
engine = create_engine(
    "mssql+pyodbc://server_name/db_name"
    "?driver=SQL+Server+Native+Client+11.0&trusted_connection=yes"
)

all_files = glob("*.csv")
for f in all_files:
    # load each file into a dataframe... something like...
    df = pd.read_csv(f, delimiter='\t', skiprows=0, header=[0])
    # you may or may not need to append dataframes first... depends on your setup
    # table_name must be chosen per file, e.g. derived from the filename
    df.to_sql(table_name, engine, if_exists='replace', index=True, chunksize=100000)
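Putting the two answers together, a sketch of routing each CSV to its table via the filename. The regex, the table names, and the connection URL in `load_all` are assumptions to adapt to your setup; `load_all` is defined but not called, since it needs a reachable SQL Server.

```python
import re
from glob import glob

# matches e.g. "Concat_T101_14_20072021.csv"; captures AAAAA and XX
PATTERN = re.compile(r"^Concat_([A-Za-z0-9]+)_([0-9]{2})_[0-9]{8}\.csv$")

# hypothetical mapping from the two-digit number to a table name
TABLES = {"14": "TableA", "06": "TableB", "21": "TableC"}

def table_for(filename):
    """Return the target table for a CSV filename, or None if it doesn't match."""
    m = PATTERN.match(filename)
    if m is None:
        return None
    return TABLES.get(m.group(2))

def load_all(engine_url):
    """Import every matching CSV in the current directory into its table."""
    # requires pandas, sqlalchemy, pyodbc, and a reachable SQL Server
    import pandas as pd
    from sqlalchemy import create_engine

    engine = create_engine(engine_url)
    for f in glob("*.csv"):
        table = table_for(f)
        if table is None:
            continue  # skip files that don't follow the naming convention
        pd.read_csv(f).to_sql(table, engine, if_exists="append", index=False)
```

Adding a new file pattern then only means adding one entry to `TABLES` (or, per the question, to a config.ini read with ConfigParser) rather than editing the script logic.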