Why doesn't my MySQL database contain all the data sent to it from Scrapy?

Problem description

I'm new to SQL and I'm trying to store a large amount of data in a MySQL database with Python. For some reason, after sending roughly 24,000 rows to my database, I found that it only contains about 1,300 rows.

My hard drive is not full.

I get no errors when pushing the data (I use Python).

It might be related to the InnoDB storage engine, but since those 1,300 rows only take up 176 KB, I doubt it. I can't rule it out, though, because the documentation mostly discusses limits in terms of bytes and pages rather than number of rows, and I don't fully understand it.

The statements I use when working with the database from Python are listed below.

  1. Database creation

    CREATE DATABASE database_name
    
  2. Table creation

     "CREATE TABLE table_name ( \
     id INT PRIMARY KEY,\
     price INT,\
     model VARCHAR(40),\
     year INT,\
     body VARCHAR(30),\
     milage INT,\
     engine_size FLOAT,\
     engine_power INT,\
     transmission VARCHAR(10),\
     fuel_type VARCHAR(30),\
     owners INT,\
     ultra_low_emission_zone INT,\
     service_history VARCHAR(30),\
     first_year_road_tax INT,\
     full_manufacturer_warranty INT \
     );"
    
  3. Pushing data to the database

     "INSERT INTO table_name VALUES\
     ("+str(id_carrier.carried_id + counter)+",\
     "+price+",\
     '"+model+"',\
     "+year+",\
     '"+body+"',\
     "+milage+",\
     "+engine_size+",\
     "+engine_power+",\
     '"+transmission+"',\
     '"+fuel_type+"',\
     "+owners+",\
     "+ultra_low_emission_zone+",\
     '"+service_history+"',\
     "+first_year_road_tax_included+",\
     "+manufacturer_warranty+" \
     );"
    


Any help would be greatly appreciated.

Edit 1: adding the code that processes my data. Rows are inserted into the database one at a time.

counter = 0

# retrieving offers
offers = response.xpath('//li[@class = "search-page__result"]')[1:-2]
for offer in offers:
    # reinitializing the data which can be missing
    year = '0'
    body = 'unlisted'
    milage = '1000000'
    engine_size = '50'
    engine_power = '1000000'
    transmission = 'unlisted'
    fuel_type = 'unlisted'
    owners = '100'
    ultra_low_emission_zone = '0'
    service_history = 'unlisted'
    first_year_road_tax_included = '0'
    manufacturer_warranty = '0'
    # getting price of offer
    price = Selector(text=offer.extract()).xpath('//div[@class = "product-card-pricing__price"]//span/text()').get()
    # formatting price of offer
    price = price.replace(',','').replace('£','')
    # getting offer model
    model = Selector(text=offer.extract()).xpath('//h3[@class = "product-card-details__title"]/text()').get()
    # formatting model
    model = model.replace('\n','').replace('BMW ','').strip().lower()
    # going through some clustered data and applying formatting
    clustered_details = Selector(text=offer.extract()).xpath('//li[@class = "atc-type-picanto--medium"]/text()').getall()
    for detail in clustered_details:
        if 'reg' in detail.lower():
            year = detail.split(' ')[0]
            continue
        elif detail.lower() == 'convertible' or \
                detail.lower() == 'coupe' or \
                detail.lower() == 'estate' or \
                detail.lower() == 'hatchback' or \
                detail.lower() == 'mpv' or \
                detail.lower() == 'suv' or \
                detail.lower() == 'saloon':
            body = detail.lower()
            continue
        elif 'miles' in detail:
            milage = detail.lower().replace(',','').replace(' miles','')
            continue
        elif detail[0] in '0123456' and detail[1] =='.' and detail[2] in '0123456':
            engine_size = detail.lower().replace('l','')
            continue
        elif detail[0].isnumeric() == True and detail[1].isnumeric() == True and 'p' in detail.lower():
            engine_power = first_number(detail)
            continue
        elif detail.lower() == 'manual' or detail.lower() == 'automatic':
            transmission = detail.lower()
            continue
        elif detail.lower() == 'diesel' or \
                detail.lower() == 'diesel hybrid' or \
                detail.lower() == 'diesel plug-in hybrid' or \
                detail.lower() == 'electric' or \
                detail.lower() == 'petrol' or \
                detail.lower() == 'petrol hybrid' or \
                detail.lower() == 'petrol plug-in hybrid':
            fuel_type = detail.lower()
            continue
        elif detail.lower() == 'full service history':
            service_history = 'full service history'
            continue
        elif detail.lower() == 'part non dealer' or detail.lower() == 'part service history':
            service_history = 'part service history'
            continue
        elif detail.lower() == 'full dealership history' or detail.lower() == 'full dealer':
            service_history = 'full dealership history'
            continue
        elif detail.lower() == 'ulez':
            ultra_low_emission_zone = '1'
            continue
        elif 'owner' in detail.lower():
            owners = detail.lower().split(' ')[0]
            continue
        elif detail.lower() == 'first year road tax included':
            first_year_road_tax_included = '1'
            continue
        elif detail.lower() == 'full manufacturer warranty':
            manufacturer_warranty = '1'
            continue
        else:
            print('Unexpected value ',detail)
            exit()
    counter += 1
    insert_query = "INSERT INTO " +make+ " VALUES\
        ("+str(id_carrier.carried_id + counter)+",\
            "+price+",\
            '"+model+"',\
            "+year+",\
            '"+body+"',\
            "+milage+",\
            "+engine_size+",\
            "+engine_power+",\
            '"+transmission+"',\
            '"+fuel_type+"',\
            "+owners+",\
            "+ultra_low_emission_zone+",\
            '"+service_history+"',\
            "+first_year_road_tax_included+",\
            "+manufacturer_warranty+" \
        );"
    db.execute_query(insert_query,connection)
    # print(price,model,year,body,milage,engine_size,engine_power,transmission,fuel_type,owners,ultra_low_emission_zone,service_history,\
    #      first_year_road_tax_included,manufacturer_warranty)

id_carrier.carried_id = id_carrier.carried_id + counter
try:
    next_page = response.xpath('//a[@class = "pagination--right__active"]/@data-paginate')[0].root
except IndexError:
    print("All the pages have been scraped")
    exit()
url = ".."+next_page
time.sleep(3 + random.uniform(0,4))
yield scrapy.Request(url=url,callback=self.parse,headers = header)

Edit 2: adding the database-management code for the functions that are not shown in the code from Edit 1.

import mysql.connector
from mysql.connector import Error
import pandas as pd

def create_server_connection(host_name,user_name,user_password,db_name = None):
    connection = None
    if db_name != None:
        try:
            connection = mysql.connector.connect(
                host=host_name,user=user_name,passwd=user_password,database=db_name
            )
            print("Connection to database " + db_name + " established.")
        except Error as err:
            if err.errno != 1049:
                print(f"Error: '{err}'")
                exit()
            print("Requested database does not exist. Creating it.")
            connection = mysql.connector.connect(
                host=host_name,user=user_name,passwd=user_password
            )
            create_database_query = "CREATE DATABASE " + db_name
            create_database(create_database_query,connection)
            connection = mysql.connector.connect(
                host=host_name,user=user_name,passwd=user_password,database=db_name
            )
            print("Connection to database " + db_name + " established.")

    return connection

def create_database(query,connection):
    cursor = connection.cursor()
    try:
        cursor.execute(query)
        print("Database created successfully.")
    except Error as err:
        print(f"Error: '{err}'")

def execute_query(query,connection):
    cursor = connection.cursor()
    try:
        cursor.execute(query)
        connection.commit()
    except Error as err:
        if err.errno != 1050:
            print(f"Error: '{err}'")
            exit()
        print("Table " + query.split('TABLE')[1][1:].split(' ')[0] + " exists. Continuing script.")

Solution

I'm guessing (but I'm not sure) that you are trying to do one gigantic INSERT operation with all 24k rows in its VALUES list.

MySQL has a (long) limit on statement length. It's usually not a problem, but it may have cut off your huge INSERT.
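
To see that limit on your own server you can check the max_allowed_packet variable. This is a minimal sketch, assuming an open mysql.connector connection named connection like the one used elsewhere in this question:

# Sketch: check the server-side limit on the size of a single statement.
# `connection` is assumed to be an open mysql.connector connection.
cursor = connection.cursor()
cursor.execute("SHOW VARIABLES LIKE 'max_allowed_packet'")
name, value = cursor.fetchone()
print(name, value, "bytes")  # statements larger than this are not accepted by the server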

Try doing the INSERTs in batches of 100 rows or so.
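
A minimal sketch of that batched approach, assuming an open mysql.connector connection and rows prepared as 15-element tuples in the same column order as table_name from the question (the helper name insert_in_batches and the batch size are illustrative):

# Sketch: insert rows in batches with a parameterized multi-row INSERT.
# `connection` is an open mysql.connector connection; `rows` is a list of
# 15-element tuples matching the columns of table_name above.
def insert_in_batches(connection, rows, batch_size=100):
    insert_sql = (
        "INSERT INTO table_name VALUES "
        "(%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)"
    )
    cursor = connection.cursor()
    for start in range(0, len(rows), batch_size):
        cursor.executemany(insert_sql, rows[start:start + batch_size])
        connection.commit()  # commit each batch so one huge transaction never builds up

Using %s placeholders also avoids the quoting and escaping problems that come with building the statement by string concatenation.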

Edit: Thanks for clarifying your data flow. MySQL's Python connector does not automatically commit INSERTs and UPDATEs.

That means the MySQL server accumulates your changes in a transaction. 24K rows is a lot for one transaction and may exceed the transaction buffer space.

So, after every 100 rows or so that you insert, you should do this:

cnx.commit()

Alternatively, you can set cnx.autocommit = True when you set up the connection. But bulk loading while committing rows one at a time is very slow.
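
For reference, a minimal sketch of the autocommit variant (host, user, password and database names are placeholders):

# Sketch: enable autocommit when opening the connection.
import mysql.connector

cnx = mysql.connector.connect(
    host="localhost", user="user", passwd="password", database="database_name"
)
cnx.autocommit = True  # every successful execute() is committed immediately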

The Python connector differs from most other languages' connectors in that it does not autocommit by default. This is confusing.


I would like to thank Chris Schaller and O. Jones for their help. With their guidance I managed to trace the problem back to the Scrapy web crawler. As detailed in the documentation here and in a StackOverflow post here, the maximum number of operations per response is 100.
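
If the limit in question is Scrapy's CONCURRENT_ITEMS setting (which defaults to 100 items processed in parallel per response), it can be raised in the project's settings.py. This is only an assumption, since the links above are not reproduced here; a sketch:

# settings.py (sketch): CONCURRENT_ITEMS caps how many items are processed
# in parallel per response; Scrapy's default is 100. The value below is
# purely illustrative.
CONCURRENT_ITEMS = 500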