Web Scraping in Practice: Scraping Fang.com New-House Listings with Selenium + Chrome


Introduction

This scraper uses Selenium driving the Chrome browser to collect the listings, and stores the results in a MongoDB database. The third-party dependencies are selenium, pymongo, and lxml, plus a ChromeDriver build matching the local Chrome version.
Disclaimer
The code is for study and exchange only; do not use it for anything illegal.

Approach

Start URL: https://www.fang.com/SoufunFamily.htm
From this page we can collect the link of every major city in China; splicing each city link then gives the URL of that city's new-house listing page. Because the listings are paginated, we can locate the "next page" button with XPath and simulate a click to reach the next page, then extract that page's listings; repeating these steps completes the crawl. A minimal sketch of the URL splicing follows, and the full code comes after it.
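For example, a city link yields a subdomain abbreviation, which is spliced into that city's new-house list URL (Beijing is handled as a special case in the full code). A minimal sketch, assuming city links of the form 'https://<abbr>.fang.com/' as they appear on the start page ('sh' below is only an illustrative abbreviation):

# Minimal sketch of the URL splicing; 'sh' is an illustrative city abbreviation.
city_url = 'https://sh.fang.com/'
city_name = city_url.split('//')[1].split('.')[0]   # -> 'sh'
newhouse_url = 'http://' + city_name + '.newhouse.fang.com/house/s/'
# -> 'http://sh.newhouse.fang.com/house/s/'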

Complete Code

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Time:    2020/2/6 21:00
# @Author:  Martin
# @File:    房天下.py
# @Software:PyCharm
import re
import time
import pymongo
from lxml import etree
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException


class FangSpider(object):
    def __init__(self):
        # Requires a chromedriver.exe matching the local Chrome version next to this script.
        self.driver = webdriver.Chrome(executable_path='./chromedriver.exe')
        self.start_url = 'https://www.fang.com/SoufunFamily.htm'
        self.client = pymongo.MongoClient(host="localhost", port=27017)
        self.db = self.client['fangtianxia']

    def run(self):
        # Open the city-list page and wait until the city table has rendered.
        self.driver.get(self.start_url)
        WebDriverWait(self.driver, timeout=10).until(
            EC.presence_of_element_located((By.XPATH, '//table[@class="table01"]/tbody/tr'))
        )
        source = self.driver.page_source
        html = etree.HTML(source)
        # Drop the last row, which is not a regular city row.
        trs = html.xpath('//table[@class="table01"]/tbody/tr')[:-1]
        for tr in trs:
            a_list = tr.xpath('./td[3]/a')
            for a in a_list:
                city = a.xpath('./text()')[0]
                city_url = a.xpath('./@href')[0]
                # Splice the city abbreviation into the new-house list URL.
                city_name = city_url.split("//")[1].split(".")[0]
                newhouse_url = "http://" + city_name + ".newhouse.fang.com/house/s/"
                # Beijing is a special case: it uses the bare newhouse domain.
                if city == "北京":
                    newhouse_url = 'https://newhouse.fang.com/house/s/'
                self.parse_newhouse(newhouse_url, city)
                time.sleep(1)

    def parse_newhouse(self, newhouse_url, city):
        # Open the city's new-house list and wait for the listing items to render.
        self.driver.get(newhouse_url)
        WebDriverWait(self.driver, timeout=10).until(
            EC.presence_of_element_located((By.XPATH, '//div[@id="newhouse_loupai_list"]/ul/li'))
        )
        while True:
            time.sleep(1)
            source = self.driver.page_source
            self.get_newhouse_info(source, city)
            try:
                # Click "next page"; on the last page the link is absent, so we stop.
                btn = self.driver.find_element(By.XPATH, '//div[@class="page"]/ul/li[2]/a[@class="next"]')
                btn.click()
            except NoSuchElementException:
                break

    def get_newhouse_info(self, source, city):
        html = etree.HTML(source)
        li_list = html.xpath('//div[@id="newhouse_loupai_list"]/ul/li')
        for li in li_list:
            name = "".join(li.xpath('.//div[@class="nlc_details"]/div[1]/div/a/text()')).strip()
            origin_url_list = li.xpath('.//div[@class="nlc_details"]/div[1]/div/a/@href')
            if origin_url_list:
                origin_url = "http:" + origin_url_list[0]
            else:
                origin_url = ""
            # Collect room types (strings ending in "居"); other anchors in this div are ignored.
            a_list = li.xpath('.//div[@class="nlc_details"]/div[2]/a')
            room_type = ""
            for a in a_list:
                text = a.xpath('./text()')
                if text and text[0].endswith("居"):
                    room_type += text[0]
            area = "".join(li.xpath('.//div[@class="nlc_details"]/div[2]/text()'))
            area = re.sub(r'\s', "", area).replace("/", "").replace("-", "")
            address = li.xpath('.//div[@class="nlc_details"]/div[3]/div/a/@title')
            if address:
                address = address[0]
            else:
                address = ""
            price = li.xpath('.//div[@class="nhouse_price"]/span/text()') + li.xpath('.//div[@class="nhouse_price"]/em/text()')
            if len(price) == 2:
                price = price[0] + price[1]
            else:
                price = ""
            sale = li.xpath('.//div[contains(@class,"fangyuan")]/span/text()')
            if sale:
                sale = sale[0]
            else:
                sale = ""
            label_list = li.xpath('.//div[contains(@class,"fangyuan")]/a')
            label = ""
            for a in label_list:
                text = a.xpath('./text()')
                if text:
                    text = text[0]
                else:
                    text = ""
                label += text
            house = {
                'city': city,
                'name': name,
                'room_type': room_type,
                'area': area,
                'price': price,
                'sale': sale,
                'label': label,
                'address': address,
                'origin_url': origin_url
            }
            print(house)
            self.save(house)

    def save(self, info):
        # Insert one listing document into the 'fangtianxia' collection.
        self.db.fangtianxia.insert_one(info)

    def close(self):
        # Shut down the browser and the MongoDB connection.
        self.driver.quit()
        self.client.close()


if __name__ == '__main__':
    spider = FangSpider()
    spider.run()
    spider.close()

Results

I stopped the scraper before it finished; it had run for roughly five or six minutes, and the database already held over a thousand records.
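To spot-check the count, a minimal pymongo sketch (assuming the same localhost MongoDB and the database/collection names used in the code above):

# Hypothetical verification snippet, run separately from the spider.
import pymongo

client = pymongo.MongoClient(host='localhost', port=27017)
print(client['fangtianxia'].fangtianxia.count_documents({}))  # total stored listings
client.close()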


Summary

When scraping pages with Selenium, a real browser has to run, so overall efficiency is relatively low. But when you cannot find the site's data API, or the pages use JavaScript obfuscation, Selenium is still a decent option. One way to recover some speed is to run Chrome headless, as sketched below.
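A minimal headless-Chrome variant of the driver setup (an assumption on my part; the original code opens a visible browser window):

# Hypothetical headless setup; the rest of the spider is unchanged.
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument('--headless')     # render pages without opening a window
options.add_argument('--disable-gpu')
driver = webdriver.Chrome(executable_path='./chromedriver.exe', options=options)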

