Douban Book Rating Bulk Scraper Script in Python

2011·07·15 #Works

This script looks up Douban book ratings by ISBN. data.csv is the input list: it holds one ISBN per line. This version crawls with a single thread and can only read its input from a CSV file. I wrote it at a friend's request and the data volume is small, so I will improve it later only if the need arises. The script itself is simple: it searches Douban for the ISBN, follows the first result to the book page, and uses BeautifulSoup to pull the rating out of the HTML.
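To make the extraction step concrete, here is a minimal sketch in Python 3 that pulls the rating out of a book page. It is not the script's own method: it swaps BeautifulSoup for the standard library's html.parser so that it runs without extra packages. The names RatingParser and extract_rating are made up for this illustration, and the sample markup is a hypothetical fragment built from the class name the script searches for, not a capture of the live site. Unlike the script, which returns the rating as a string, this sketch converts it to a float.

```python
from html.parser import HTMLParser


class RatingParser(HTMLParser):
    """Collects the text of the first <strong> whose classes include rating_num."""

    def __init__(self):
        super().__init__()
        self._inside = False   # True while inside the rating element
        self.rating = None     # text of the rating element, once found

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "strong" and "rating_num" in classes and self.rating is None:
            self._inside = True

    def handle_data(self, data):
        if self._inside and data.strip():
            self.rating = data.strip()

    def handle_endtag(self, tag):
        if tag == "strong":
            self._inside = False


def extract_rating(html):
    """Return the rating as a float, or 0.0 when no rating text is found."""
    parser = RatingParser()
    parser.feed(html)
    try:
        return float(parser.rating)
    except (TypeError, ValueError):
        return 0.0


# Hypothetical page fragment, assumed for illustration only.
sample = '<div><strong class="ll rating_num"> 8.7 </strong></div>'
print(extract_rating(sample))
```

Returning 0.0 when nothing is found mirrors the fallback value the script uses for every failure case.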

P.S. I have several interesting projects on hand at the moment and hope to finish them soon :)

import urllib,urllib2
import re
import BeautifulSoup

def isbn_2_score(isbn):
    url = 'http://www.douban.com/subject_search?search_text='
    try:
        response = urllib2.urlopen(url+isbn)
    except Exception,e:
        return 0.0
    doc = response.read()
    soup = BeautifulSoup.BeautifulSoup(''.join(doc))
    try:
        book_info = soup.find("a",{"class":"nbg"})
    except Exception,e:
        return 0.0
    if isinstance(book_info,BeautifulSoup.Tag):
        url_book_info = book_info['href']
        try:
            response = urllib2.urlopen(url_book_info)
        except Exception,e:
            return 0.0
        book_page = response.read()
        soup = BeautifulSoup.BeautifulSoup(''.join(book_page))
        score_info = soup.find('strong','ll rating_num')
        if isinstance(score_info,BeautifulSoup.Tag):
            score = score_info.string
            return score
        return 0.0
    return 0.0

def read_file(file_name):
    file_handler = open(file_name,'r')
    return file_handler

def return_isbn(file_handler):
    isbn = file_handler.readline()
    return isbn


if __name__ == '__main__':
    data = read_file('data.csv')
    f = open('dump','w')
    k = return_isbn(data)
    # readline() returns '' at end of file, never None
    while k:
        isbn = k.strip()
        if isbn:  # skip blank lines
            score = isbn_2_score(isbn)
            result = isbn+":"+str(score)+"\n"
            print result
            f.write(result)
        k = return_isbn(data)
    f.close()
    data.close()
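
The script above handles one ISBN at a time. As a rough idea of how a multi-threaded version might look, here is a sketch in Python 3 using the standard library's concurrent.futures. The function names read_isbns and crawl_all, the max_workers default, and the placeholder input values are all assumptions made for this illustration and are not part of the project. The lookup function is passed in as a parameter, so the sketch runs offline with a stand-in; a real run would pass a function that performs the two HTTP requests.

```python
from concurrent.futures import ThreadPoolExecutor


def read_isbns(lines):
    """Yield one cleaned ISBN per non-blank input line."""
    for line in lines:
        isbn = line.strip()
        if isbn:
            yield isbn


def crawl_all(lines, lookup, max_workers=4):
    """Return "isbn:score" strings in input order.

    `lookup` is any callable that maps an ISBN string to a score.
    """
    isbns = list(read_isbns(lines))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() keeps results in input order even if lookups finish out of order
        scores = list(pool.map(lookup, isbns))
    return ["%s:%s" % (isbn, score) for isbn, score in zip(isbns, scores)]


if __name__ == "__main__":
    # Stand-in lookup and placeholder input so the sketch runs without network access.
    def fake_lookup(isbn):
        return 0.0

    for row in crawl_all(["isbn-one\n", "\n", "isbn-two\n"], fake_lookup):
        print(row)
```

With a real file, crawl_all(open('data.csv'), some_lookup) would read the same one-ISBN-per-line format. Keeping max_workers small is probably wise so the crawler does not send too many requests to the site at once.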

Project address: https://github.com/quake0day/douban_crawler