Scraping OJ problems with Python (1): analyzing OJ page elements with BeautifulSoup

It's finally done, so here are my notes.

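My task covered two judges, hdoj and toj, but in practice it was really just one: hdoj took about four days, and toj was done in a single morning. So I'll set toj aside and use the hdoj work to explain how to grab the text, images and so on from an OJ and store them in a database. And of course, to check that the result is correct, something as handy as django is a must.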

btw: so that is what django's static files are about. More on that another time.

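First open the HDU site, find Problem Archive, and look at the problem list at http://acm.hdu.edu.cn/listproblem.php?vol=1. There are a lot of them. Click any problem, for example 1056 (one that gave me a lot of trouble) at http://acm.hdu.edu.cn/showproblem.php?pid=1056, or 1057 at http://acm.hdu.edu.cn/showproblem.php?pid=1057. The first thing to do is analyze the elements of this page. Why? Because all of this will sooner or later be stored in a database, so you first need to see which columns the table will have, and also which parts are the same across problems with different ids, so that one function can handle them all. So, open Firebug in Firefox or the developer tools in Chrome, and you will see something like this.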

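Looking at the page, a problem seems to contain 1.title 2.limit des 3.problem des 4.input 5.output 6.sample input 7.sample output 8.hint 9.author 10.source 11.recommend 12.images. At first I assumed the first five would always be there: surely every problem has a title, limits, a description, input and output. Then, after finishing my first version, I ran into the odd problem 1056, which has no Input or Output section at all. I was scraping from problem 1000 to 2000, and every time the run reached 1056 Python raised an exception and the whole thing died. I couldn't work out what was going on and searched everywhere, until I finally looked at 1056 itself and saw the reason.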
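
The lesson from 1056 can be put into the data model directly. Below is a minimal sketch, assuming Python 3 rather than the Python 2 used in the listing later on. `ProblemRecord` and its field names are made up for illustration and are not the table I actually built. Only the title is required, and every other section defaults to `None`.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ProblemRecord:
    """One scraped problem. Hypothetical names, not the real schema."""
    title: str
    limit_des: Optional[str] = None
    problem_des: Optional[str] = None
    input_des: Optional[str] = None      # missing on problems like 1056
    output_des: Optional[str] = None     # missing on problems like 1056
    sample_input: Optional[str] = None
    sample_output: Optional[str] = None
    hint: Optional[str] = None
    author: Optional[str] = None
    source: Optional[str] = None
    recommend: Optional[str] = None
    images: List[str] = field(default_factory=list)

    def missing_sections(self):
        """Names of the sections that were not found on the page."""
        return [name for name, value in vars(self).items() if value is None]


record = ProblemRecord(title="Some problem", problem_des="Some description")
```

With this shape, a page without Input or Output just produces a record where those fields are `None`, and `record.missing_sections()` shows what was absent.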

So, don't take anything for granted.

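Later I asked a senior student for help. He had built a similar OJ scraper before and gave me a diagram. It is really good, so I'm not keeping it to myself and am uploading it here. Seniors can do anything. For my current task, though, only the problem column matters.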

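ok, in the end it turns out that every HDU problem lives at http://acm.hdu.edu.cn/showproblem.php?pid= followed by a problem id (4 digits), so the OJs presumably keep their problems in a database too.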
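
Since the address is just a fixed prefix plus the id, walking a range of problems is a simple loop. Here is a rough sketch, assuming Python 3. `problem_url` and `crawl_range` are names invented for this example, and `fetch` stands for whatever function downloads and parses one page. The loop records a failure and moves on, so one odd problem such as 1056 no longer kills the whole run.

```python
BASE_URL = "http://acm.hdu.edu.cn/showproblem.php?pid="


def problem_url(pid):
    """Build the page address for one problem id."""
    return BASE_URL + str(pid)


def crawl_range(first, last, fetch):
    """Call fetch(url) for every id in [first, last].

    Returns (results, failures): results maps id -> whatever fetch returned,
    failures maps id -> the exception raised for that id.
    """
    results = {}
    failures = {}
    for pid in range(first, last + 1):
        try:
            results[pid] = fetch(problem_url(pid))
        except Exception as exc:  # keep going, but remember what went wrong
            failures[pid] = exc
    return results, failures
```

After a run, the ids in `failures` are the pages worth opening by hand to see what is different about them.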
All right, now let's go through the code to see how BeautifulSoup analyzes the page.
Code first:
# -*- coding: utf-8 -*-
import urllib2
import traceback
from BeautifulSoup import BeautifulSoup
from sqlalchemy import *
from sqlalchemy.orm import *

def catch(url=None, pro_image='/images/hdoj/'):
    """ return 12 infos
    1.title 2.limit des 3.problem des 4.input 5.output
    6.sample input 7.sample output 8.hint 9.author
    10.source 11.recommend 12.images
    the last element is a list of images """
    content_stream = urllib2.urlopen(url)
    content = content_stream.read()
    print 'catching: ' + url
    soup = BeautifulSoup(content)
    table = soup.table

    # images: collect the real urls (the first <img> in the table is skipped)
    images_src = table.findAll('img')[1:]
    images = []

    for img in images_src:
        image = str(img['src'])
        images.append(image)
        # now change the url in the tag: pro_image + original file name
        img['src'] = pro_image + image.split('/')[-1]

    # title
    table_title = table.find('h1')
    # hidden = True: render only the tag's contents, not the tag itself
    table_title.hidden = True
    # limits description, the <span> right below the title
    table_limit_des = table_title.findNext('span')
    table_limit_des.hidden = True

    # problem description, input, output, sample input, sample output:
    # find the heading text, then take the next panel_content <div>.
    # If the heading is missing, find() returns None, the chained call
    # raises, and the section is stored as None.
    try:
        table_problem_des = table.find(text='Problem Description').findNext('div', {'class':'panel_content'})
        table_problem_des.hidden = True
    except Exception:
        table_problem_des = None

    # input
    try:
        table_input = table.find(text='Input').findNext('div', {'class':'panel_content'})
        table_input.hidden = True
    except Exception:
        table_input = None
    # output
    try:
        table_output = table.find(text='Output').findNext('div', {'class':'panel_content'})
        table_output.hidden = True
    except Exception:
        table_output = None
    # sample input
    try:
        table_sample_input = table.find(text='Sample Input').findNext('div', {'class':'panel_content'})
        table_sample_input.hidden = True
    except Exception:
        table_sample_input = None
    # sample output
    try:
        table_sample_output = table.find(text='Sample Output').findNext('div', {'class':'panel_content'})
        table_sample_output.hidden = True
    except Exception:
        table_sample_output = None

    # hint: read from inside the sample output block, after its <i> tag
    try:
        table_hint = table_sample_output.i.next.next
    except Exception:
        table_hint = None
    # cut the sample output back to the part before that <i> tag
    try:
        table_sample_output = table_sample_output.i.previous.previous.previous
    except Exception:
        # no <i> tag (or no sample output): leave it unchanged
        pass
