Code that uses BeautifulSoup doesn't work.

It stops working when I switch to Python 2.6. What is the cause?

Python 2.6+

[nori@asama]~/Desktop/work/tonic/bygit% sudo /usr/local/bin/easy_install-2.6 BeautifulSoup
Searching for BeautifulSoup
Best match: BeautifulSoup 3.1.0.1
Processing BeautifulSoup-3.1.0.1-py2.6.egg
BeautifulSoup 3.1.0.1 is already the active version in easy-install.pth
Installing testall.sh script to /usr/local/bin
Installing to3.sh script to /usr/local/bin

Using /usr/local/lib/python2.6/site-packages/BeautifulSoup-3.1.0.1-py2.6.egg
Processing dependencies for BeautifulSoup
Finished processing dependencies for BeautifulSoup

Python 2.4.3

[nori@asama]~/Desktop/work/tonic/bygit% yum info python-BeautifulSoup
doing bootstrap
Loaded plugins: allowdowngrade, downloadonly, fastestmirror, kernel-module, merge-conf, priorities
Installed Packages
Name       : python-BeautifulSoup
Arch       : noarch
Version    : 3.0.7a
Release    : 3.el5
Size       : 204 k
Repo       : installed
Summary    : HTML/XML parser for quick-turnaround applications like screen-scraping
URL        : http://www.crummy.com/software/BeautifulSoup/
License    : BSD
Description: Beautiful Soup is a Python HTML/XML parser designed for quick turnaround projects like
           : screen-scraping. Three features make it powerful:  Beautiful Soup won't choke if you
           : give it bad markup.  Beautiful Soup provides a few simple methods and Pythonic idioms
           : for navigating, searching, and modifying a parse tree.  Beautiful Soup automatically
           : converts incoming documents to Unicode and outgoing documents to UTF-8.  Beautiful Soup
           : parses anything you give it.  Valuable data that was once locked up in poorly-designed
           : websites is now within your reach. Projects that would have taken hours take only
           : minutes with Beautiful Soup.

I diffed the two implementations (3.1.0.1 on the left, 3.0.7a on the right).

45c45
< Copyright (c) 2004-2009, Leonard Richardson
---
> Copyright (c) 2004-2008, Leonard Richardson
82,83c82,83
< __version__ = "3.1.0.1"
< __copyright__ = "Copyright (c) 2004-2009 Leonard Richardson"
---
> __version__ = "3.0.7a"
> __copyright__ = "Copyright (c) 2004-2008 Leonard Richardson"
85a86
> from sgmllib import SGMLParser, SGMLParseError
90c91
< from HTMLParser import HTMLParser, HTMLParseError
---
1263c1131
< self.builder.feed(markup)
---
> SGMLParser.feed(self, markup)
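
The import lines are the telling part of that diff, so a small check on the module source is enough to see which parser a given copy is built on. A minimal sketch, assuming the only signal needed is the import statement; detect_parser_backend is a name I made up, not part of BeautifulSoup:

```python
def detect_parser_backend(source_text):
    """Guess the parser a BeautifulSoup.py source is built on.

    Hypothetical helper: it only looks for the two import lines that
    appear in the diff and returns None if neither is present.
    """
    for line in source_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("from sgmllib import"):
            return "sgmllib"
        if stripped.startswith("from HTMLParser import"):
            return "HTMLParser"
    return None


# The two import lines taken from the diff.
old_source = "from sgmllib import SGMLParser, SGMLParseError\n"
new_source = "from HTMLParser import HTMLParser, HTMLParseError\n"

print(detect_parser_backend(old_source))
print(detect_parser_backend(new_source))
```

To run it against an installed copy, read the text of BeautifulSoup.py and pass it in. Where that file lives depends on how the package was installed.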

Grepping Tests.py for feed finds nothing. Apparently it is not something callers are meant to use.

According to the entry for Release 3.1.0 (2008/12/27) at http://www.crummy.com/software/BeautifulSoup/CHANGELOG.html:

Beautiful Soup is now based on HTMLParser rather than SGMLParser, which is gone in Python 3. There's some bad HTML that SGMLParser handled but HTMLParser doesn't, usually to do with attribute values that aren't closed or have brackets inside them:

So that is how it is. orz. See this as well.
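
Given that changelog entry, a guard on the version string can at least flag the HTMLParser-based series before it bites. This is only a sketch: both function names are made up, and it assumes that "starts with 3.1" is a good enough test for the affected releases.

```python
import warnings


def is_htmlparser_series(version):
    """True if the version string belongs to the 3.1 series.

    Assumption: the changelog says 3.1.0 switched to HTMLParser, so
    anything whose first two components are 3 and 1 is treated as affected.
    """
    return version.split(".")[:2] == ["3", "1"]


def warn_if_htmlparser_series(version):
    """Emit a warning for the 3.1 series; return whether it warned."""
    if is_htmlparser_series(version):
        warnings.warn(
            "BeautifulSoup %s is HTMLParser-based; bad markup may fail" % version
        )
        return True
    return False
```

In real use the argument would be BeautifulSoup.__version__, the attribute that shows up in the diff.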

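Before even getting to the feed question, using the 3.1.0 series does not look like a good choice. The alternatives are migrating to lxml or to html5lib. Of the two, lxml looks more actively developed.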

The other option is to roll back to the 3.0.7 line (yum ships 3.0.7a; the tarball below is 3.0.7).

[nori@asama]~/Desktop/temp% sudo /usr/local/bin/easy_install-2.6 http://www.crummy.com/software/BeautifulSoup/download/3.x/BeautifulSoup-3.0.7.tar.gz
Downloading http://www.crummy.com/software/BeautifulSoup/download/3.x/BeautifulSoup-3.0.7.tar.gz
Processing BeautifulSoup-3.0.7.tar.gz
Running BeautifulSoup-3.0.7/setup.py -q bdist_egg --dist-dir /tmp/easy_install-EqwSMK/BeautifulSoup-3.0.7/egg-dist-tmp-q8jES0
zip_safe flag not set; analyzing archive contents...
/usr/local/lib/python2.6/site-packages/setuptools-0.6c9-py2.6.egg/setuptools/command/bdist_egg.py:422: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
Adding BeautifulSoup 3.0.7 to easy-install.pth file

Installed /usr/local/lib/python2.6/site-packages/BeautifulSoup-3.0.7-py2.6.egg
Processing dependencies for BeautifulSoup==3.0.7
Finished processing dependencies for BeautifulSoup==3.0.7

It worked without any fuss. But if I plan to move up to Python 3+, this will not do. And there is far too little time to switch over to lxml right now. Oh well.
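
One way to keep the later move cheap would be to hide the parser behind a single function, so only that function changes when the library does. A rough sketch, not my actual code: extract_links and _HrefCollector are hypothetical names, and pulling href values out of anchors is just an example task. It uses lxml.html if it is installed and otherwise falls back to the standard-library HTMLParser, which does not promise the same tolerance for bad markup.

```python
try:
    # Python 3 name of the standard-library parser
    from html.parser import HTMLParser
except ImportError:
    # Python 2 name of the same module
    from HTMLParser import HTMLParser

try:
    import lxml.html  # third-party and optional in this sketch
except ImportError:
    lxml = None


class _HrefCollector(HTMLParser):
    """Fallback: collect href values from <a> start tags."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)


def extract_links(markup):
    """Return the href of every <a> element, in document order."""
    if lxml is not None:
        doc = lxml.html.fromstring(markup)
        return list(doc.xpath("//a/@href"))
    collector = _HrefCollector()
    collector.feed(markup)
    collector.close()
    return collector.hrefs


print(extract_links('<p><a href="/a">x</a> <a href="/b">y</a></p>'))
```

Callers only ever see extract_links, so replacing the body with a different parser later does not touch the rest of the code.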