Code that uses BeautifulSoup has stopped working.
It breaks as soon as I switch to Python 2.6. What is the cause?
Python 2.6+:
[nori@asama]~/Desktop/work/tonic/bygit% sudo /usr/local/bin/easy_install-2.6 BeautifulSoup
Searching for BeautifulSoup
Best match: BeautifulSoup 3.1.0.1
Processing BeautifulSoup-3.1.0.1-py2.6.egg
BeautifulSoup 3.1.0.1 is already the active version in easy-install.pth
Installing testall.sh script to /usr/local/bin
Installing to3.sh script to /usr/local/bin
Using /usr/local/lib/python2.6/site-packages/BeautifulSoup-3.1.0.1-py2.6.egg
Processing dependencies for BeautifulSoup
Finished processing dependencies for BeautifulSoup
Python 2.4.3:
[nori@asama]~/Desktop/work/tonic/bygit% yum info python-BeautifulSoup
doing bootstrap
Loaded plugins: allowdowngrade, downloadonly, fastestmirror, kernel-module, merge-conf, priorities
Installed Packages
Name       : python-BeautifulSoup
Arch       : noarch
Version    : 3.0.7a
Release    : 3.el5
Size       : 204 k
Repo       : installed
Summary    : HTML/XML parser for quick-turnaround applications like screen-scraping
URL        : http://www.crummy.com/software/BeautifulSoup/
License    : BSD
Description: Beautiful Soup is a Python HTML/XML parser designed for quick turnaround projects like
           : screen-scraping. Three features make it powerful: Beautiful Soup won't choke if you
           : give it bad markup. Beautiful Soup provides a few simple methods and Pythonic idioms
           : for navigating, searching, and modifying a parse tree. Beautiful Soup automatically
           : converts incoming documents to Unicode and outgoing documents to UTF-8. Beautiful Soup
           : parses anything you give it. Valuable data that was once locked up in poorly-designed
           : websites is now within your reach. Projects that would have taken hours take only
           : minutes with Beautiful Soup.
I diffed the two implementations.
45c45
< Copyright (c) 2004-2009, Leonard Richardson
Grepping Tests.py shows that feed does not appear there at all. Apparently it is not something you are supposed to use.
According to the Release 3.1.0 (2008/12/27) entry at http://www.crummy.com/software/BeautifulSoup/CHANGELOG.html:
Beautiful Soup is now based on HTMLParser rather than SGMLParser, which is gone in Python 3. There's some bad HTML that SGMLParser handled but HTMLParser doesn't, usually to do with attribute values that aren't closed or have brackets inside them:
So that is how it is. orz. See also:
- http://www.crummy.com/software/BeautifulSoup/3.1-problems.html
- http://stackoverflow.com/questions/459552/beautifulsoup-3-1-parser-breaks-far-too-easily
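Quite apart from the feed issue, the 3.1.0 series does not look like a good choice. The options are to migrate to lxml or to html5lib; lxml appears to be the more actively developed of the two.
The other option is to roll back to 3.0.7a.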
[nori@asama]~/Desktop/temp% sudo /usr/local/bin/easy_install-2.6 http://www.crummy.com/software/BeautifulSoup/download/3.x/BeautifulSoup-3.0.7.tar.gz
Downloading http://www.crummy.com/software/BeautifulSoup/download/3.x/BeautifulSoup-3.0.7.tar.gz
Processing BeautifulSoup-3.0.7.tar.gz
Running BeautifulSoup-3.0.7/setup.py -q bdist_egg --dist-dir /tmp/easy_install-EqwSMK/BeautifulSoup-3.0.7/egg-dist-tmp-q8jES0
zip_safe flag not set; analyzing archive contents...
/usr/local/lib/python2.6/site-packages/setuptools-0.6c9-py2.6.egg/setuptools/command/bdist_egg.py:422: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
Adding BeautifulSoup 3.0.7 to easy-install.pth file
Installed /usr/local/lib/python2.6/site-packages/BeautifulSoup-3.0.7-py2.6.egg
Processing dependencies for BeautifulSoup==3.0.7
Finished processing dependencies for BeautifulSoup==3.0.7
It worked right away. However, this will not do if I intend to move up to Python 3+. There is nowhere near enough time to switch to lxml right now. Oh well.
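Since the fix for now is to stay on the 3.0.x series, a small guard could make that pin explicit. The following is only a sketch: `parse_version` and `is_htmlparser_based` are hypothetical helper names, not part of BeautifulSoup, and the rule "3.1.x means HTMLParser-based" is my reading of the changelog entry quoted earlier. How the version string is obtained from the installed package is left out.

```python
import re


def parse_version(version):
    """Return the leading numeric components of a version string.

    Trailing letters such as the "a" in "3.0.7a" are ignored.
    """
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break  # stop at the first component with no leading digits
        parts.append(int(match.group()))
    return tuple(parts)


def is_htmlparser_based(version):
    """Assumption: only the 3.1 series is treated as HTMLParser-based."""
    return parse_version(version)[:2] == (3, 1)


if __name__ == "__main__":
    # Versions that appear in this post.
    for v in ("3.0.7", "3.0.7a", "3.1.0.1"):
        print("%s -> HTMLParser-based: %s" % (v, is_htmlparser_based(v)))
```

A script could call `is_htmlparser_based` at startup and refuse to run (or just warn) when it returns True, so that an accidental upgrade back to 3.1.0.1 is noticed immediately instead of showing up as a parse failure.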