REPLACEMENT CHARACTER - Noriのいろいろ

http://d.hatena.ne.jp/den2sn/20080819/1219075630

from google.appengine.ext import db
w = u'はなもげら'
# Prefix match: every Test entity whose word starts with w
query = db.GqlQuery('SELECT * FROM Test WHERE word >= :1 and word < :2', w, w + u'\uFFFD')

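Isn't this a bit of a brute-force hack? For a start, there is no guarantee that the replacement character has any particular place in the sort order. It may just happen to work because, as a byte sequence, the only things that seem to come after it are FFFE (the BOM, Byte Order Mark; strictly speaking FEFF is the BOM, so FFFE is its endian-swapped form) and FFFF (something unused).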
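One way to probe that doubt is to compare a few candidate strings against the `w + u'\uFFFD'` upper bound directly. The following is a minimal sketch, assuming Python 3 (where strings compare by code point); `candidates`, `in_range`, `by_code_point` and `by_utf8` are names made up for this illustration, and the datastore's real index ordering is not verified here.

```python
prefix = "はなもげら"
upper = prefix + "\uFFFD"  # the upper bound used in the query above

# Hypothetical stored words that all start with the prefix.
candidates = {
    "ordinary suffix": prefix + "です",
    "U+FFFE suffix": prefix + "\uFFFE",
    "U+FFFF suffix": prefix + "\uFFFF",
    "supplementary suffix (U+1F600)": prefix + "\U0001F600",
}


def in_range(word, lower, upper):
    """Mimic the filter `word >= :1 AND word < :2`."""
    return lower <= word < upper


# Comparison by code point (plain Python 3 str comparison).
by_code_point = {
    label: in_range(word, prefix, upper) for label, word in candidates.items()
}

# The same comparison on the UTF-8 encoded bytes.
by_utf8 = {
    label: in_range(
        word.encode("utf-8"), prefix.encode("utf-8"), upper.encode("utf-8")
    )
    for label, word in candidates.items()
}

for label in candidates:
    print(label, by_code_point[label], by_utf8[label])
```

In this sketch only the ordinary suffix falls inside the range. A suffix that begins with U+FFFE, U+FFFF, or a character outside the BMP sorts after U+FFFD under both orderings, so such a word would be missed.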

http://suika.fam.cx/~wakaba/wiki/sw/n/U%2BFFFE

U+FFFE. This noncharacter has the intended peculiarity that, when represented in UTF-16 and then serialized, it has the opposite byte sequence of U+FEFF, the byte order mark. This means that applications should reserve U+FFFE as an internal signal that a UTF-16 text stream is in a reversed byte format. Detection of U+FFFE at the start of an input stream should be taken as a strong indication that the input stream should be byte-swapped before interpretation. For more on the use of the byte order mark and its interaction with the noncharacter U+FFFE, see Section 16.8, Specials.
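The byte-swap detection described in that passage can be sketched roughly as follows. This is an illustration only, assuming Python 3; `sniff_utf16` and `decode_utf16` are hypothetical helper names, not part of any library.

```python
import codecs


def sniff_utf16(data: bytes) -> str:
    """Pick a codec name from the first 16-bit unit, read as big-endian."""
    first_unit = int.from_bytes(data[:2], "big")
    if first_unit == 0xFEFF:
        return "utf-16-be"  # a proper BOM: the stream really is big-endian
    if first_unit == 0xFFFE:
        return "utf-16-le"  # looks like U+FFFE: the stream is byte-swapped
    raise ValueError("no byte order mark found")


def decode_utf16(data: bytes) -> str:
    """Drop the 2-byte mark and decode the rest in the detected byte order."""
    return data[2:].decode(sniff_utf16(data))


sample_le = codecs.BOM_UTF16_LE + "はな".encode("utf-16-le")
sample_be = codecs.BOM_UTF16_BE + "はな".encode("utf-16-be")

print(sniff_utf16(sample_le), decode_utf16(sample_le))
print(sniff_utf16(sample_be), decode_utf16(sample_be))
```

Reading the first two bytes as big-endian is an arbitrary choice made for this sketch; the point is only that seeing FFFE where FEFF was expected signals the opposite byte order.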

U+FFFF and U+10FFFF. These two noncharacter code points have the attribute of being associated with the largest code unit values for particular Unicode encoding forms. In UTF-16, U+FFFF is associated with the largest 16-bit code unit value, FFFF₁₆. U+10FFFF is associated with the largest legal UTF-32 32-bit code unit value, 10FFFF₁₆. This attribute renders these two noncharacter code points useful for internal purposes as sentinels. For example, they might be used to indicate the end of a list, to represent a value in an index guaranteed to be higher than any valid character value, and so on.
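As a rough illustration of the "end of a list" use mentioned there, here is a sketch that merges two sorted lists of strings with U+10FFFF as an internal end marker. It assumes Python 3 and that no input string is equal to or greater than the sentinel; `SENTINEL` and `merge_sorted` are names invented for this example.

```python
SENTINEL = "\U0010FFFF"  # noncharacter; sorts after any string of valid characters


def merge_sorted(a, b):
    """Merge two sorted lists of strings, using SENTINEL as an end marker."""
    a = list(a) + [SENTINEL]
    b = list(b) + [SENTINEL]
    i = j = 0
    merged = []
    # Stop only when both lists have reached their end marker.
    while a[i] != SENTINEL or b[j] != SENTINEL:
        if a[i] <= b[j]:
            merged.append(a[i])
            i += 1
        else:
            merged.append(b[j])
            j += 1
    return merged


print(merge_sorted(["a", "はな"], ["b", "\U0001F600"]))
```

Because the sentinel compares higher than every real value, the loop needs no separate bounds checks: an exhausted list simply never wins a comparison.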

http://www5d.biglobe.ne.jp/~stssk/rfc/rfc3454j.utf8.html

5.6 Inappropriate for plain text
The following characters do not appear in regular text.
FFF9; INTERLINEAR ANNOTATION ANCHOR
FFFA; INTERLINEAR ANNOTATION SEPARATOR
FFFB; INTERLINEAR ANNOTATION TERMINATOR
FFFC; OBJECT REPLACEMENT CHARACTER
Although the replacement character (U+FFFD) might be used when a
string is displayed, it doesn't make sense for it to be part of the
string itself. It is often displayed by renderers to indicate "there
would be some character here, but it cannot be rendered". For
example, on a computer with no Asian fonts, a string with three
ideographs might be rendered with three replacement characters.
FFFD; REPLACEMENT CHARACTER
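Python's standard library ships the RFC 3454 tables in the `stringprep` module, so the list above can be checked mechanically. A small sketch, assuming Python 3; `stringprep.in_table_c6` corresponds to table C.6 of the RFC ("Inappropriate for plain text"), and `listed` and `flagged` are just local names for this example.

```python
import stringprep

# The five code points listed in section 5.6 of RFC 3454.
listed = [0xFFF9, 0xFFFA, 0xFFFB, 0xFFFC, 0xFFFD]

# in_table_c6 takes a single character and reports table C.6 membership.
flagged = {cp: stringprep.in_table_c6(chr(cp)) for cp in listed}

for cp, hit in flagged.items():
    print("U+%04X" % cp, hit)

# An ordinary letter is not in the table.
print(stringprep.in_table_c6("a"))
```

So a stringprep profile that prohibits table C.6 would reject U+FFFD in its input, which fits the RFC's remark that the character is not meant to be part of the string itself.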

http://ja.wikipedia.org/wiki/Unicode%E4%B8%80%E8%A6%A7_F000-FFFF
(See the very bottom of that page: FFFE and FFFF are unused.)

It turns out this was a tip officially endorsed by Google: http://code.google.com/intl/ja/appengine/docs/python/datastore/queriesandindexes.html#Restrictions_on_Queries

Tip: Query filters do not have an explicit way to match just part of a string value, but you can fake a prefix match using inequality filters:
db.GqlQuery("SELECT * FROM MyModel WHERE prop >= :1 AND prop < :2", "abc", u"abc" + u"\ufffd")
This matches every MyModel entity with a string property prop that begins with the characters abc. The unicode string u"\ufffd" represents the largest possible Unicode character. When the property values are sorted in an index, the values that fall in this range are all of the values that begin with the given prefix.
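The range trick in that tip can be tried out without a datastore. Below is a minimal sketch, assuming Python 3, in which a sorted list stands in for the index and `bisect` finds the two range boundaries; `prefix_match` and `words` are hypothetical names, and nothing here calls the App Engine API.

```python
from bisect import bisect_left


def prefix_match(sorted_values, prefix):
    """Return values v with prefix <= v < prefix + U+FFFD, from a sorted list."""
    lo = bisect_left(sorted_values, prefix)             # first v >= prefix
    hi = bisect_left(sorted_values, prefix + "\ufffd")  # first v >= upper bound
    return sorted_values[lo:hi]


# A sorted list playing the role of the property index.
words = sorted(["ab", "abc", "abcd", "abc\u3042", "abd", "b"])

print(prefix_match(words, "abc"))
```

Both boundaries are found by binary search on the sorted order, which is the same reason two inequality filters on one property are enough to express the prefix match.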

If you want full-text search, see: http://d.hatena.ne.jp/matsuza/20080419/1208625514