Mail Archives: djgpp/1997/12/08/09:33:15
On Fri, 5 Dec 1997, ron aaron wrote:
> I have a Hebrew text corpus which I would like to index with mkid/lid. I
> have been able to mkid ok, and lid ".*" dumps all the tokens, but I can't do
> 'lid (hebrew text)'.
If I'm not mistaken, the tokens you see dumped by `lid' do NOT include
Hebrew words. Is that true? If so, that's because ID-Utils do not
treat characters with 8-bit codes between 128 and 192 (beyond ASCII) as word
characters. It should be a simple matter to change ID-Utils and
recompile them so that they do support such characters. (I can
provide the necessary details, if you want.)
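
To illustrate the idea (this is only a sketch, not ID-Utils' actual source, which is written in C), here is a tiny scanner driven by a 256-entry "word character" table. The names `make_word_table` and `tokenize` are hypothetical, and the sample bytes assume the Hebrew text is stored in a DOS code page such as 862, where the letters fall in the 128..192 range. With the default table the Hebrew bytes are silently dropped; once that range is marked as word characters, they come out as a token.

```python
# Illustrative sketch only: NOT the ID-Utils implementation.
# It shows how a scanner's "word character" table decides whether
# 8-bit text ever becomes a token.

def make_word_table(extra_range=None):
    """Return a 256-entry list; table[b] is True if byte b is a word character.

    extra_range is an optional (low, high) pair of byte values, inclusive,
    to treat as word characters in addition to ASCII letters, digits and '_'.
    """
    table = [False] * 256
    for b in range(128):
        ch = chr(b)
        if ch.isalnum() or ch == "_":
            table[b] = True
    if extra_range is not None:
        low, high = extra_range
        for b in range(low, high + 1):
            table[b] = True
    return table


def tokenize(data, table):
    """Split a bytes object into maximal runs of word characters."""
    tokens = []
    current = bytearray()
    for b in data:
        if table[b]:
            current.append(b)
        elif current:
            tokens.append(bytes(current))
            current = bytearray()
    if current:
        tokens.append(bytes(current))
    return tokens


# Assumed sample: two ASCII words around four bytes in the 128..192 range
# (a Hebrew word if the file uses code page 862).
sample = b"foo \x99\x8c\x85\x8d bar"

ascii_only = tokenize(sample, make_word_table())
extended = tokenize(sample, make_word_table((128, 192)))
```

With the ASCII-only table the scanner yields just the two ASCII words, which matches the symptom described above; the extended table also yields the 8-bit word. The corresponding change in ID-Utils itself would be made in its C scanner and requires a recompile.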