[NTLK] Dictionary transplant
larry.yaeger at gmail.com
Thu May 5 16:05:59 PDT 2022
On May 5, 2022, at 6:14 AM, NewtonTalk <newtontalk at pda-soft.de> wrote:
> To the best of my knowledge the eMate can use either character recognition
> or word recognition when trying to decipher one's handwriting. Word
> recognition requires the built-in dictionary. Is this correct so far?
Sort of, but to be more precise, the Print Recognizer always used *both* character recognition and word recognition, though one of the dictionaries was basically "any symbol anywhere," just at a low probability, so you could write outside the dictionaries. (Unlike the first-gen HWR, which only allowed words in dictionaries.) I suspect the second-gen Paragraph / Mixed-Cursive recognizer behaves similarly, but we were never allowed to look at their code.
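To make the idea concrete, here's a tiny sketch of that kind of scoring, in Python. All of it is invented for illustration (the character probabilities, the prior values, the candidate words); the point is just that dictionary words get a large prior and the "any symbol" path gets a small but nonzero one, so out-of-dictionary writing is still possible:

```python
# Hypothetical per-character likelihoods from a character classifier:
# one dict of {letter: probability} per written character.
char_probs = [
    {"c": 0.7, "e": 0.3},
    {"a": 0.6, "o": 0.4},
    {"t": 0.9, "l": 0.1},
]

DICTIONARY = {"cat", "cot", "eat"}
DICT_PRIOR = 0.9       # mass given to dictionary words (made-up number)
WILDCARD_PRIOR = 0.1   # small mass for "any symbol anywhere" (made-up)

def word_score(word, in_dictionary):
    """Combine character-level evidence with a word-level prior."""
    if len(word) != len(char_probs):
        return 0.0
    p = DICT_PRIOR if in_dictionary else WILDCARD_PRIOR
    for ch, probs in zip(word, char_probs):
        p *= probs.get(ch, 0.0)
    return p

candidates = ["cat", "cot", "eal"]  # "eal" is outside the dictionary
scored = {w: word_score(w, w in DICTIONARY) for w in candidates}
best = max(scored, key=scored.get)  # dictionary word wins here, but
                                    # "eal" still has nonzero score
```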
There is also a probability of a word break introduced into the analysis, based on the space between strokes/letters. This may be language independent, hence not an issue for you.
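The shape of that is something like the following sketch (the logistic form and every parameter here are my invention, purely to illustrate "bigger gap, more likely a word break"):

```python
import math

def break_probability(gap, midpoint=12.0, steepness=0.5):
    """Map an inter-stroke gap (in some tablet unit) to a probability
    that a word break occurred. Logistic curve and the midpoint /
    steepness values are made-up tuning parameters, not Newton's."""
    return 1.0 / (1.0 + math.exp(-steepness * (gap - midpoint)))
```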
> Transplanting the dictionary as a whole from a German 2x00's ROM to the
> eMate ROM would make a lot of sense for a German eMate OS. For that I'd need
> to know (at least) two things:
> a) at which address in the ROM the dictionary begins and
> b) how big it is?
> Is the dictionaries' size always the same regardless of the language? Is it
> in one consecutive block, or are dictionary chunks spread all over the ROM?
> Does such a block transfer make sense at all, or would the different word
> lengths prevent the search algorithm from working in the first place?
I'm afraid I can't provide much help on this front. A few tidbits...
None of these dictionaries were simple lists of words, at least not for the Print Recognizer. They were specially compiled letter graphs that shared edges as much as possible.
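For intuition only: a trie shares prefixes between words, and a fully edge-shared letter graph (a DAWG) also merges common suffixes. This little sketch shows just the prefix-sharing half; it is nothing like the actual compiled format, only the underlying idea:

```python
def build_trie(words):
    """Build a nested-dict trie; shared prefixes share nodes/edges.
    (The real Newton dictionaries also shared suffix edges, DAWG-style,
    and were compiled into a compact binary form.)"""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node["$"] = {}  # end-of-word marker
    return root

def contains(trie, word):
    node = trie
    for ch in word:
        if ch not in node:
            return False
        node = node[ch]
    return "$" in node

trie = build_trie(["can", "cat", "cart"])  # "ca" edge stored once
```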
There isn't just one dictionary per language. If I remember right there were common words, less common words, names, and places dictionaries. Though details are fuzzy. Names and places might have been combined. And even though I'm sure we had the dual word lists at one point, I'm not sure that's what shipped.
There are also specially tailored time, date, and telephone number dictionaries. These symbol graphs (mixing numbers, some letters, and some punctuation) started out as hand-crafted, grep-like lists of expressions that were then compiled into the symbol graphs. Different nations nominally express dates, times, and telephone numbers differently, so these were also necessarily specialized for each supported language/region.
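A toy version of that "expression list compiled into a symbol graph" idea, with an invented template syntax (the real compiler handled far richer grep-like expressions than this):

```python
DIGITS = set("0123456789")

def compile_template(template):
    """Compile a template into a chain of allowed-symbol sets, one per
    position: 'D' stands for any digit, everything else is a literal.
    (Illustrative only; not Newton's actual expression language.)"""
    return [DIGITS if ch == "D" else {ch} for ch in template]

def accepts(graph, text):
    """Walk the chain: each written symbol must be in that position's set."""
    return len(text) == len(graph) and all(
        ch in allowed for ch, allowed in zip(text, graph))

# A region-specific pattern, e.g. a German-style numeric date:
date_graph = compile_template("DD.DD.DDDD")
```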
There may be other dictionaries, like a punctuation-specific dictionary.
There was even a separate dictionary of "bad" words. This dictionary contained curse words and the like that could never be offered as an alternative guess (it would have been considered bad form to prompt a person who had written "tuck" suggesting they may have intended to write "fuck"). But these words were treated as just another word by the recognizer, so people could accurately write what they wanted. I'm not sure how complete this dictionary was in any language but English, or how it was compiled, or even if it existed for certain, but I know there was such a thing in English.
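The resulting behavior amounts to something like this sketch (blocklist contents and function names are mine): the top-ranked guess is honored even if it's a "bad" word, but such words are never surfaced as lower-ranked alternatives.

```python
BAD_WORDS = {"fuck"}  # illustrative one-entry blocklist

def present_results(ranked_candidates):
    """Given recognizer candidates ranked best-first, keep the top
    guess even if blocklisted (someone who deliberately wrote it gets
    it), but filter blocklisted words out of the alternatives list."""
    if not ranked_candidates:
        return []
    top, rest = ranked_candidates[0], ranked_candidates[1:]
    return [top] + [w for w in rest if w not in BAD_WORDS]
```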
Because of the compiled, symbol-graph nature of these dictionaries, and the expectation that there would be multiple, specific dictionaries, you'd need to copy all of them and figure out how they are pointed to. I'm afraid I don't recall exactly how we accessed those dictionaries. I *think* there was a NewtonOS API, possibly internal only, that we called to get pointers to these dictionaries. If that code has been identified and parsed you might succeed in what you were hoping to attempt, but without that it would be exceedingly difficult.