omega
Documentation
First I gathered a couple of long text sources, such as the GNU GPL license, Wikipedia articles, or even a book.
These were transformed into a single large text file (see all_words.txt) using the following command:
grep -o "[[:alpha:]]\{1,\}" "path_to_individual_source.txt" | tr '[:upper:]' '[:lower:]' >> all_words.txt
This extracts every run of one or more alphabetic characters as a separate word (one per line) and normalizes them by converting everything to lowercase.
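To process several sources in one go, a loop along these lines works (assuming the raw source files are collected in a sources/ directory, which is a hypothetical layout):

# Append the normalized words from every source file to all_words.txt
for f in sources/*.txt; do
    grep -o "[[:alpha:]]\{1,\}" "$f" | tr '[:upper:]' '[:lower:]' >> all_words.txt
done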
For the model to be as accurate as possible, I calculated the average word length (5.819) and went with a character history of 5 letters. This is the norm for now and can easily be trimmed from the data if it becomes excessive. The average was computed with:
awk '{ total += length; count++ } END { if (count > 0) print total / count }' 1000_words.txt
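To illustrate what a 5-letter history means in practice, each word can be expanded into (5-character context, next character) pairs; this one-liner is only a sketch of the idea, not part of the actual pipeline:

# Hypothetical sketch: print each 5-character context followed by the character it predicts
awk '{ for (i = 1; i + 5 <= length($0); i++) print substr($0, i, 5), substr($0, i + 5, 1) }' all_words.txt

For example, the word "license" yields the pairs "licen s" and "icens e", while words shorter than 6 letters produce no pairs.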
Sources
- Generic news articles
- Wikipedia articles
- Scientific articles (Kurzgesagt)
- License text
- Books