omega

Documentation

First, I gathered a couple of long text sources, such as the GNU GPL license, Wikipedia articles, or even a book.

These were transformed into a large text file (see all_words.txt) using the following command:

grep -o "[[:alpha:]]\{1,\}" "path_to_individual_source.txt" | tr '[:upper:]' '[:lower:]'

This simply finds every word at least one character long and normalizes the words by converting them all to lowercase.
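
For reference, a minimal Python sketch of the same extraction step (the source path is the same placeholder as above, and appending to all_words.txt is an assumption about how the individual sources were combined):

import re

# Read one source text (placeholder path, as in the shell command above).
with open("path_to_individual_source.txt", encoding="utf-8") as f:
    text = f.read()

# Find runs of letters (words of length >= 1) and lowercase them,
# mirroring grep -o "[[:alpha:]]\{1,\}" | tr '[:upper:]' '[:lower:]'.
# Note: this regex covers ASCII letters only, unlike locale-aware [[:alpha:]].
words = [w.lower() for w in re.findall(r"[A-Za-z]+", text)]

# Append one word per line to the combined corpus file.
with open("all_words.txt", "a", encoding="utf-8") as out:
    out.write("\n".join(words) + "\n")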

To give the model as much accuracy as possible, I calculated the average word length (5.819) and went with a character history of 5 letters. This is the norm for now and can easily be omitted from the data if it becomes excessive. The average was computed with:

awk '{ total += length; count++ } END { if (count > 0) print total / count }' 1000_words.txt
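
The same statistic can be checked in Python, together with a sketch of how a 5-letter character history could be cut from the word list (the windowing scheme shown here is an assumption for illustration, not necessarily the notebook's exact code):

# Average word length over a word-per-line file,
# equivalent to the awk one-liner above.
with open("1000_words.txt", encoding="utf-8") as f:
    words = [line.strip() for line in f if line.strip()]

avg_len = sum(len(w) for w in words) / len(words)
print(avg_len)  # ~5.8 for the corpus described above

# Illustrative 5-character history: each sample pairs the previous
# 5 letters with the letter that follows them.
HISTORY = 5
samples = []
for w in words:
    for i in range(len(w) - HISTORY):
        samples.append((w[i:i + HISTORY], w[i + HISTORY]))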

Sources

  1. Generic news articles

  2. Wikipedia articles

  3. Scientific articles (Kurzgesagt)

  4. License text

  5. Books
