omega
Documentation
First I gathered a couple of long text sources, such as the GNU GPL license, Wikipedia articles, or even a book.
These were transformed into a single large text file (see all_words.txt) using the following command:
grep -o "[[:alpha:]]\{1,\}" "path_to_individual_source.txt" | tr '[:upper:]' '[:lower:]' >> all_words.txt
This extracts every run of one or more alphabetic characters as a separate word (one per line) and normalizes them by converting everything to lowercase.
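To process several sources in one go, a loop along these lines works (assuming the raw source files are collected in a sources/ directory, which is a hypothetical layout):

# Append the normalized words from every source file to all_words.txt
for f in sources/*.txt; do
    grep -o "[[:alpha:]]\{1,\}" "$f" | tr '[:upper:]' '[:lower:]' >> all_words.txt
done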
For the model to be as accurate as possible, I calculated the average word length (5.819) and went with a character history of 5 letters. This is the norm for now and can easily be trimmed from the data if it becomes excessive. The average was computed with:
awk '{ total += length; count++ } END { if (count > 0) print total / count }' 1000_words.txt
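To illustrate what a 5-letter history means in practice, each word can be expanded into (5-character context, next character) pairs; this one-liner is only a sketch of the idea, not part of the actual pipeline:

# Hypothetical sketch: print each 5-character context followed by the character it predicts
awk '{ for (i = 1; i + 5 <= length($0); i++) print substr($0, i, 5), substr($0, i + 5, 1) }' all_words.txt

For example, the word "license" yields the pairs "licen s" and "icens e", while words shorter than 6 letters produce no pairs.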
Sources
- Generic news articles
- Wikipedia articles
- Scientific articles (Kurzgesagt)
- License text
- Books