# omega

## Documentation

First, I gathered a couple of long text sources, such as the GNU GPL license, Wikipedia articles, and even a book. Those were transformed into a large text file ([see all_words.txt](data/all_words.txt)) using the following command:

```
grep -o "[[:alpha:]]\{1,\}" "path_to_individual_source.txt" | tr '[:upper:]' '[:lower:]'
```

This simply finds words at least one character long and unifies them by transforming them all to lowercase. (A combined pipeline over several sources is sketched in the Examples section at the end of this document.)

For the model to be as accurate as possible, I calculated the average word length (5.819) and went with a character history of 5 letters. This is the norm for now and can easily be omitted from the data if it becomes excessive. (A sketch of how such histories can be extracted also appears in the Examples section.) The average was computed with:

```
awk '{ total += length; count++ } END { if (count > 0) print total / count }' 1000_words.txt
```

## Sources

1. Generic news articles
   - https://edition.cnn.com/2025/03/20/middleeast/ronen-bar-shin-bet-israel-vote-dismiss-intl-latam/index.html
   - https://edition.cnn.com/2025/03/21/europe/conor-mcgregor-ireland-president-election-intl-hnk/index.html
2. Wikipedia articles
   - https://simple.wikipedia.org/wiki/Dog
   - https://en.wikipedia.org/wiki/Car
3. Scientific articles ([Kurzgesagt](https://www.youtube.com/@kurzgesagt/videos))
   - https://www.youtube.com/watch?v=dCiMUWw1BBc&t=766s
   - https://news.umich.edu/astronomers-find-surprising-ice-world-in-the-habitable-zone-with-jwst-data/
   - https://www.youtube.com/watch?v=VD6xJq8NguY
   - https://www.pnas.org/doi/10.1073/pnas.1711842115
4. License text
   - https://www.gnu.org/licenses/gpl-3.0.en.html
   - https://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html
5. Books
   - https://ia902902.us.archive.org/19/items/diaryofawimpykidbookseriesbyjeffkinney_202004/Diary%20of%20a%20wimpy%20kid%20book02%20rodrick%20rules.pdf
   - https://drive.google.com/file/d/1b1Etdxb1cNU3PvDBQnYh0bCAAfssMi8b/view
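
## Examples

For reference, here is a minimal sketch of how the per-source command from the Documentation section could be run over several files and concatenated into `data/all_words.txt`. The filenames are placeholders, not the actual names of the downloaded sources:

```
# Placeholder filenames; substitute the texts downloaded from the sources above.
for src in gpl-3.0.txt wikipedia_dog.txt rodrick_rules.txt; do
    grep -o "[[:alpha:]]\{1,\}" "$src" | tr '[:upper:]' '[:lower:]'
done > data/all_words.txt
```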
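
Since `all_words.txt` holds one lowercase word per line, the 5-letter character histories can be read off as sliding windows over each word. The following one-liner is an illustration of that idea under my assumptions about the data layout, not a command from the project itself; it prints each 5-character history together with the character it should predict (words shorter than 6 letters yield nothing):

```
# For every word, emit each 5-character window and the character that follows it.
awk '{ for (i = 1; i + 5 <= length($0); i++) print substr($0, i, 5), substr($0, i + 5, 1) }' data/all_words.txt
```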