
# omega

## Documentation
First, I gathered a few long text sources, such as the GNU GPL license text, Wikipedia articles, and even a book.

Those were transformed into a single large text file (see [all_words.txt](data/all_words.txt)) by running the following command on each source:

```
grep -o "[[:alpha:]]\{1,\}" "path_to_individual_source.txt" | tr '[:upper:]' '[:lower:]'
```

This finds every run of letters at least 1 character long, prints each one as a word on its own line, and normalizes it to lowercase.
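
To make the effect of the command concrete, here is a minimal sketch that wraps the same pipeline in a function and runs it on an inline sample. The function name `extract_words` and the sample sentence are illustrative assumptions, not part of the project.

```shell
# Hypothetical helper: read text on stdin, print one lowercase word per line.
# Same grep/tr pipeline as above, but reading stdin instead of a file.
extract_words() {
    grep -o "[[:alpha:]]\{1,\}" | tr '[:upper:]' '[:lower:]'
}

# Run it on an inline sample instead of a real source file.
# Prints, one per line: the gnu general public license version
printf 'The GNU General Public License, version 3!\n' | extract_words
```

Digits and punctuation are dropped, so a contraction such as `don't` comes out as two words, `don` and `t`. Assuming the raw sources sit in a hypothetical `sources/` directory, something like `cat sources/*.txt | extract_words > data/all_words.txt` would produce the combined file.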

To make the model as accurate as possible, I calculated the average word length (5.819) and settled on a character history of 5 letters. This is the default for now, and the history can easily be omitted from the data if it proves excessive. The average was computed with:
```
awk '{ total += length; count++ } END { if (count > 0) print total / count }' 1000_words.txt
```
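
As a rough illustration of both steps, the sketch below computes the average word length on a tiny inline sample and then shows one way a 5-letter history could be cut from each word. The helper names `avg_word_length` and `history_pairs`, the `-` placeholder for an empty history, and the assumption that "history" means the letters preceding the one being predicted are all mine; the actual data layout used by the model may differ.

```shell
# Hypothetical helper: average length of the words on stdin (one per line).
avg_word_length() {
    awk '{ total += length; count++ } END { if (count > 0) print total / count }'
}

# Hypothetical helper: for every letter of every word on stdin, print
# "<history> <letter>", where <history> is at most n preceding letters.
# An empty history (first letter of a word) is printed as "-".
history_pairs() {
    awk -v n="${1:-5}" '{
        for (i = 1; i <= length($0); i++) {
            start = (i - n > 1) ? i - n : 1
            hist = substr($0, start, i - start)
            if (hist == "") hist = "-"
            print hist, substr($0, i, 1)
        }
    }'
}

# Average of a 2-letter and a 4-letter word: prints 3
printf 'ab\nabcd\n' | avg_word_length

# History/next-letter pairs for one word; the last line is "enera l"
printf 'general\n' | history_pairs 5
```

With a history of 5, only words longer than 5 letters ever fill the whole window; shorter words produce shorter histories.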

## Sources
1. Generic news articles
    - https://edition.cnn.com/2025/03/20/middleeast/ronen-bar-shin-bet-israel-vote-dismiss-intl-latam/index.html
    - https://edition.cnn.com/2025/03/21/europe/conor-mcgregor-ireland-president-election-intl-hnk/index.html

2. Wikipedia articles
    - https://simple.wikipedia.org/wiki/Dog
    - https://en.wikipedia.org/wiki/Car

3. Scientific articles ([Kurzgesagt](https://www.youtube.com/@kurzgesagt/videos))
    - https://www.youtube.com/watch?v=dCiMUWw1BBc&t=766s
        - https://news.umich.edu/astronomers-find-surprising-ice-world-in-the-habitable-zone-with-jwst-data/
    - https://www.youtube.com/watch?v=VD6xJq8NguY
        - https://www.pnas.org/doi/10.1073/pnas.1711842115

4. License text
    - https://www.gnu.org/licenses/gpl-3.0.en.html
    - https://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html

5. Books
    - https://ia902902.us.archive.org/19/items/diaryofawimpykidbookseriesbyjeffkinney_202004/Diary%20of%20a%20wimpy%20kid%20book02%20rodrick%20rules.pdf
    - https://drive.google.com/file/d/1b1Etdxb1cNU3PvDBQnYh0bCAAfssMi8b/view