omega/model/README.md
2025-03-31 23:58:19 +02:00


# omega
## Documentation
First, I gathered a few long text sources, such as the GNU GPL license, Wikipedia articles, and even a book.
These were transformed into one large text file (see [all_words.txt](data/all_words.txt)) using the following command:
```
grep -o "[[:alpha:]]\{1,\}" "path_to_individual_source.txt" | tr '[:upper:]' '[:lower:]'
```
This extracts every run of one or more alphabetic characters, prints one word per line, and normalizes the words by converting them to lowercase.
To make the model as accurate as possible, I calculated the average word length (5.819) and chose a character history of 5 letters. This is the norm for now and can easily be omitted from the data if it becomes excessive. The average was computed with:
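To process several sources in one pass, the same pipeline can be wrapped in a small shell function. This is only a sketch: the function name `extract_words` and the `sources/` directory mentioned below are assumptions for illustration, not part of the repository.

```sh
# Hypothetical helper (not part of the repository): print every alphabetic
# word from the given files, or from stdin, one per line and in lowercase.
extract_words() {
    # cat joins all inputs first, so grep never prefixes matches with file names
    cat "$@" | grep -o "[[:alpha:]]\{1,\}" | tr '[:upper:]' '[:lower:]'
}

# Small demonstration on inline text instead of a real source file
printf 'The GNU General Public License, version 3\n' | extract_words
```

With the sources stored as plain text files, something like `extract_words sources/*.txt > data/all_words.txt` would then produce the combined word list. Digits and punctuation are dropped, since only alphabetic runs match.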
```
awk '{ total += length; count++ } END { if (count > 0) print total / count }' 1000_words.txt
```
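To illustrate what a character history of 5 letters could look like in the data, the sketch below turns each word into (history, next letter) pairs. The exact data format the model uses is not described here, so the function name `history_pairs`, the tab-separated output, and the handling of short histories at the start of a word are all assumptions.

```sh
# Hypothetical sketch: for each word on stdin, print one line per letter,
# containing up to n preceding letters, a tab, and the letter that follows.
# The output layout is an assumption, not the model's confirmed input format.
history_pairs() {
    awk -v n=5 '{
        for (i = 1; i <= length($0); i++) {
            start = i - n
            if (start < 1) start = 1          # fewer than n letters seen so far
            print substr($0, start, i - start) "\t" substr($0, i, 1)
        }
    }'
}

# Demonstration on a single word
printf 'license\n' | history_pairs
```

For `license`, the first line has an empty history followed by `l`, and the last line pairs the history `icens` with the letter `e`. Changing `n` would make it easy to compare other history lengths against the average word length.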
## Sources
1. Generic news articles
   - https://edition.cnn.com/2025/03/20/middleeast/ronen-bar-shin-bet-israel-vote-dismiss-intl-latam/index.html
   - https://edition.cnn.com/2025/03/21/europe/conor-mcgregor-ireland-president-election-intl-hnk/index.html
2. Wikipedia articles
   - https://simple.wikipedia.org/wiki/Dog
   - https://en.wikipedia.org/wiki/Car
3. Scientific articles ([Kurzgesagt](https://www.youtube.com/@kurzgesagt/videos))
   - https://www.youtube.com/watch?v=dCiMUWw1BBc&t=766s
   - https://news.umich.edu/astronomers-find-surprising-ice-world-in-the-habitable-zone-with-jwst-data/
   - https://www.youtube.com/watch?v=VD6xJq8NguY
   - https://www.pnas.org/doi/10.1073/pnas.1711842115
4. License text
   - https://www.gnu.org/licenses/gpl-3.0.en.html
   - https://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html
5. Books
   - https://ia902902.us.archive.org/19/items/diaryofawimpykidbookseriesbyjeffkinney_202004/Diary%20of%20a%20wimpy%20kid%20book02%20rodrick%20rules.pdf
   - https://drive.google.com/file/d/1b1Etdxb1cNU3PvDBQnYh0bCAAfssMi8b/view