# omega

## Documentation

First, I gathered a few long text sources, such as the GNU GPL license text, Wikipedia articles, and even a book.

These were transformed into one large text file (see [all_words.txt](data/all_words.txt)) by running the following command on each source:

```
grep -o "[[:alpha:]]\{1,\}" "path_to_individual_source.txt" | tr '[:upper:]' '[:lower:]'
```

This extracts every word, meaning every run of one or more alphabetic characters, and normalizes it to lowercase.
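
To make the effect of the pipeline concrete, here is a minimal sketch that wraps it in a function and applies it to a sample sentence and to a whole directory of sources. The names `extract_words` and `build_word_list`, and the idea of keeping all raw sources as `.txt` files in one directory, are assumptions for illustration and not part of the repository.

```shell
# extract_words: hypothetical helper wrapping the pipeline above.
# Reads text on stdin, writes one lowercase word per line on stdout.
extract_words() {
    grep -o "[[:alpha:]]\{1,\}" | tr '[:upper:]' '[:lower:]'
}

# build_word_list: hypothetical helper, assumes raw sources are .txt files.
#   $1: directory holding the raw sources
#   $2: output file for the combined word list
build_word_list() {
    cat "$1"/*.txt | extract_words > "$2"
}

# Digits and punctuation are dropped, so this prints
# the, gnu, gpl, version (one word per line).
printf 'The GNU GPL, version 3!\n' | extract_words
```

With such a layout, something like `build_word_list sources data/all_words.txt` would rebuild the combined file. Note that the pattern only matches letters, so a contraction such as "don't" is split into `don` and `t`.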

To make the model as accurate as possible, I calculated the average word length (5.819) and settled on a character history of 5 letters. This is the default for now, and the history can easily be omitted from the data if it becomes excessive. The average was computed with the following command:

```
awk '{ total += length; count++ } END { if (count > 0) print total / count }' 1000_words.txt
```
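
As a rough way to sanity-check the choice of 5, the sketch below reports the average length together with the share of words that fit entirely within a history of `n` characters. The helper name `length_stats` and the "fits" metric are assumptions added for illustration; the project itself only computes the average.

```shell
# length_stats: hypothetical helper. Reads one word per line on stdin and
# prints the average word length plus the share of words whose length is
# at most n characters (n defaults to 5).
length_stats() {
    awk -v n="${1:-5}" '
        { total += length($0); if (length($0) <= n) fits++ }
        END {
            if (NR > 0)
                printf "avg=%.3f fits=%.3f\n", total / NR, fits / NR
        }'
}

# Lengths are 3, 3, 7, 6, 7: average 5.2, and 2 of 5 words fit in 5 letters.
printf 'the\ngnu\ngeneral\npublic\nlicense\n' | length_stats 5
# prints: avg=5.200 fits=0.400
```

Running `length_stats 5 < data/all_words.txt` would give the same figures for the full word list, assuming it holds one word per line.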

## Sources

1. Generic news articles
   - https://edition.cnn.com/2025/03/20/middleeast/ronen-bar-shin-bet-israel-vote-dismiss-intl-latam/index.html
   - https://edition.cnn.com/2025/03/21/europe/conor-mcgregor-ireland-president-election-intl-hnk/index.html
2. Wikipedia articles
   - https://simple.wikipedia.org/wiki/Dog
   - https://en.wikipedia.org/wiki/Car
3. Scientific articles ([Kurzgesagt](https://www.youtube.com/@kurzgesagt/videos))
   - https://www.youtube.com/watch?v=dCiMUWw1BBc&t=766s
   - https://news.umich.edu/astronomers-find-surprising-ice-world-in-the-habitable-zone-with-jwst-data/
   - https://www.youtube.com/watch?v=VD6xJq8NguY
   - https://www.pnas.org/doi/10.1073/pnas.1711842115
4. License text
   - https://www.gnu.org/licenses/gpl-3.0.en.html
   - https://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html
5. Books
   - https://ia902902.us.archive.org/19/items/diaryofawimpykidbookseriesbyjeffkinney_202004/Diary%20of%20a%20wimpy%20kid%20book02%20rodrick%20rules.pdf
   - https://drive.google.com/file/d/1b1Etdxb1cNU3PvDBQnYh0bCAAfssMi8b/view