# omega

## Documentation
First, I gathered several long text sources, such as the GNU GPL license text, Wikipedia articles, and even a book.

These were transformed into one large text file (see [all_words.txt](data/all_words.txt)) by running the following command on each source:

```
grep -o "[[:alpha:]]\{1,\}" "path_to_individual_source.txt" | tr '[:upper:]' '[:lower:]'
```

This extracts every run of one or more alphabetic characters, printing one word per line, and normalizes the words by converting them all to lowercase. Digits and punctuation are dropped.

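To make the effect of the pipeline concrete, here is a small sketch that runs it on inline sample text instead of a source file. The function name `extract_words` and the sample sentences are hypothetical and only serve the illustration; the `grep` and `tr` invocations are the ones shown above.

```shell
#!/bin/sh
# Hypothetical helper: the same grep | tr pipeline, reading from stdin.
extract_words() {
    grep -o "[[:alpha:]]\{1,\}" | tr '[:upper:]' '[:lower:]'
}

# Punctuation and digits are not alphabetic, so they never appear in the output.
printf 'The GNU GPL, version 3!\n' | extract_words
# the
# gnu
# gpl
# version

# An apostrophe also splits a word into two pieces.
printf "Don't stop\n" | extract_words
# don
# t
# stop
```

Because an apostrophe is not alphabetic, contractions end up as two separate entries, which is worth keeping in mind when looking at very short words in the resulting file.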
To make the model as accurate as possible, I calculated the average word length (5.819) and chose a character history of 5 letters. This is the default for now, and the history can easily be trimmed from the data if it turns out to be excessive. The average was computed with:
```
awk '{ total += length; count++ } END { if (count > 0) print total / count }' 1000_words.txt
```
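As a rough illustration, the sketch below applies the same averaging to a tiny inline word list and then shows one way a history of up to 5 letters could be paired with the letter that follows it. The function names `average_length` and `history_pairs`, the tab-separated output, and the choice to leave short histories unpadded are all assumptions made for this example; the project's actual data format may differ.

```shell
#!/bin/sh
# Hypothetical helper: the averaging command from above, reading from stdin.
average_length() {
    awk '{ total += length; count++ } END { if (count > 0) print total / count }'
}

# Hypothetical helper: for every letter of every word, print the preceding
# letters (at most n of them), a tab, and then the letter itself.
# The output format is an assumption, not the project's confirmed format.
history_pairs() {
    awk -v n="$1" '{
        for (i = 1; i <= length($0); i++) {
            start = (i > n) ? i - n : 1
            print substr($0, start, i - start) "\t" substr($0, i, 1)
        }
    }'
}

# (3 + 3 + 3 + 7) / 4 = 4
printf 'the\ngnu\ngpl\nversion\n' | average_length
# 4

printf 'version\n' | history_pairs 5
# <empty>  v
# v        e
# ve       r
# ver      s
# vers     i
# versi    o
# ersio    n
```

With a window of 5, only the last letter of `version` sees a history that had to drop an earlier letter, which matches the idea of sizing the history close to the average word length.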

## Sources
1. Generic news articles
    - https://edition.cnn.com/2025/03/20/middleeast/ronen-bar-shin-bet-israel-vote-dismiss-intl-latam/index.html
    - https://edition.cnn.com/2025/03/21/europe/conor-mcgregor-ireland-president-election-intl-hnk/index.html

2. Wikipedia articles
    - https://simple.wikipedia.org/wiki/Dog
    - https://en.wikipedia.org/wiki/Car

3. Scientific articles ([Kurzgesagt](https://www.youtube.com/@kurzgesagt/videos))
    - https://www.youtube.com/watch?v=dCiMUWw1BBc&t=766s
        - https://news.umich.edu/astronomers-find-surprising-ice-world-in-the-habitable-zone-with-jwst-data/
    - https://www.youtube.com/watch?v=VD6xJq8NguY
        - https://www.pnas.org/doi/10.1073/pnas.1711842115

4. License text
    - https://www.gnu.org/licenses/gpl-3.0.en.html
    - https://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html

5. Books
    - https://ia902902.us.archive.org/19/items/diaryofawimpykidbookseriesbyjeffkinney_202004/Diary%20of%20a%20wimpy%20kid%20book02%20rodrick%20rules.pdf
    - https://drive.google.com/file/d/1b1Etdxb1cNU3PvDBQnYh0bCAAfssMi8b/view