# Model and training
This directory contains the code for training and modifying the model used by this app.
## Training
There are two notebooks related to training:
- [Multilayer Perceptron](./mlp.ipynb)
- [Logistic Regression model](./logistic.ipynb)

The MLP proved to be far more accurate, so it is the one used.
## Data
See [Sources](#sources) for all data used.
The data consists of news articles, scientific articles, a couple of books, and Wikipedia articles.
The text was extracted by simply copying it (Ctrl+C, Ctrl+V), so it includes some unwanted garbage such as numbers, Wikipedia links, etc.
Processing then happens in three steps:
1. [`words.sh`](./words.sh) extracts only alphanumeric strings and places each one on a new line.
2. [`clear.sh`](./clear.sh) cleans the data completely by allowing only the 26 English alphabet characters.
3. [`transform.py`](./transform.py) turns the data into a numpy array for space efficiency (instead of CSV).

This is the last processing step; the model can now be trained on this data.
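The cleaning and encoding steps can be sketched in Python. The regex and the letter-to-integer mapping below are assumptions for illustration, not the exact behavior of the scripts:

```python
import re
import numpy as np

def clean_words(raw_text):
    # Keep only runs of the 26 lowercase English letters, one word per entry
    # (roughly what words.sh + clear.sh do, as described above).
    return re.findall(r"[a-z]+", raw_text.lower())

def encode_words(words):
    # Map each letter to an integer (assumed encoding: a=0 ... z=25) and
    # store each word as a compact numpy array instead of CSV text.
    return [np.array([ord(c) - ord("a") for c in w], dtype=np.uint8) for w in words]

words = clean_words("The dog barked. 42 cars -- see [1].")
arrays = encode_words(words)
```

Storing integer indices as `uint8` arrays is what makes the numpy format much smaller than a CSV of the same data.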
## Structure
The current model is a **character-level** predictor that uses the previous 10 characters to predict the next one.
It was trained on the processed word list.
The dataset can be found in [`data/all_cleaned_words.txt`](data/all_cleaned_words.txt).
- **Model type**: **`Multilayer Perceptron`**
- **Input shape**: **`10 * 16`** - the 10 previous characters as 16-dimensional embeddings
- **Output shape**: **`26`** - one probability for each letter of the English alphabet
- **Dataset**: **`220k`** words from various types of sources, mostly books
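The shapes above can be sketched as follows; the padding scheme and exact windowing are assumptions for illustration, not necessarily what the notebook does:

```python
import numpy as np

CONTEXT = 10   # previous characters used to predict the next one
EMB_DIM = 16   # embedding dimensions per character
PAD = 26       # assumed padding index for positions before the word starts

def build_examples(word):
    # Slide a window over the word: each training example pairs the
    # 10 previous character indices with the index of the next character.
    xs, ys = [], []
    context = [PAD] * CONTEXT
    for c in word:
        i = ord(c) - ord("a")
        xs.append(list(context))
        ys.append(i)
        context = context[1:] + [i]
    return np.array(xs), np.array(ys)

X, y = build_examples("dog")           # X: (3, 10), y: (3,)
emb = np.random.randn(27, EMB_DIM)     # 26 letters + the assumed padding token
inputs = emb[X].reshape(len(X), -1)    # (3, 160) -- the 10 * 16 input shape
```

The flattened `10 * 16 = 160`-dimensional vector is what the MLP consumes; its output layer produces 26 logits, one per letter.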
## Sources
1. Generic news articles
- https://edition.cnn.com/2025/03/20/middleeast/ronen-bar-shin-bet-israel-vote-dismiss-intl-latam/index.html
- https://edition.cnn.com/2025/03/21/europe/conor-mcgregor-ireland-president-election-intl-hnk/index.html
2. Wikipedia articles
- https://simple.wikipedia.org/wiki/Dog
- https://en.wikipedia.org/wiki/Car
3. Scientific articles and videos ([Kurzgesagt](https://www.youtube.com/@kurzgesagt/videos))
- https://www.youtube.com/watch?v=dCiMUWw1BBc&t=766s
- https://news.umich.edu/astronomers-find-surprising-ice-world-in-the-habitable-zone-with-jwst-data/
- https://www.youtube.com/watch?v=VD6xJq8NguY
- https://www.pnas.org/doi/10.1073/pnas.1711842115
4. License text
- https://www.gnu.org/licenses/gpl-3.0.en.html
- https://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html
5. Books
- https://drive.google.com/file/d/1b1Etdxb1cNU3PvDBQnYh0bCAAfssMi8b/view
- https://dhspriory.org/kenny/PhilTexts/Camus/Myth%20of%20Sisyphus-.pdf
- https://www.matermiddlehigh.org/ourpages/auto/2012/11/16/50246772/Beloved.pdf