# Model and training

This section of the code covers training and modifying the model used by this app.

## Training

There are two notebooks related to training:

- [Multilayer Perceptron](./mlp.ipynb)
- [Logistic Regression model](./logistic.ipynb)

The MLP proved far more accurate, so it is the one used in the app.

## Data
See [Sources](#sources) for all of the data used.

The data includes news articles, scientific articles, a couple of books, and Wikipedia articles.

The text was extracted by simply copying it (Ctrl+C, Ctrl+V), which also picked up some unwanted garbage such as numbers, Wikipedia links, etc.

The next step in data processing was running the included scripts, mainly [`words.sh`](./words.sh), which extracts only the alphanumeric strings and places each one on its own line.
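
As a rough illustration, the same extraction can be sketched in Python; the actual [`words.sh`](./words.sh) is a shell script and may differ in details:

```python
import re
import sys

# Print every alphanumeric run from the raw copied text on its own line,
# mirroring what words.sh does.
for line in sys.stdin:
    for token in re.findall(r"[A-Za-z0-9]+", line):
        print(token)
```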

The second step was cleaning the data completely by allowing only the 26 letters of the English alphabet to remain. This is done by [`clear.sh`](./clear.sh).
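
A minimal Python sketch of the same cleaning rule (the real [`clear.sh`](./clear.sh) may differ, for example in how it handles uppercase letters):

```python
import sys

# Keep only words made up entirely of the letters a-z; everything else
# is dropped. Lowercasing first is an assumption, not the script's
# documented behavior.
for line in sys.stdin:
    word = line.strip().lower()
    if word and all("a" <= ch <= "z" for ch in word):
        print(word)
```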

The third and last step was turning the data into a numpy array (instead of CSV) for space efficiency. This is done by [`transform.py`](./transform.py). This is the last step of data processing, and the model can now be trained on this data.
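
A hypothetical sketch of this conversion; the real [`transform.py`](./transform.py) may encode and store the words differently:

```python
import numpy as np

# Read the cleaned word list and encode each letter as an integer 0-25.
with open("data/all_cleaned_words.txt") as f:
    words = f.read().split()

# Pad every word to the same length with -1 so the whole dataset fits in
# one compact int8 array (the padding value and output filename are
# assumptions for this sketch).
max_len = max(len(w) for w in words)
arr = np.full((len(words), max_len), -1, dtype=np.int8)
for i, w in enumerate(words):
    arr[i, : len(w)] = [ord(ch) - ord("a") for ch in w]

np.save("data/words.npy", arr)
```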

## Structure

The current model is a **character-level** predictor that uses the previous 10 characters to predict the next one.
It was trained using the processed word list.
The dataset can be found in [`data/all_cleaned_words.txt`](data/all_cleaned_words.txt).
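
To make the context windows concrete, here is a hedged sketch of how (context, next-character) training pairs could be built from a word; the notebook may do this differently, and the padding code 26 is an assumption:

```python
CONTEXT = 10  # number of previous characters fed to the model
PAD = 26      # assumed code for "before the start of the word"

def make_pairs(word: str):
    """Yield (10-character context, next-character) pairs for one word."""
    context = [PAD] * CONTEXT
    for ch in word:
        target = ord(ch) - ord("a")  # a=0 ... z=25
        yield list(context), target
        context = context[1:] + [target]  # slide the window forward

print(list(make_pairs("dog")))
# First pair: ([26]*10, 3) - all padding predicts 'd'
```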

- **Model type**: **`Multilayer Perceptron`** (a sketch follows below)
- **Input shape**: **`10 * 16`** - the 10 previous characters as embeddings of 16 dimensions
- **Output shape**: **`26`** - one probability for each letter of the English alphabet
- **Dataset**: **`220k`** words from various types of sources, though mainly books
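
For orientation, a minimal PyTorch sketch matching these shapes; the hidden-layer size, the padding token, and the use of PyTorch itself are assumptions, so see [mlp.ipynb](./mlp.ipynb) for the real architecture:

```python
import torch
import torch.nn as nn

class CharMLP(nn.Module):
    """Character-level MLP: 10 previous characters -> next-letter logits."""

    def __init__(self, vocab=27, emb_dim=16, context=10, hidden=128):
        super().__init__()
        # 26 letters plus one assumed padding token, embedded in 16 dims.
        self.emb = nn.Embedding(vocab, emb_dim)
        self.net = nn.Sequential(
            nn.Flatten(),                          # (B, 10, 16) -> (B, 160)
            nn.Linear(context * emb_dim, hidden),  # hidden size is a guess
            nn.ReLU(),
            nn.Linear(hidden, 26),                 # one logit per letter a-z
        )

    def forward(self, idx):  # idx: (B, 10) integer character codes
        # Softmax over the returned logits gives the 26 letter probabilities.
        return self.net(self.emb(idx))

model = CharMLP()
logits = model(torch.randint(0, 27, (4, 10)))  # batch of 4 contexts
print(logits.shape)                            # torch.Size([4, 26])
```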

## Sources

1. Generic news articles

- https://edition.cnn.com/2025/03/20/middleeast/ronen-bar-shin-bet-israel-vote-dismiss-intl-latam/index.html
- https://edition.cnn.com/2025/03/21/europe/conor-mcgregor-ireland-president-election-intl-hnk/index.html

2. Wikipedia articles

- https://simple.wikipedia.org/wiki/Dog
- https://en.wikipedia.org/wiki/Car

3. Scientific articles ([Kurzgesagt](https://www.youtube.com/@kurzgesagt/videos))

- https://www.youtube.com/watch?v=dCiMUWw1BBc&t=766s
- https://news.umich.edu/astronomers-find-surprising-ice-world-in-the-habitable-zone-with-jwst-data/
- https://www.youtube.com/watch?v=VD6xJq8NguY
- https://www.pnas.org/doi/10.1073/pnas.1711842115

4. License text

- https://www.gnu.org/licenses/gpl-3.0.en.html
- https://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html

5. Books

- https://drive.google.com/file/d/1b1Etdxb1cNU3PvDBQnYh0bCAAfssMi8b/view
- https://dhspriory.org/kenny/PhilTexts/Camus/Myth%20of%20Sisyphus-.pdf
- https://www.matermiddlehigh.org/ourpages/auto/2012/11/16/50246772/Beloved.pdf