Model and training
This part of the repository contains the code for training and modifying the model used by the app.
Training
There are two notebooks related to training.
The MLP proved to be far more accurate, so it is the model the app uses.
Data
See Sources for all data used.
The data comprises news articles, scientific articles, a couple of books, and Wikipedia articles.
The text was extracted by simply copying it (Ctrl+C, Ctrl+V), which brought along some unwanted garbage such as numbers, Wikipedia link markers, etc.
The first step in processing the data uses the included scripts, mainly words.sh,
which extracts only alphanumeric strings and places each one on its own line.
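The extraction itself lives in words.sh; the Python below is only a minimal sketch of the same idea. The file names (raw_text.txt, words.txt) are illustrative, not taken from the repository.

```python
import re

# Read the raw copy-pasted text and keep only the alphanumeric runs,
# writing one token per line -- the same effect words.sh aims for.
# File names are illustrative.
with open("raw_text.txt", encoding="utf-8") as f:
    raw = f.read()

tokens = re.findall(r"[A-Za-z0-9]+", raw)

with open("words.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(tokens))
```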
The second step cleans the data completely by allowing only the 26 characters of the English alphabet. This is done by clear.sh.
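A minimal Python sketch of this filtering step, assuming the cleaning also lowercases the text (the real logic is in clear.sh; lowercasing and file names are assumptions):

```python
import string

ALLOWED = set(string.ascii_lowercase)

# Lowercase each word and keep it only if every character is one of the
# 26 English letters. File names are illustrative.
with open("words.txt", encoding="utf-8") as f:
    words = [line.strip().lower() for line in f]

cleaned = [w for w in words if w and set(w) <= ALLOWED]

with open("cleaned_words.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(cleaned))
```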
The third and final step turns the data into a numpy array for space efficiency (instead of CSV). This is done by transform.py. The model can then be trained on this data.
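The exact array layout is defined by transform.py and may differ from the sketch below, which only illustrates the idea of storing letter indices as a compact numpy array instead of CSV (file names are illustrative):

```python
import numpy as np

# Map each letter to an index 0-25 and save the result as a numpy array.
# The real transform.py may use a different layout; this is a sketch.
with open("cleaned_words.txt", encoding="utf-8") as f:
    words = [w.strip() for w in f if w.strip()]

encoded = [np.frombuffer(w.encode("ascii"), dtype=np.uint8) - ord("a")
           for w in words]

# Object array of per-word index vectors; allow_pickle is required
# because the words have varying lengths.
np.save("words.npy", np.array(encoded, dtype=object), allow_pickle=True)
```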
Structure
The current model is a character-level predictor that uses the previous 10 characters to predict the next one.
It was trained on the processed word list (the dataset can be found in data/all_cleaned_words.txt).
A sketch of the architecture follows the list below.
- Model type: Multilayer Perceptron
- Input shape: 10 * 16 (10 previous characters as 16-dimensional embeddings)
- Output shape: 26 (one probability for each letter of the English alphabet)
- Dataset: 220k words from various types of sources, mainly books
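The actual network is defined in the training notebooks; the PyTorch sketch below only illustrates the shape of the described architecture. The framework, the hidden-layer size (256), and the class name are assumptions, not taken from the repository.

```python
import torch
import torch.nn as nn

# Hypothetical sketch: 10 previous character indices -> 16-dim embeddings
# -> flattened 10 * 16 = 160 vector -> MLP -> 26 letter probabilities.
class CharMLP(nn.Module):
    def __init__(self, context=10, emb_dim=16, hidden=256, n_letters=26):
        super().__init__()
        self.emb = nn.Embedding(n_letters, emb_dim)    # 26 letters -> 16 dims
        self.net = nn.Sequential(
            nn.Flatten(),                              # (batch, 10, 16) -> (batch, 160)
            nn.Linear(context * emb_dim, hidden),      # assumed hidden size
            nn.ReLU(),
            nn.Linear(hidden, n_letters),              # logits for the 26 letters
        )

    def forward(self, x):                              # x: (batch, 10) integer indices
        return self.net(self.emb(x))

model = CharMLP()
logits = model(torch.randint(0, 26, (4, 10)))          # a batch of 4 contexts
probs = logits.softmax(dim=-1)                         # 26 probabilities per sample
```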
Sources
- Generic news articles
- Wikipedia articles
- Scientific articles (Kurzgesagt)
- License text
- Books