Model and training
This part of the repository contains the code for training and modifying the model used by the app.
Training
There are two notebooks related to training.
The MLP proved to be far more accurate, so it is the model the app uses.
Data
See Sources for all data used.
The data comprises news articles, scientific articles, a couple of books, and Wikipedia articles.
The text was extracted by simply copying it (Ctrl+C, Ctrl+V), which brought along some unwanted garbage such as numbers, Wikipedia link markers, etc.
The first step in processing the data uses the included scripts, mainly words.sh,
which extracts only alphanumeric strings and places each one on its own line.
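The extraction itself lives in words.sh; the Python below is only a minimal sketch of the same idea. The file names (raw_text.txt, words.txt) are illustrative, not taken from the repository.

```python
import re

# Read the raw copy-pasted text and keep only the alphanumeric runs,
# writing one token per line -- the same effect words.sh aims for.
# File names are illustrative.
with open("raw_text.txt", encoding="utf-8") as f:
    raw = f.read()

tokens = re.findall(r"[A-Za-z0-9]+", raw)

with open("words.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(tokens))
```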
The second step cleans the data completely by allowing only the 26 characters of the English alphabet. This is done by clear.sh.
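A minimal Python sketch of this filtering step, assuming the cleaning also lowercases the text (the real logic is in clear.sh; lowercasing and file names are assumptions):

```python
import string

ALLOWED = set(string.ascii_lowercase)

# Lowercase each word and keep it only if every character is one of the
# 26 English letters. File names are illustrative.
with open("words.txt", encoding="utf-8") as f:
    words = [line.strip().lower() for line in f]

cleaned = [w for w in words if w and set(w) <= ALLOWED]

with open("cleaned_words.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(cleaned))
```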
The third and final step turns the data into a numpy array for space efficiency (instead of CSV). This is done by transform.py. The model can then be trained on this data.
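The exact array layout is defined by transform.py and may differ from the sketch below, which only illustrates the idea of storing letter indices as a compact numpy array instead of CSV (file names are illustrative):

```python
import numpy as np

# Map each letter to an index 0-25 and save the result as a numpy array.
# The real transform.py may use a different layout; this is a sketch.
with open("cleaned_words.txt", encoding="utf-8") as f:
    words = [w.strip() for w in f if w.strip()]

encoded = [np.frombuffer(w.encode("ascii"), dtype=np.uint8) - ord("a")
           for w in words]

# Object array of per-word index vectors; allow_pickle is required
# because the words have varying lengths.
np.save("words.npy", np.array(encoded, dtype=object), allow_pickle=True)
```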
Structure
The current model is a character-level predictor that uses the previous 10 characters to predict the next one.
It was trained on the processed word list (the dataset can be found in data/all_cleaned_words.txt).
A sketch of the architecture follows the list below.
- Model type: Multilayer Perceptron
- Input shape: 10 * 16 (10 previous characters as 16-dimensional embeddings)
- Output shape: 26 (one probability for each letter of the English alphabet)
- Dataset: 220k words from various types of sources, mainly books
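The actual network is defined in the training notebooks; the PyTorch sketch below only illustrates the shape of the described architecture. The framework, the hidden-layer size (256), and the class name are assumptions, not taken from the repository.

```python
import torch
import torch.nn as nn

# Hypothetical sketch: 10 previous character indices -> 16-dim embeddings
# -> flattened 10 * 16 = 160 vector -> MLP -> 26 letter probabilities.
class CharMLP(nn.Module):
    def __init__(self, context=10, emb_dim=16, hidden=256, n_letters=26):
        super().__init__()
        self.emb = nn.Embedding(n_letters, emb_dim)    # 26 letters -> 16 dims
        self.net = nn.Sequential(
            nn.Flatten(),                              # (batch, 10, 16) -> (batch, 160)
            nn.Linear(context * emb_dim, hidden),      # assumed hidden size
            nn.ReLU(),
            nn.Linear(hidden, n_letters),              # logits for the 26 letters
        )

    def forward(self, x):                              # x: (batch, 10) integer indices
        return self.net(self.emb(x))

model = CharMLP()
logits = model(torch.randint(0, 26, (4, 10)))          # a batch of 4 contexts
probs = logits.softmax(dim=-1)                         # 26 probabilities per sample
```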
Sources
- Generic news articles
- Wikipedia articles
- Scientific articles (Kurzgesagt)
- License text
- Books