This project is really huge.
For now I'm going to analyze only its first release, which has very limited functionality.
Functionality to implement
- Abilities to train:
  - Reading
  - Listening
- Flow:
  - Show a text and read it aloud to the user, highlighting the portion already read in one color and the next word to be read in another.
  - Prompt the user to select unknown words.
  - Show explanations for each selected word from publicly available dictionaries, one at a time.
  - Select the next texts from other uses of the unknown words and add them to the reading queue.
  - A new user starts with a number of simple texts in the queue.
  - If the queue becomes empty (the user knows every word in the text), present a list of topics to select from.
  - Ideally, a new text should contain no more than 10% new words.
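
The text selection step can be sketched roughly as below. This is only a minimal sketch under my own assumptions: the corpus is a plain dict of text id to text, a "new" word is any word not yet in the user's dictionary, and the names `tokenize`, `build_index`, `new_word_ratio` and `next_texts` are hypothetical, not part of any existing code.

```python
import re
from collections import defaultdict

# Assumption: English-only texts, so a very simple tokenizer is enough.
WORD_RE = re.compile(r"[a-z']+")


def tokenize(text):
    """Split a text into lowercase words."""
    return WORD_RE.findall(text.lower())


def build_index(corpus):
    """Map each word to the ids of the texts that contain it."""
    index = defaultdict(set)
    for text_id, text in corpus.items():
        for word in set(tokenize(text)):
            index[word].add(text_id)
    return index


def new_word_ratio(text, known_words):
    """Share of distinct words in the text that are not in known_words."""
    words = set(tokenize(text))
    if not words:
        return 0.0
    return len(words - known_words) / len(words)


def next_texts(corpus, index, unknown_words, known_words, max_new_ratio=0.10):
    """Texts that use the unknown words and stay under the new-word limit."""
    candidates = set()
    for word in unknown_words:
        candidates |= index.get(word, set())
    ranked = []
    for text_id in candidates:
        ratio = new_word_ratio(corpus[text_id], known_words)
        if ratio <= max_new_ratio:
            ranked.append((ratio, text_id))
    # Easiest texts (lowest share of new words) go to the queue first.
    return [text_id for ratio, text_id in sorted(ranked)]
```

Note that the word being practiced counts as new here, so very short texts will rarely pass the 10% limit. Whether that is the right definition is an open question.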
Data
- Corpus of short texts indexed by:
  - Words
  - Morphemes (when possible)
  - Idioms (when possible)
- Word explanations, collected on demand
- Data about the user's performance:
  - Frequency, time spent, and payload of lessons
  - User's dictionary: words that were never marked as unknown or have stopped being marked as such (with a count of both states or a list of docs)
  - Attention: how many words are first marked as unknown only after a number of appearances
  - Learning effectiveness: how many times a word is marked as unknown
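
The per-user data above could be kept in a structure like the following. It is a minimal sketch, not a decided schema: the class and field names (`WordStats`, `UserDictionary`, `seen_before_first_mark` and so on) are hypothetical, and I assume a word counts as "known" when it was not marked as unknown the last time it appeared.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class WordStats:
    seen: int = 0                                 # texts in which the word appeared
    marked_unknown: int = 0                       # times it was marked as unknown
    seen_before_first_mark: Optional[int] = None  # appearances before the first mark
    unknown_last_time: bool = False               # state at the latest appearance


@dataclass
class UserDictionary:
    words: Dict[str, WordStats] = field(
        default_factory=lambda: defaultdict(WordStats)
    )

    def record_text(self, text_words, unknown_words):
        """Update the stats after the user has finished one text."""
        for word in set(text_words):
            stats = self.words[word]
            marked = word in unknown_words
            if marked:
                if stats.marked_unknown == 0:
                    stats.seen_before_first_mark = stats.seen
                stats.marked_unknown += 1
            stats.unknown_last_time = marked
            stats.seen += 1

    def known_words(self):
        """Words never marked as unknown, or no longer marked as such."""
        return {w for w, s in self.words.items() if not s.unknown_last_time}

    def late_first_marks(self, min_seen=1):
        """Attention: words first marked only after min_seen appearances."""
        return sum(
            1
            for s in self.words.values()
            if s.seen_before_first_mark is not None
            and s.seen_before_first_mark >= min_seen
        )
```

Learning effectiveness for a single word is then simply its `marked_unknown` counter.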
Technology to use
- A progressive web app (PWA) storing the user's data on the user's machine
  - Same code for desktop/tablet/mobile
  - No installation or upgrades
  - 100% private for the user
  - user == user agent (browser or mobile device)
- Store the corpus and word explanations in hosted MongoDB
  - MongoDB supports regular expressions as a data type.
  - I know how to do it in SQL. Now it's time to learn NoSQL 😉.
  - Hosted (not self-hosted) is a necessity. I'm not ready to run my own web server yet.
- Use clustering for morpheme and idiom extraction
  - It should be able to accept hints from users (with a grain of salt, of course)
- Use GraphQL for client-server interaction
  - Easier to grow with the project than REST
  - Steeper learning curve
- Use the Text-to-Speech (TTS) functionality available in the browser
  - No need to store audio
  - Easy synchronization with text highlighting
- For the beginning, implement the backend in Python
  - Python is the best-equipped language for ML
  - It supports everything I need in the backend now
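
As a first taste of regular expressions in MongoDB, here is a minimal sketch of a query filter that finds texts containing any word starting with a given stem. It uses MongoDB's `$regex` and `$options` query operators. The field name `words` and the function `word_family_filter` are my assumptions, since the real schema does not exist yet.

```python
import re


def word_family_filter(stem):
    """Build a MongoDB filter matching texts with a word starting with stem.

    Assumption: each text document has a "words" array field (hypothetical).
    """
    return {"words": {"$regex": "^" + re.escape(stem), "$options": "i"}}


# With PyMongo the filter would be passed to Collection.find, for example:
#     texts.find(word_family_filter("read"))
# where "texts" is a hypothetical collection of corpus documents.
```

The same pattern can be checked locally with Python's `re` module before sending it to the database.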
Need to learn
- Asynchronous programming in Python
- GraphQL
- PyMongo
- Motor?
- Graphene or Ariadne?
- Tornado?
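
To get a feeling for asynchrony in Python, here is a minimal `asyncio` sketch that collects explanations of one word from several dictionaries concurrently. The `lookup` coroutine is a stand-in I made up: it only sleeps instead of calling a real dictionary service, and the dictionary names are placeholders.

```python
import asyncio


async def lookup(dictionary_name, word):
    """Stand-in for a network call to a public dictionary (hypothetical)."""
    await asyncio.sleep(0.01)  # simulates waiting for the remote service
    return f"{dictionary_name}: explanation of '{word}'"


async def collect_explanations(word, dictionaries):
    """Query all dictionaries concurrently; results keep the input order."""
    tasks = [lookup(name, word) for name in dictionaries]
    return await asyncio.gather(*tasks)


if __name__ == "__main__":
    for line in asyncio.run(collect_explanations("salt", ["dict_a", "dict_b"])):
        print(line)
```

Because the lookups wait at the same time, the total delay stays close to that of the slowest dictionary rather than the sum of all of them.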
Ready snippets
To be continued ...