Tuesday, August 9, 2022

From What to How: language learning app - initial analysis

This project is really huge.
For now I'm going to analyze only its first release, which has very limited functionality.

Functionality to implement

  • Abilities to train:
    • Read
    • Listen
  • Flow:
    • Show a text and read it aloud to the user, highlighting the portion already read in one color and the next word to be read in another
    • Prompt the user to select unknown words
    • Show explanations for each selected word, one at a time, from publicly available dictionaries
    • Select further texts that use the unknown words in other contexts and add them to the reading queue
    • A new user starts with a number of simple texts in the queue.
    • If the queue becomes empty (the user knows every word in the text), present a list of topics to choose from.
    • Ideally, a new text should contain no more than 10% new words.
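
The text selection steps above can be sketched roughly as follows. This is only a minimal illustration, not a settled design: the names tokenize, new_word_ratio and pick_next_texts are hypothetical, the tokenizer is deliberately naive, and "new" is assumed to mean a word that is neither known nor among the unknown words currently being practiced.

```python
import re


def tokenize(text):
    """Naive lowercase word tokenizer; a stand-in for real tokenization."""
    return re.findall(r"\w+", text.lower())


def new_word_ratio(text, known_words, unknown_words):
    """Share of distinct words in the text the user has never met.

    Assumption: words already marked as unknown are being practiced,
    so they do not count as new.
    """
    words = set(tokenize(text))
    if not words:
        return 0.0
    new_words = words - known_words - unknown_words
    return len(new_words) / len(words)


def pick_next_texts(corpus, known_words, unknown_words, max_new_ratio=0.10):
    """Pick texts that reuse an unknown word and stay under the new-word limit."""
    candidates = []
    for text in corpus:
        words = set(tokenize(text))
        if not words & unknown_words:
            continue  # the text does not practice any unknown word
        ratio = new_word_ratio(text, known_words, unknown_words)
        if ratio <= max_new_ratio:
            candidates.append((ratio, text))
    # Texts with the fewest new words come first in the queue.
    candidates.sort(key=lambda pair: pair[0])
    return [text for _, text in candidates]
```

With a real corpus the loop over all texts would presumably be replaced by a lookup in the word index, but the filtering rule would stay the same.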

Data

  • Corpus of short texts indexed by:
    • Words
    • Morphemes (when possible)
    • Idioms (when possible)
  • Word explanations collected on demand
  • Data about user's performance:
    • Frequency, time spent and payload of lessons
    • User's dictionary: words that were never marked as unknown or are no longer marked as such (with a count of both states or a list of docs)
    • Attention: how many words are marked as unknown for the first time only after a number of appearances
    • Learning effectiveness: how many times is a word marked as unknown?
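
The per-user data above could be kept in a structure like the one below. It is a sketch under assumptions, not the planned schema: WordStats, UserDictionary and all field names are hypothetical, and a word is treated as "known" when it was not marked as unknown the last time it appeared.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class WordStats:
    seen: int = 0                       # texts in which the word appeared
    marked_unknown: int = 0             # times the user marked it as unknown
    last_marked_unknown: bool = False   # state after the most recent text
    # Appearances before the first "unknown" mark; None if never marked.
    appearances_before_first_mark: Optional[int] = None


class UserDictionary:
    def __init__(self):
        self.stats = {}

    def record_text(self, words, unknown):
        """Update statistics after the user has finished one text."""
        for word in set(words):
            s = self.stats.setdefault(word, WordStats())
            if word in unknown:
                if s.marked_unknown == 0:
                    s.appearances_before_first_mark = s.seen
                s.marked_unknown += 1
                s.last_marked_unknown = True
            else:
                s.last_marked_unknown = False
            s.seen += 1

    def known_words(self):
        """Words never marked as unknown, or no longer marked as such."""
        return {w for w, s in self.stats.items() if not s.last_marked_unknown}

    def late_unknown_count(self, min_appearances=1):
        """Attention: words first marked unknown after earlier appearances."""
        return sum(
            1
            for s in self.stats.values()
            if s.appearances_before_first_mark is not None
            and s.appearances_before_first_mark >= min_appearances
        )
```

Learning effectiveness for a single word is then simply its marked_unknown counter.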

Technology to use

  • A progressive web app (PWA) storing user's data on user's machine.
    • same code for desktop/tablet/mobile
    • no installation or upgrades
    • 100% private for the user
    • user == user agent (browser or mobile device)
  • Store corpus and word explanations in hosted MongoDB
    • MongoDB supports regular expressions as a data type.
    • I know how to do it in SQL. Now it's time to learn NoSQL 😉.
    • Hosted (not self-hosted) is a necessity. I'm not ready to run my own web server yet.
  • Use clustering for morpheme and idiom extraction
    • It should be able to accept hints from users (with a grain of salt, of course)
  • Use GraphQL for client-server interaction
    • Easier to grow with the project than REST
    • Steeper learning curve
  • Use Text-to-Speech (TTS) functionality available in browser
    • No need to store audio
    • Easy synchronization with text highlight
  • To begin with, implement the backend in Python
    • Python is the best-equipped language for ML
    • It supports everything I need in the backend now
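
As a small illustration of how regular expressions might be used against the corpus in MongoDB, the sketch below builds a query document with the $regex operator. The function name word_query and the field name "text" are hypothetical; nothing about the actual collection layout is decided yet.

```python
import re


def word_query(word, field="text"):
    """Build a MongoDB filter matching documents that contain the whole word.

    Assumption: each corpus document keeps its body in a string field
    (here called "text"). The "i" option makes the match case-insensitive.
    """
    pattern = r"\b" + re.escape(word) + r"\b"
    return {field: {"$regex": pattern, "$options": "i"}}


# With PyMongo the filter would be passed to find(), for example:
#   collection.find(word_query("cat"))
```

A regex scan like this is unlikely to be fast on a large corpus, which is one reason to index texts by word up front.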

Need to learn

  • Asynchronous programming in Python
  • GraphQL
  • PyMongo
  • Motor?
  • Graphene or Ariadne?
  • Tornado?
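
For the asynchrony item, a first experiment might look like the sketch below, which uses only the standard asyncio module. The lookup function is a stub standing in for a real dictionary request, and the dictionary names are made up for illustration.

```python
import asyncio


async def lookup(dictionary_name, word):
    """Stub for a network request to one dictionary (hypothetical)."""
    await asyncio.sleep(0.01)  # pretend to wait for the network
    return f"{dictionary_name}: explanation of '{word}'"


async def explain(word, dictionaries):
    """Query all dictionaries concurrently and collect the answers by name."""
    results = await asyncio.gather(
        *(lookup(name, word) for name in dictionaries)
    )
    return dict(zip(dictionaries, results))


if __name__ == "__main__":
    print(asyncio.run(explain("cat", ["dict_a", "dict_b"])))
```

Because the lookups run concurrently, the total wait is roughly that of the slowest dictionary rather than the sum of all of them.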

Ready snippets

To be continued ...