Jarvis MVP (Bartmoss' personal AI assistant)

Overview

This is just a very rough (read: crappy) first draft of the MVP and very much a work in progress (also, for an MVP, it's massive). Hopefully this can also help you write an MVP.

Jarvis is my personal AI assistant. It focuses on the following points:

  • modular: parts of the system can be easily swapped out for changes and upgrades
  • communication standards: uses popular protocols for communication
  • light, accurate wakeword: works on raspi 4, raspi 0, phone, etc.
  • client-server: the assistant runs on satellite devices (ie raspi 0) while some components (ie TTS) are passed off to the server (see the sketch after this list)
  • NLU/NLG and actions: light enough to run on raspi 4
  • ASR/TTS: light enough to run on Jetson Xavier NX or NUC
  • ITIL compliant: self-help, requests, issues, problems, monitoring, etc.
  • learning: the system can record and adapt to simple requests for changes automatically (ie adding or changing the intent of an utterance)
  • open source
  • self-hosted services as skills: instead of relying on Google, Amazon, or whatever third-party service, many of the skills can be backed by self-hosted services
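
To make the client-server and communication-standards points concrete, here is a minimal sketch of a server-side listener, assuming the Hermes MQTT protocol that Rhasspy (and Snips before it) uses; the broker hostname is a made-up placeholder:

```python
# Minimal sketch: the server listens for Hermes events published by satellites.
# Topic names follow the Hermes MQTT convention used by Rhasspy/Snips;
# the broker hostname is a placeholder.
import json
import paho.mqtt.client as mqtt

BROKER = "jarvis-server.local"  # assumed address of the central server

def on_connect(client, userdata, flags, rc):
    # Satellites publish wakeword detections and ASR transcriptions here
    client.subscribe("hermes/hotword/+/detected")
    client.subscribe("hermes/asr/textCaptured")

def on_message(client, userdata, msg):
    payload = json.loads(msg.payload)
    # siteId identifies which satellite (eg the living room raspi) fired
    print(msg.topic, payload.get("siteId"))

client = mqtt.Client()
client.on_connect = on_connect
client.on_message = on_message
client.connect(BROKER, 1883)
client.loop_forever()
```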

Current state

  • raspi 4, 8GB RAM, fan cooled, ARM64
  • runs in the living room only
  • uses a PS Eye microphone array (tweaked with PulseAudio to reduce noise and enhance the voice)
  • ASR is handled by the Google API instead of running locally
  • TTS runs locally with Mimic (but it sounds terrible)
  • Built my own wakeword (near production-level quality)
    • Just a few too many false wake ups per week (~4 a week; I want to reduce it to ~1 a week)
    • Overly sensitive, requires parameter fine tuning to avoid false wake ups
      • Failed tflite testing as production model
  • Mostly uses Mycroft
    • Home assistant skill to control the lights
      • Only really works with the lights
      • NLG grammar issues on plural entities (ie for 'kitchen lights' it says 'the kitchen lights is now off')
    • The weather skill can't seem to give the weather for the weekend (or for a future date, or a number of days ahead); it just offers today's weather.
      • This issue was raised to the developers
      • Possible hot fix: change to dev branch of weather skill?
    • On fallback, the timer skill is always falsely triggered
      • The timer and fallback skills can't be permanently removed from Mycroft as a hot fix
      • Reinstalling or upgrading doesn't solve this (tried reinstalling both the skills and the whole system)
      • If it hears anything, it will keep repeating 'how long of a timer' as a follow-up until the environment is silent
      • This issue was raised to the developers
      • This really needs to be fixed, one way or another
  • Experiments
    • Telegram skill
      • Responses are still spoken out loud
      • Lots of strange errors (this needs a lot of work!)
    • Rasa skill
      • The dialog handling for ASR input isn't ideal (it beeps when it wants an input, and does this for every turn of the dialog)
      • Forms aren't handled codelessly in Rasa X
    • Issue reporting skill: diagnostic mode
      • Initially built with Adapt (Padatious doesn't support follow-ups; see the sketch after this list)
      • It works for simple issue reporting and sending an email; complex forms aren't implemented
      • Tried to take it further with the Rasa skill, but that didn't really work out
      • I was able to get the last utterance (not counting the utterance that triggers 'diagnostic mode'), the response, etc. from Mycroft and even pass it to Rasa (these features are very poorly documented for Mycroft, so it takes a lot of guesswork)
    • Learning skill
      • It works okay for creating utterances (keywords for Adapt) and responses.
      • No deeper dialog generation possible (ie follow-ups)
      • Otherwise limited to basic utterance-response pairs: no actions, no mapping to existing intents, etc.
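
For reference, the Adapt follow-up pattern the issue reporting experiment relied on looks roughly like this. This is a hedged sketch against the standard mycroft-core skill API; the vocab and dialog file names are hypothetical:

```python
# Sketch of an Adapt intent with a follow-up turn, assuming the standard
# mycroft-core skill API. 'DiagnosticKeyword' would be a .voc file and the
# dialog names .dialog files in the skill; all names here are hypothetical.
from adapt.intent import IntentBuilder
from mycroft import MycroftSkill, intent_handler

class DiagnosticModeSkill(MycroftSkill):

    @intent_handler(IntentBuilder("DiagnosticIntent")
                    .require("DiagnosticKeyword"))
    def handle_diagnostic(self, message):
        # get_response() speaks a prompt and waits for the user's next
        # utterance, which is how a skill gets a follow-up turn
        issue = self.get_response("ask.issue.description")
        if issue:
            self.log.info("Reported issue: %s" % issue)
            self.speak_dialog("issue.recorded")

def create_skill():
    return DiagnosticModeSkill()
```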

Next state

This will be the final state of using mostly Mycroft in production (as a whole system); after this, Jarvis will be pivoted to Rhasspy in production (with some components of Mycroft).

  • create the final production wakeword model (tflite production model; see the sketch after this list)
    • blocker: wakeword data prep must be done
  • switch to dev branch for weather as quick fix?
  • attempt solution to fallback timer trigger?
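
The tflite step itself is mostly mechanical. Here is a hedged sketch using the stock TensorFlow Lite converter; the filenames are placeholders, and a real Precise checkpoint may need custom_objects to load:

```python
# Sketch: convert a trained Keras wakeword model to a tflite production
# model. Paths are placeholders; a Precise checkpoint may need
# custom_objects when loading.
import tensorflow as tf

model = tf.keras.models.load_model("hey-jarvis.h5")
converter = tf.lite.TFLiteConverter.from_keras_model(model)
# Default optimization quantizes weights, shrinking the model for the
# raspi zero / phone class of satellite devices
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()

with open("hey-jarvis.tflite", "wb") as f:
    f.write(tflite_model)
```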

Future state

  • Rhasspy
  • Mycroft components (forked version of Precise, some of the skills)
  • Rasa?
  • Snips skills?
  • communication via Telegram (at least text; transcribing recorded audio messages via ASR is a nice to have)
  • NLG grammatical agreement (see the sketch after this list)
  • nice to have: quick commands (a quick command can be used without requiring the wake word first, ie 'Jarvis I left the grill on' → 'I've set a timer for ten minutes to remind you to turn the grill off')
  • centralized skills
    • automated defect management (diagnostic mode)
    • briefs (informational briefs)
    • self-help (OQA on documentation)
    • expanded home assistant skill from Mycroft (ie intent to vacuum a room with the vacuum robot)
    • onboarding (tour of home and of Jarvis)
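
As a toy illustration of the NLG agreement item (and the 'kitchen lights is now off' bug above), the fix is to pick the verb form from the entity. A real implementation should use the entity's metadata rather than this naive plural check:

```python
# Toy sketch of NLG grammatical agreement: choose is/are from the entity
# name so plural entities don't produce 'the kitchen lights is now off'.
# A naive suffix check stands in for real entity metadata.
def state_response(entity: str, state: str) -> str:
    plural = entity.lower().split()[-1].endswith("s")
    verb = "are" if plural else "is"
    return f"The {entity} {verb} now {state}."

print(state_response("kitchen lights", "off"))  # The kitchen lights are now off.
print(state_response("desk lamp", "on"))        # The desk lamp is now on.
```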

User Stories

  • Onboarding: When a new user comes to my home, they should be onboarded (given a tour by Jarvis).
    • When prompted (ie 'hey jarvis, introduce yourself to [name]'), Jarvis will introduce itself (did it get the name right?)
      • Jarvis asks the user to try the wakeword
        • if the wakeword doesn't work, Jarvis collects new recordings from the user to be processed into the wakeword model
      • Jarvis gives a tour of the home, showing users the home while teaching them how to use the hardware and software of the home (including utterances for the voice assistant)
  • Self help and incident management: When a user wants to raise an issue or has a question, Jarvis can address it.
    • Automated defect management: diagnostic mode
      • Upon fallback (when there is no intent matching an utterance), the automated defect management skill (diagnostic mode) is activated (see the sketch after this list)
        • User can skip this if they don't want to go through the questions to diagnose
        • Jarvis asks the user if the utterance was transcribed correctly (detect if ASR issue)
        • If the user says the utterance is correctly transcribed, Jarvis asks the user other questions to determine if the utterance should be mapped to an existing intent (a variant utterance) or if it is currently out of scope, etc.
      • User can activate diagnostic mode at any time
        • Jarvis asks a series of questions to determine the issue (ie last utterance went to wrong intent and needs to be mapped to another intent)
        • If the issue can't be hot fixed, the user can get Jarvis to make a note of this issue (report an issue, ie send as email, add to a DB, etc.)
      • User can report false wake ups
        • When Jarvis is inadvertently activated, a user can simply say 'false wakeup'; this routes to the diagnostic mode intent for false wake ups, and the audio that caused the false wake up is saved to later improve the model
    • Self help desk
      • A user can ask questions that will be answered by the documentation of the projects used
      • examples
        • How do I connect my lights to be controlled automatically?
        • How do I use X (service)?
  • User can write to Jarvis via Telegram (or Slack, etc.) instead of just speaking to Jarvis; the user receives the answer as a text reply
  • Configurable informational briefs
    • User can set up informational briefs (without coding)
    • User can say an utterance and receive a customized informational brief
      • Example
        • User: 'hey jarvis, good morning'
        • Jarvis: 'good morning [name], here's your good morning brief...'
          • weather
          • calendar
          • stocks/crypto
  • Home assistant
    • User can use utterances to control the entities in home assistant (similar to home assistant skill in Mycroft)
  • When a user asks who they are, Jarvis answers with their user name (knows the identity)
  • When a user asks where they are, Jarvis answers which room they are in (knows the location) (nice to have)
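
A minimal sketch of the diagnostic-mode-on-fallback story, assuming the mycroft-core FallbackSkill API; the priority value and dialog names are illustrative assumptions:

```python
# Sketch: activate diagnostic mode when no intent matches an utterance,
# using mycroft-core's fallback mechanism. Priority and dialog names are
# assumptions for illustration.
from mycroft import FallbackSkill

class DiagnosticFallback(FallbackSkill):

    def initialize(self):
        # High number = low priority, so real skills (and other fallbacks)
        # get the utterance first
        self.register_fallback(self.handle_fallback, 90)

    def handle_fallback(self, message):
        utterance = message.data.get("utterance", "")
        # First question: was the utterance transcribed correctly? (ASR check)
        answer = self.ask_yesno("ask.transcribed.correctly",
                                data={"utterance": utterance})
        if answer == "yes":
            # ASR is fine, so this is an NLU gap: log it so it can be
            # mapped to an existing intent or marked out of scope later
            self.log.info("NLU gap: %s" % utterance)
        return True  # claim the utterance so no other fallback fires

def create_skill():
    return DiagnosticFallback()
```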

Problem Statements

  • modular control of pipeline components can be a challenge
  • wakeword
    • making a custom wakeword is hard and time consuming
    • Getting it working on many devices (including low-powered ones) isn't standard
    • Even if it works on many devices, it still has to be accurate, which is a hard balancing act
  • NLU-NLG
    • Selecting the correct NLU engine(s) isn't exactly a walk in the park, and you can get locked into that system
    • lack of skills
    • building new skills isn't so easy
    • automated defect management isn't even addressed in most projects
    • data collection is always a problem
  • ASR (we will skip this for now)
  • TTS (we will also skip this for now)
  • identity
    • identify user from voice (yet another thing that I have never tried)
    • identify location of user (reading about locating people via home assistant has taught me this is harder than you would think)

Proposed Solutions

  • modular control
    • Use Rhasspy to control the pipeline components together
  • wakeword
    • Wakeword data collection system (I have already collected all of the data for my wake word)
    • Use Precise with tflite
      • Several levels of compressed models
        • Compressed models run on satellite devices (set with a slightly higher sensitivity: better too many false positives than false negatives)
        • The uncompressed model sits on the server to verify positives passed from the satellites more accurately (see the sketch after this list)
    • Optimize the data set for sub-categories and the training-test split (I have done this manually; I will use the data prep script, once it is done, to optimize my models further)
  • NLU-NLG
    • Test different NLU-NLG engines to find the optimal solution for intent classification and response generation
    • Use pre-built skills from other projects (Mycroft, Snips) to 'fill in' missing skills
    • Find a better way to build skills (?)
    • Build in automated defect management skills (for user-facing defects) that handle defects by 'learning'
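
A sketch of the two-stage wakeword check described above, assuming both Precise models have been exported to tflite. Thresholds, paths, and the feature input are assumptions, and in the real system the second check would hop over the network (eg MQTT) to the server:

```python
# Sketch of two-stage wakeword verification: a compressed model on the
# satellite scores audio permissively; only its positives are re-checked
# by the larger model (which would live on the server). All paths and
# thresholds are assumptions.
import numpy as np
import tflite_runtime.interpreter as tflite

SATELLITE_THRESHOLD = 0.5  # permissive: prefer false positives here
SERVER_THRESHOLD = 0.8     # strict: the big model makes the final call

def score(model_path: str, features: np.ndarray) -> float:
    # Reloading per call keeps the sketch short; a real loop would keep
    # one interpreter per model alive
    interpreter = tflite.Interpreter(model_path=model_path)
    interpreter.allocate_tensors()
    inp = interpreter.get_input_details()[0]
    out = interpreter.get_output_details()[0]
    interpreter.set_tensor(inp["index"], features.astype(np.float32))
    interpreter.invoke()
    return float(np.squeeze(interpreter.get_tensor(out["index"])))

def is_wakeword(features: np.ndarray) -> bool:
    if score("hey-jarvis-small.tflite", features) < SATELLITE_THRESHOLD:
        return False
    # In production this second check runs on the server
    return score("hey-jarvis-full.tflite", features) >= SERVER_THRESHOLD
```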

Success Criteria

  • wakeword works 5 times in a row and averages only 1 false wake up per week
    • It doesn't work with just 'Jarvis'; it requires 'hey Jarvis'
  • latency of wakeword recognition decreases by at least 35%
  • reduction in reported unresolved defects by 80%
  • all missing variations of an utterance (that ASR transcribes correctly) can be handled
  • average daily usage of Jarvis (measured by number of utterances) increases by 300%
  • NLG 98% grammatically correct on test cases

Scope

Define what will be done and what will not be done as part of this project.

Requirements

Non-Requirements