Last summer I made a prototype, Button Inputter, using a combination of an OS customization app and a Raspberry Pi Pico. It worked most of the time, but it occasionally failed somewhere in the chain of steps the experience was built on, and it was hard to track down the weak links when audio focus or permissions issues popped up. I also found that I was using it a lot, like 11,000-times-in-the-last-six-months a lot. This tool clearly had value to me, and it was time to make it more durable, with a stable foundation for fixing issues and optimizing.

I rebuilt the software experience as VoiceButton, a macOS app that handles the transcription and visualizes the interaction. It runs as a menu bar app for easy access and can sit in the background, listening for input from the hardware button. The transcription model, Whisper, is downloaded automatically and runs more efficiently than my original method: a single sentence transcribes in under 100 milliseconds, so the text appears essentially instantaneously when you release the button.
I had a small set of goals that a dedicated app would afford. I wanted to improve visual user feedback, add some delight since it's a product interaction surface I see many times a day, and have a platform for ongoing experiments.
After the initial use with limited visual feedback, I knew I needed more clarity on what the system was doing. I wanted it to be obvious that the button press was detected, because I don't yet fully trust that it's working (and I feel dumb when I talk to something and find out it wasn't listening). I also wanted to know how long transcription will take when it's more than 100–150ms, the threshold for feeling instantaneous. The latter is a major need for follow-on steps like processing with an LLM, which may take a second or more.
The recording state uses a blooming pattern that expands as content comes in. As you speak and the voice level increases, the animation expands at a faster rate. If the system can figure out where the text will be inserted, it also shows the animation at the cursor's location.
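One way to drive that kind of bloom is to map the microphone level to an expansion rate, so louder speech grows the pattern faster while quiet speech still blooms slowly. A minimal sketch of the idea in Python; the function names, constants, and linear mapping are my own illustration, not VoiceButton's actual animation code:

```python
def bloom_rate(rms_level: float, base_rate: float = 1.0, gain: float = 4.0) -> float:
    """Map a 0..1 microphone level to an expansion rate.

    Quiet input still blooms at base_rate; louder input blooms faster.
    The linear mapping and constants are illustrative assumptions.
    """
    level = max(0.0, min(1.0, rms_level))  # clamp to the expected range
    return base_rate + gain * level

def step_radius(radius: float, rms_level: float, dt: float) -> float:
    """Advance the bloom's radius by one animation frame of length dt seconds."""
    return radius + bloom_rate(rms_level) * dt

# Silence grows slowly; a loud moment grows 5x as fast.
quiet = step_radius(0.0, 0.0, 1 / 60)
loud = step_radius(0.0, 1.0, 1 / 60)
```

A real implementation would likely smooth the level over a short window so the animation doesn't jitter frame to frame.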
For processing, the animation duration is estimated from the length of the recorded audio, calibrated against processing times from recent use. The bloomed pattern then dissolves into the transcribed text, a reverse of the input animation. I've enjoyed watching it on very long transcriptions (over a minute), when processing takes several seconds, and in experiments with LLM processing, where a long bit of text gets converted into something else and you see the transformation in action.
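That estimate can be sketched as a rolling calibration: track the ratio of processing time to audio length across recent runs, then scale the next clip's length by that ratio. A minimal sketch under my own assumptions (the class name and the exponential moving average are illustrative; the app's actual calibration may differ):

```python
class ProcessingTimeEstimator:
    """Estimate transcription time from audio length, calibrated on recent runs."""

    def __init__(self, alpha: float = 0.3, default_ratio: float = 0.1):
        self.alpha = alpha          # weight given to the newest observation
        self.ratio = default_ratio  # seconds of processing per second of audio

    def estimate(self, audio_seconds: float) -> float:
        """Predicted processing time for a clip of the given length."""
        return self.ratio * audio_seconds

    def record(self, audio_seconds: float, processing_seconds: float) -> None:
        """Fold an observed run into the calibration (exponential moving average)."""
        observed = processing_seconds / audio_seconds
        self.ratio = (1 - self.alpha) * self.ratio + self.alpha * observed

est = ProcessingTimeEstimator()
est.record(10.0, 0.8)   # a 10 s clip took 0.8 s to transcribe
est.record(60.0, 5.4)   # a 60 s clip took 5.4 s
predicted = est.estimate(30.0)  # predicted seconds for a 30 s clip
```

Driving the animation from the prediction, rather than a fixed duration, keeps the dissolve roughly in sync with the real work even as hardware load or clip length varies.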
The app is free and works with just the fn key; no hardware is required (download link below). Everything runs locally after the initial model download. No audio or data leaves your machine.
VoiceButton App
Now available!
A push-to-talk transcription app for macOS that runs completely locally. Voice input transcribed in 100–200ms.
Email me for questions.