Before there was the printing press, the pen or even the papyrus scroll, there were stories: stories of wisdom, ideas and understanding of the world, passed down through the generations as oral history. Around campfires in ancient caves and water coolers in modern offices, voice is and always has been the most natural medium for people to communicate.
Voice as a second-class citizen
When the computer revolution happened, voice became a second-class citizen on the web, because computers found speech notoriously hard to understand. The voice content we create in interviews, videos and podcasts largely remains in a silo: it is not indexed, and it is not easily searchable or discoverable by computers.
There are literally billions of hours of voice content that are dark: not searchable or easily accessible by search engines.
The second coming
On the one hand, there is a huge amount of voice content that is not accessible; on the other, voice is having a second coming with devices like Alexa and Google Home. With AirPods, the friction of accessing audio content is falling and, in turn, more and more people are consuming voice content. The 100% year-over-year growth of podcasts is a great indicator.
So clearly, the demand for voice content is rising. A lot of evergreen content already exists, and a lot more needs to be created. We are moving to a world where voice will be a popular way to interact with computers. So, what’s missing?
Picks and Shovels
Whenever a new method of interaction appears, there is explosive demand for the new type of content, and you need easy-to-use picks and shovels: tools that anyone can use to create such content.
We have seen this before. With the rise of smartphones, touch screens became important and the world needed easy-to-use tools to create visual content, giving rise to companies like Canva, InVision and Sketch. The word processor did the same for text. Such tools are still missing for voice.
The core issues are:
1. Processing voice is hard: to make any change, you have to work with waveforms. What works for music does not really work for voice.
2. Processing voice is expensive: you need to work with transcriptionists and audio engineers who charge by the hour.
If we are to meet the demand for voice content, we need to make interacting with voice as easy as text.
The good news is that speech-to-text technology is improving, and its accuracy is approaching human levels. Like cloud storage, this technology will become a commodity, which will greatly reduce transcription costs.
But a lot of opportunities are still up for grabs:
- Getting to 100% accuracy, especially in verticals such as medicine and law
- Enabling voice content creation at scale
- Searching, sharing and organizing voice content
- Making voice content searchable
We believe that, in the future, interacting with voice will be as easy as interacting with text.
We are building Spext to accelerate us toward that future, where voice is easily editable, searchable and consumable. Onwards & upwards…
Anup is the co-founder of Spext. When not listening to podcasts, he can be found doodling ideas, reading esoteric books or trying to make puns – only half of which are decent.