About 2.5 years ago, I was working on my previous startup, Limitless, a time-tracking and productivity company. I was sitting alone at Waterfront, a restaurant in Foster City that serves an amazing falafel pizza, bested only by its view.
I was thinking of creative ways to grow the product while listening to a Tim Ferriss podcast on personal productivity. Maybe it was the bite of pizza or the gentle breeze over the calm water, but an idea stuck: what if we could transcribe the podcast episode, highlight a few pieces of it and publish it as a blog post? We could scale content marketing like crazy.
I immediately started searching for speech-to-text companies and emailed seven or eight of them. Only one replied.
Over the next few days, our team tested a bunch of files and found that the technology simply wasn't good enough for content marketing. Making the transcripts usable took too many corrections and too much time. It was just too early. We parked the idea, and a year later our team stopped working on Limitless to focus on other projects.
But the idea never really left us. So much amazing content was locked in voice, where it was neither discoverable nor shareable. It was a huge opportunity. While working on other projects, we kept track of speech-to-text research, talked to people in the space and kept hacking on solutions.
Then, around August and September of 2017, something special happened: speech-to-text accuracy crossed the threshold where it was actually usable. We had always known that the big players would be the first to crack the underlying speech-to-text technology because of the sheer volume of data they had.
As expected, Google, Amazon and a bunch of other players announced robust speech-to-text APIs. After an excruciating wait of 2.5 years, the right time for our idea had come.
While speech-to-text is largely solved, several other challenges remain:
- Creating voice content at scale
- Searching, sharing and organizing voice content
- Making voice content discoverable
For the past 10 months, we have been quietly working on these challenges with a product called Spext.
What is Spext?
Spext makes interacting with voice as easy as interacting with text.
It converts uploaded voice files to text using APIs such as Google's, precisely syncs the audio with the transcribed text, and represents silence and music as text. That means you can edit the underlying audio by deleting words or sentences in the transcript instead of cutting waveforms. Similarly, you can remove filler words like “uh” and “um,” or silent stretches of the audio. You can't create new sentences. Yet.
Podcasters, lawyers, marketers, interviewers and millions of others work with voice media every day (in both audio and video form) and have to learn complicated waveform-based editors. The learning curve is steep, so most outsource this work to expensive contractors.
On the other hand, billions of people already know how to use a text editor to create, edit and share text content. Spext brings the familiarity of the text editor to media, especially voice content, and we think it will help billions more create, curate and edit voice media easily. Here is a short video of how it works.
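To make the "edit audio by editing text" idea concrete, here is a minimal sketch of how transcript deletions could map to audio cuts. It is an illustration under stated assumptions, not Spext's actual implementation: it assumes the speech-to-text step returns a start and end time (in seconds) for every word, and the `Word` type, the `[silence]` token, and the helper names `kept_segments` and `cleanup_indices` are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical token sets; silence/music are assumed to appear as tokens.
FILLERS = {"uh", "um"}
NON_SPEECH = {"[silence]", "[music]"}


@dataclass(frozen=True)
class Word:
    """One transcript token and the span of audio (seconds) it covers."""
    text: str
    start: float
    end: float


def kept_segments(words, deleted):
    """Return (start, end) audio ranges to keep after deleting the tokens
    whose indices are in `deleted`.

    Consecutive kept tokens are merged into one continuous range, so the
    audio is only cut where the transcript was actually edited and the
    natural pauses between neighbouring kept words are preserved.
    """
    segments = []
    prev_kept = False
    for i, w in enumerate(words):
        if i in deleted:
            prev_kept = False
            continue
        if prev_kept:
            # Extend the current range up to the end of this word.
            segments[-1] = (segments[-1][0], w.end)
        else:
            # A deletion (or the start of the file) precedes this word.
            segments.append((w.start, w.end))
        prev_kept = True
    return segments


def cleanup_indices(words):
    """Indices of filler words and non-speech tokens to remove."""
    return {
        i
        for i, w in enumerate(words)
        if w.text.lower().strip(",.") in FILLERS or w.text in NON_SPEECH
    }


if __name__ == "__main__":
    words = [
        Word("So", 0.0, 0.3),
        Word("um", 0.4, 0.7),
        Word("welcome", 0.8, 1.2),
        Word("[silence]", 1.2, 3.0),
        Word("back", 3.0, 3.4),
    ]
    print(kept_segments(words, cleanup_indices(words)))
```

Running the example prints `[(0.0, 0.3), (0.8, 1.2), (3.0, 3.4)]`: the filler and the long silence are dropped, and an audio tool would then concatenate the kept ranges to render the edited file.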
Some other features:
- Audio search: Audio search is different from plain keyword search. Not only can you search for keywords, such as "sales", but you can also ask questions like “What are some sales tips?” and Spext will automatically find the parts of the media where people give sales tips. This feature is in beta, and we are excited to roll it out soon.
- Blazingly fast digital transcripts: We generate digital transcripts with 92%+ accuracy in minutes, even for media that is 2–3 hours long. With the built-in editor, you can easily correct the whole transcript or just portions of it. If you need even higher accuracy, our service Spectacular combines talented human transcriptionists with Spext.
- One-click post-production: Creating professional-sounding audio usually requires audio engineers, but Spext offers one-click post-production with features like background noise reduction, voice balancing across speakers and loudness control.
- Convert voice to video: Social media sites are optimized for video, not audio, which makes voice clips hard to share. Spext has pre-built video templates that make it easy to share video snippets of your audio.
This is just the beginning
With voice assistants like Alexa and Google Home, the demand for voice content is growing rapidly. Lowering the barrier for beginners to create voice-driven media is just the first step towards meeting this demand. Our users are already creating unique content like voice newsletters, voice snippets and audio blogs.
In the future, voice will be easily searchable, editable and consumable. This version of Spext is just the beginning of our effort to bring that future sooner.
Anup is the co-founder of Spext. When not listening to podcasts, he can be found doodling ideas, reading esoteric books or trying to make puns, only half of which are decent.