Digital period (1970-2000) - Digital sound encoding (DSE) was then invented by Sony in the
1970s, allowing analog sound (i.e. amplitudes) recorded through a range of microphone types
(e.g. a condenser microphone) to be converted into digital representations (i.e. numbers).
Analog-to-digital converters (A/D converters) and digital-to-analog converters (D/A
converters) allowed a range of digital audio file formats to emerge (e.g. the .WAV format).
These advances made many more voice recordings possible across an array of consumer
hardware.
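As a rough sketch of this pipeline, the snippet below (with illustrative parameters throughout) quantizes a sine wave into integer samples, much as an A/D converter samples a microphone signal, and stores them in a .WAV file using Python's standard-library wave module:

```python
# Sketch of A/D conversion plus .WAV storage (parameters are illustrative).
# A real A/D converter samples a microphone voltage; here we "sample" a
# mathematical 440 Hz sine wave instead.
import math
import struct
import wave

RATE = 8000      # samples per second
FREQ = 440       # tone frequency in Hz
DURATION = 1.0   # seconds

# Quantize each sample to a signed 16-bit integer, like a 16-bit A/D converter.
samples = [
    int(32767 * math.sin(2 * math.pi * FREQ * n / RATE))
    for n in range(int(RATE * DURATION))
]

# Store the samples in a mono, 16-bit .WAV file.
with wave.open("tone.wav", "wb") as f:
    f.setnchannels(1)
    f.setsampwidth(2)
    f.setframerate(RATE)
    f.writeframes(b"".join(struct.pack("<h", s) for s in samples))
```

Playing the resulting file back through a sound card is exactly the reverse, D/A, direction of the same conversion.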
Shortly thereafter, personal computers (~1981) made voice computing accessible to
consumers. In particular, consumers for the first time could hear primitive text-to-speech
systems read back text typed on their keyboards (e.g. iMac - TTS license).
The advent of the
Internet (1989) allowed voice media to be played back through web browsers (e.g. 1995 -
Internet Explorer). Storage media on personal computers (PCs), like CD and DVD writers
(1990s), also greatly contributed to the adoption and distribution of digital file formats. Moreover,
PCs enabled the development of new programming languages, such as Python, which later
became quite significant for processing voice data.
Internet period (2000-2010) - The 2000s were a golden era for recording and publishing
high-quality digital voice recordings on the Internet. Lossless audio codecs (e.g. 2001 - .FLAC)
were invented, significantly reducing audio file sizes with no loss in quality. The
adoption of high-fidelity microphones (e.g. MEMS mics) and sound cards on PCs enabled
podcasting communities to emerge (e.g. iTunes / YouTube), allowing voice and video
content to be distributed and monetized worldwide.
Google Chrome was also released in
2008, creating new ways to record and play back audio within the web browser
(e.g. through embedded audio and video codecs). Smartphones (e.g. iPhones) also became
widely available, creating new opportunities to record and transmit voice content (e.g. through
VoIP services like Skype).
Moreover, software packages like FFmpeg
and SoX
allowed developers with personal
computers to generate and manipulate sounds from the command line. Many Python libraries
emerged (e.g. LibROSA, NLTK, scikit-learn) to featurize and model text and speech data. Taken
together, these shifts made Python a great choice for voice computing projects.
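As a toy illustration of what "featurizing" speech data means, the snippet below computes two classic audio features by hand, RMS energy and zero-crossing rate, on a synthetic signal. Libraries like LibROSA provide far richer, production-grade versions of such features:

```python
# Toy audio featurization: RMS energy and zero-crossing rate, computed
# by hand on a synthetic signal (real pipelines use libraries like LibROSA).
import math

def rms_energy(samples):
    """Root-mean-square amplitude of a list of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def zero_crossing_rate(samples):
    """Fraction of adjacent sample pairs where the signal changes sign."""
    crossings = sum(
        1 for a, b in zip(samples, samples[1:]) if (a < 0) != (b < 0)
    )
    return crossings / (len(samples) - 1)

# A 100-sample square-ish wave that flips sign every 10 samples (illustrative).
signal = [1 if (n // 10) % 2 == 0 else -1 for n in range(100)]
print(rms_energy(signal))          # 1.0 for this unit-amplitude signal
print(zero_crossing_rate(signal))  # 9 sign changes over 99 sample pairs
```

Feature vectors like these are what ultimately get fed into scikit-learn models when classifying or clustering voice recordings.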
Analog-to-digital converters (A/D converters) convert analog sound (amplitudes over time) into samples (numbers).
Digital-to-analog converters (D/A converters) convert digital representations of sound back into analog sound.
Consumers also heard TTS systems during this time through interfaces like Atari consoles and video games. It’s interesting
to view such consoles as voice computers, as they were among the first interfaces for voice computing.
History of interfaces - First, we had keyboards to start and stop commands. Then, as applications became more
sophisticated, the mouse was invented to help navigate them more seamlessly (e.g. to save files
and move between applications). New interfaces then emerged as computers became
miniaturized: CD players (Walkmans, etc.), iPods, and BlackBerrys. It was not until the iPhone that the interface for
smartphones became optimal: phones became pocket computers [screen/cpu] loaded with applications enabled by
“touchscreen interfaces.” [mouse/keyboard]
iTunes - I listened to the Brain Science Podcast (hosted by Ginger Campbell) and the Stanford Entrepreneurial
Thought Leaders Seminars (hosted by Tina Seelig / Stanford) during this time.
FFmpeg is a powerful program for transcoding audio from one file format to another.
SoX is another program for manipulating audio files, with features like silence removal and automatic generation of
sounds.