Disclaimer: I won’t cover any algorithms or APIs for sound processing and speech recognition here. This article is about the problems of audio processing and how to solve them with Go.

gopher

phono is an application framework for sound processing. Its main purpose is to create a pipeline out of different technologies to process sound the way you need.

What is a pipeline, why different technologies, and why do we need another framework? We will figure that out now.

Whence the sound?

In 2018, sound became a standard way of interaction between humans and technology. The majority of IT giants either already have a voice assistant or are working on one right now. Voice control is integrated into major operating systems, and voice messages are a default feature of any modern messenger. There are around a thousand startups working on natural language processing and around two hundred on speech recognition.

It’s the same story with music. It’s played from every kind of device, and sound recording is available to anyone with a computer. Music software is developed by hundreds of companies and thousands of enthusiasts around the globe.

Common tasks

If you have ever faced a sound-processing task, the following conditions should be familiar to you:

  • Audio should be received from a file, device, network, etc.
  • Audio should be processed: FX added, encoded, analyzed, etc.
  • Audio should be sent to a file, device, network, etc.
  • Data is transmitted in small buffers

It turns out to be a pipeline: a data stream that goes through several stages of processing.

Solutions

For clarity, let’s take a real-life problem: we need to transform voice into text:

  • Record audio with a device
  • Remove noise
  • Equalize
  • Send the signal to a voice-recognition API

Like any other problem, this one has several solutions.

Brute force

For hardcore developers and wheel-inventors only. Record sound directly through the sound interface driver, write your own smart noise suppressor and multi-band equalizer. This is very interesting, but you can forget about your original task for several months.

Time-consuming and very complex.

Normal way

The alternative is to use existing audio APIs. It’s possible to record audio with ASIO, CoreAudio, PortAudio, ALSA, and others. Multiple plugin standards are available for processing: AAX, VST2, VST3, AU.

A rich selection doesn’t mean that everything can be used. The following limitations usually apply:

  1. Operating system. Not all APIs are available on every OS. For example, AU is an OS X native technology and is available only there.
  2. Programming language. The majority of audio libraries are written in C or C++. In 1996, Steinberg released the first version of the VST SDK, still the most popular plugin format. Today you no longer have to write in C/C++: there are VST wrappers for Java, Python, C#, Rust, and who knows what else. Language remains a limitation, but nowadays people process sound even in JavaScript.
  3. Functionality. If the problem is simple, there is no need to write a new application. FFmpeg alone has a ton of features.

In this case, the complexity depends on your choices. In the worst case, you’ll have to deal with multiple libraries. And if you’re really unlucky, with complex abstractions and completely different interfaces.

What in the end?

We have to choose between very complicated and complicated:

  • deal with several low-level APIs and invent our own wheels
  • deal with several existing APIs and try to make them work together

Whichever way you choose, the task always comes down to a pipeline. The technologies may differ, but the essence is the same. The problem is that, once again, instead of solving our actual problem, we first have to build a pipeline.

But there is another option.

phono


phono was created to solve these common tasks: to “receive, process, and send” sound. It uses the pipeline as the most natural abstraction. There is a post in the official Go blog that describes the pipeline pattern. Its core idea is that there are several stages of data processing working independently and exchanging data through channels. That’s exactly what we need.
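The pattern from that blog post can be sketched with nothing but goroutines and channels. This is not phono code, just a minimal illustration: each stage runs in its own goroutine, owns its output channel, and closes it when done.

```go
package main

import "fmt"

// pump generates buffers; it owns and closes its output channel.
func pump(buffers [][]float64) <-chan []float64 {
	out := make(chan []float64)
	go func() {
		defer close(out)
		for _, b := range buffers {
			out <- b
		}
	}()
	return out
}

// process multiplies every sample by gain; it reads from its input
// channel and writes results to its own output channel.
func process(in <-chan []float64, gain float64) <-chan []float64 {
	out := make(chan []float64)
	go func() {
		defer close(out)
		for b := range in {
			for i := range b {
				b[i] *= gain
			}
			out <- b
		}
	}()
	return out
}

// sink consumes buffers; here it just prints them.
func sink(in <-chan []float64) {
	for b := range in {
		fmt.Println(b)
	}
}

func main() {
	buffers := [][]float64{{1, 2}, {3, 4}}
	sink(process(pump(buffers), 2)) // prints [2 4] and then [6 8]
}
```

Each stage only knows about its channels, so stages of completely different technologies can be chained this way.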

But why Go?

First, a lot of audio software and libraries are written in C, and Go is known as its successor. On top of that, there is cgo and a wide variety of bindings for existing audio APIs, ready to be picked up and used.

Second, in my opinion, Go is a good language. I won’t dive deep here, but I’ll note its concurrency capabilities: channels and goroutines make implementing a pipeline significantly easier.

Abstractions

The pipe.Pipe struct is the heart of phono: it implements the pipeline pattern. Just as in the blog example, there are three stage types defined:

  1. pipe.Pump - receives sound, has only output channels
  2. pipe.Processor - processes sound, has both input and output channels
  3. pipe.Sink - sends sound, has only input channels

Data is transferred in buffers within a pipe.Pipe. The rules for building a pipe:

pipe_diagram

  1. A single pipe.Pump
  2. Multiple pipe.Processor stages, placed sequentially
  3. One or more pipe.Sink stages, placed in parallel
  4. All pipe.Pipe components must have the same:
    • Buffer size
    • Sample rate
    • Number of channels

The minimal configuration is a Pump with a single Sink; the rest is optional.

Let’s go through several examples.

Easy

Problem: play a wav file.

Let’s express the problem in the “receive, process, send” form:

  1. Receive audio from wav file
  2. Send audio to portaudio device

example_easy

Audio is read and immediately played back.

Source code


package example

import (
	"github.com/dudk/phono"
	"github.com/dudk/phono/pipe"
	"github.com/dudk/phono/portaudio"
	"github.com/dudk/phono/wav"
)

// Example:
//		Read .wav file
//		Play it with portaudio
func easy() {
	wavPath := "_testdata/sample1.wav"
	bufferSize := phono.BufferSize(512)
	// wav pump
	wavPump, err := wav.NewPump(
		wavPath,
		bufferSize,
	)
	check(err)

	// portaudio sink
	paSink := portaudio.NewSink(
		bufferSize,
		wavPump.WavSampleRate(),
		wavPump.WavNumChannels(),
	)

	// build pipe
	p := pipe.New(
		pipe.WithPump(wavPump),
		pipe.WithSinks(paSink),
	)
	defer p.Close()
	
	// run pipe
	err = p.Do(pipe.Run)
	check(err)
}

First, we create all elements of the pipeline, wav.Pump and portaudio.Sink, and pass them to the pipe.New constructor. The function p.Do(pipe.actionFn) error starts the pipeline and waits until it’s done.

More difficult

Problem: cut a wav file into samples, arrange them on a track, then save and play the result at the same time.

A sample is a small piece of audio, and a track is a sequence of samples. In order to sample the audio, we first have to load it into memory. We can use asset.Asset from the phono/asset package for this purpose. Express the problem in the standard steps:

  1. Receive audio from wav file
  2. Send audio to memory

Now we can cut the samples, add them to the track, and finalize the solution:

  1. Receive audio from track
  2. Send audio to:
    • wav file
    • portaudio device

Again, there is no processing stage, but now we have two pipelines!

example_normal

Source code



package example

import (
	"github.com/dudk/phono"
	"github.com/dudk/phono/asset"
	"github.com/dudk/phono/pipe"
	"github.com/dudk/phono/portaudio"
	"github.com/dudk/phono/track"
	"github.com/dudk/phono/wav"
)

// Example:
//		Read .wav file
// 		Split it to samples
// 		Put samples to track
//		Save track into .wav and play it with portaudio
func normal() {
	bufferSize := phono.BufferSize(512)
	inPath := "_testdata/sample1.wav"
	outPath := "_testdata/example4_out.wav"

	// wav pump
	wavPump, err := wav.NewPump(inPath, bufferSize)
	check(err)

	// asset sink
	asset := &asset.Asset{
		SampleRate: wavPump.WavSampleRate(),
	}

	// import pipe
	importAsset := pipe.New(
		pipe.WithPump(wavPump),
		pipe.WithSinks(asset),
	)
	defer importAsset.Close()
	err = importAsset.Do(pipe.Run)
	check(err)

	// track pump
	track := track.New(bufferSize, asset.NumChannels())

	// add samples to track
	track.AddFrame(198450, asset.Frame(0, 44100))
	track.AddFrame(66150, asset.Frame(44100, 44100))
	track.AddFrame(132300, asset.Frame(0, 44100))

	// wav sink
	wavSink, err := wav.NewSink(
		outPath,
		wavPump.WavSampleRate(),
		wavPump.WavNumChannels(),
		wavPump.WavBitDepth(),
		wavPump.WavAudioFormat(),
	)
	check(err)
	// portaudio sink
	paSink := portaudio.NewSink(
		bufferSize,
		wavPump.WavSampleRate(),
		wavPump.WavNumChannels(),
	)

	// final pipe
	p := pipe.New(
		pipe.WithPump(track),
		pipe.WithSinks(wavSink, paSink),
	)

	err = p.Do(pipe.Run)
	check(err)
}

Compared to the previous example, there are two pipe.Pipe instances. The first one transfers data into memory so we can do the sampling. The second one has two sinks in its final stage: wav.Sink and portaudio.Sink. With this configuration, the sound is simultaneously saved to a wav file and played back.

Even more difficult

Problem: read two wav files, mix them, process the result with a vst2 plugin, and save it into a new wav file.

There is a simple mixer, mixer.Mixer, in the phono/mixer package. It accepts signals from multiple sources and produces a single mixed one. To achieve this, it implements both pipe.Sink and pipe.Pump at the same time.
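How can one component be both a sink and a pump? The idea can be illustrated with plain channels. This is a hypothetical sketch, not the actual mixer.Mixer implementation: several input streams are summed sample by sample into one output stream.

```go
package main

import "fmt"

// mix reads one buffer from every input, sums them sample by sample,
// and sends the result downstream. It assumes all inputs produce
// equally sized buffers, and stops as soon as any input is exhausted.
func mix(inputs ...<-chan []float64) <-chan []float64 {
	out := make(chan []float64)
	go func() {
		defer close(out)
		for {
			var sum []float64
			for _, in := range inputs {
				b, ok := <-in
				if !ok {
					return // an input is done: stop mixing
				}
				if sum == nil {
					sum = make([]float64, len(b))
				}
				for i, s := range b {
					sum[i] += s
				}
			}
			out <- sum
		}
	}()
	return out
}

// source emits the given buffers and closes its channel.
func source(buffers ...[]float64) <-chan []float64 {
	out := make(chan []float64)
	go func() {
		defer close(out)
		for _, b := range buffers {
			out <- b
		}
	}()
	return out
}

func main() {
	a := source([]float64{1, 2})
	b := source([]float64{3, 4})
	for buf := range mix(a, b) {
		fmt.Println(buf) // prints [4 6]
	}
}
```

To its upstream pipes, mix is a sink (it consumes their channels); to the downstream pipe, it is a pump (it produces a channel).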

Again, the problem consists of two smaller ones. The first one looks like this:

  1. Receive audio from wav file
  2. Send audio to mixer

Second:

  1. Receive audio from mixer
  2. Process audio with plugin
  3. Send audio to wav file

example_hard

Source code



package example

import (
	"github.com/dudk/phono"
	"github.com/dudk/phono/mixer"
	"github.com/dudk/phono/pipe"
	"github.com/dudk/phono/vst2"
	"github.com/dudk/phono/wav"
	vst2sdk "github.com/dudk/vst2"
)

// Example:
//		Read two .wav files
//		Mix them
// 		Process with vst2
//		Save result into new .wav file
//
// NOTE: for simplicity both wav files have the same characteristics,
// i.e. sample rate, bit depth and number of channels.
// In real life a conversion step would be needed.
func five() {
	bs := phono.BufferSize(512)
	inPath1 := "../_testdata/sample1.wav"
	inPath2 := "../_testdata/sample2.wav"
	outPath := "../_testdata/out/example5.wav"

	// wav pump 1
	wavPump1, err := wav.NewPump(inPath1, bs)
	check(err)

	// wav pump 2
	wavPump2, err := wav.NewPump(inPath2, bs)
	check(err)

	// mixer
	mixer := mixer.New(bs, wavPump1.WavNumChannels())

	// track 1
	track1 := pipe.New(
		pipe.WithPump(wavPump1),
		pipe.WithSinks(mixer),
	)
	defer track1.Close()
	// track 2
	track2 := pipe.New(
		pipe.WithPump(wavPump2),
		pipe.WithSinks(mixer),
	)
	defer track2.Close()

	// vst2 processor
	vst2path := "../_testdata/Krush.vst"
	vst2lib, err := vst2sdk.Open(vst2path)
	check(err)
	defer vst2lib.Close()

	vst2plugin, err := vst2lib.Open()
	check(err)
	defer vst2plugin.Close()

	vst2processor := vst2.NewProcessor(
		vst2plugin,
		bs,
		wavPump1.WavSampleRate(),
		wavPump1.WavNumChannels(),
	)

	// wav sink
	wavSink, err := wav.NewSink(
		outPath,
		wavPump1.WavSampleRate(),
		wavPump1.WavNumChannels(),
		wavPump1.WavBitDepth(),
		wavPump1.WavAudioFormat(),
	)
	check(err)

	// out pipe
	out := pipe.New(
		pipe.WithPump(mixer),
		pipe.WithProcessors(vst2processor),
		pipe.WithSinks(wavSink),
	)
	defer out.Close()

	// run all
	track1Done, err := track1.Begin(pipe.Run)
	check(err)
	track2Done, err := track2.Begin(pipe.Run)
	check(err)
	outDone, err := out.Begin(pipe.Run)
	check(err)

	// wait results
	err = track1.Wait(track1Done)
	check(err)
	err = track2.Wait(track2Done)
	check(err)
	err = out.Wait(outDone)
	check(err)
}

Here we have three instances of pipe.Pipe, all connected by the mixer. Execution is started with p.Begin(pipe.actionFn) (pipe.State, error). Unlike p.Do(pipe.actionFn) error, it doesn’t block the caller and simply returns the expected state. That state can then be awaited with p.Wait(pipe.State) error.
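This begin/wait split is a common Go pattern. The sketch below is not phono internals; begin is a hypothetical helper that models the same shape: starting work returns a handle, and receiving from the handle blocks until the work is done.

```go
package main

import "fmt"

// begin launches fn asynchronously and returns a channel that will
// receive its error exactly once. Receiving from the channel is the
// blocking "wait" counterpart of the non-blocking "begin".
func begin(fn func() error) <-chan error {
	done := make(chan error, 1)
	go func() { done <- fn() }()
	return done
}

func main() {
	// start two tasks without blocking, like p.Begin(pipe.Run)
	d1 := begin(func() error { fmt.Println("track 1 finished"); return nil })
	d2 := begin(func() error { fmt.Println("track 2 finished"); return nil })

	// block until both are done, like p.Wait(state)
	if err := <-d1; err != nil {
		panic(err)
	}
	if err := <-d2; err != nil {
		panic(err)
	}
}
```

All three pipes can run concurrently this way, and the caller only blocks at the very end, once per pipe.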

What’s next?

I want phono to be a truly convenient application framework. If you have a task involving sound, you shouldn’t need to understand complex APIs or spend time studying standards. All you need is to build a pipeline from suitable elements and launch it.

Over the first six months, the following packages have been built:

  • phono/wav - read/write wav files
  • phono/vst2 - incomplete VST2 SDK bindings: plugins can be opened and their methods called, but not all structures are mapped yet
  • phono/mixer - a mixer that sums N signals, without balance and volume controls yet
  • phono/asset - buffers for sampling
  • phono/track - sequential reading of buffers
  • phono/portaudio - audio playback, experimental

In addition to this list, there is a constantly growing backlog of new ideas, among them:

  • Time measurement
  • Mutable on-the-fly pipeline
  • HTTP pump/sink
  • Parameters automation
  • Resampling-processor
  • Balance and volume for mixer
  • Real-time pump
  • Synchronized pump for multiple tracks
  • Full vst2 support

Topics for upcoming articles:

  • the lifecycle of pipe.Pipe - due to its complex internal structure, its state is managed with a finite-state machine
  • how to write your own pipe stages

This is my first open-source project, so I would be happy to get any help and recommendations. You’re welcome.