Place for jupyter notebooks and other code related to SI15.


DJ Dataset

Goal: Produce some audio "digital deconstructions" based on training samples from the Common Voice dataset. In the process, explore what a dataset is and how it relates to the technique of Deep Learning (and situate this term in a larger context). Consider how artistic interventions can go beyond using a novel technique, to (also) "talking back" to these technologies and working on a critical / reflective level.

Common Voice

Common Voice is part of Mozilla's initiative to help teach machines how real people speak. In addition to the Common Voice dataset, we’re also building an open source speech recognition engine called Deep Speech. Both of these projects are part of our efforts to bridge the digital speech divide. Voice recognition technologies bring a human dimension to our devices, but developers need an enormous amount of voice data to build them. Currently, most of that data is expensive and proprietary. We want to make voice data freely and publicly available, and make sure the data represents the diversity of real people. Together we can make voice recognition better for everyone.


DeepSpeech is an open source embedded (offline, on-device) speech-to-text engine which can run in real time on devices ranging from a Raspberry Pi 4 to high power GPU servers.

DeepSpeech is an open-source Speech-To-Text engine, using a model trained by machine learning techniques based on Baidu's Deep Speech research paper. Project DeepSpeech uses Google's TensorFlow to make the implementation easier.

Training Data

Large-scale deep learning systems require an abundance of labeled data. For our system we need many recorded utterances and corresponding English transcriptions, but there are few public datasets of sufficient scale. To train our largest models we have thus collected an extensive dataset consisting of 5000 hours of read speech from 9600 speakers. For comparison, we have summarized the labeled datasets available to us in Table 2.

Deep Speech: Scaling up end-to-end speech recognition

Speech Corpora

We started by downloading freely available speech corpora like TED-LIUM and LibriSpeech, as well as acquiring paid corpora like Fisher and Switchboard. We wrote importers in Python for the different data sets that convert the audio files to WAV, split the audio, and clean up the transcriptions, removing unneeded characters like punctuation and accents. Finally we stored the preprocessed data in CSV files that can be used to feed data into the network.
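An importer along these lines (lowercase the transcript, strip accents and punctuation, write path/transcript pairs to CSV) could be sketched as follows. The function and column names here are our own illustration, not Mozilla's actual importer code, and we use the standard library's unicodedata for accent stripping:

```python
import csv
import string
import unicodedata

def clean_transcript(text):
    """Lowercase, strip accents (NFD-decompose, drop combining marks), remove punctuation."""
    text = unicodedata.normalize("NFD", text.lower())
    text = "".join(c for c in text if not unicodedata.combining(c))
    return text.translate(str.maketrans("", "", string.punctuation)).strip()

def write_manifest(rows, out_path):
    """rows: iterable of (wav_path, transcript) pairs; writes a CSV to feed the network."""
    with open(out_path, "w", newline="") as fout:
        writer = csv.writer(fout)
        writer.writerow(["wav_filename", "transcript"])
        for wav, transcript in rows:
            writer.writerow([wav, clean_transcript(transcript)])
```

For example, `clean_transcript("Één, twee!")` yields `"een twee"` -- the same kind of normalization that makes transcripts comparable across corpora.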


To build a speech corpus that’s free, open source, and big enough to create meaningful products with, we worked with Mozilla’s Open Innovation team and launched the Common Voice project to collect and validate speech contributions from volunteers all over the world. Today, the team is releasing a large collection of voice data into the public domain. Find out more about the release on the Open Innovation Medium blog.

A Journey to <10% Word Error Rate

What is Deep Learning

In the past few years, artificial intelligence (AI) has been a subject of intense media hype. Machine learning, deep learning, and AI come up in countless articles, often outside of technology-minded publications. We’re promised a future of intelligent chatbots, self-driving cars, and virtual assistants—a future sometimes painted in a grim light and other times as utopian, where human jobs will be scarce and most economic activity will be handled by robots or AI agents. For a future or current practitioner of machine learning, it’s important to be able to recognize the signal in the noise so that you can tell world-changing developments from overhyped press releases. Our future is at stake, and it’s a future in which you have an active role to play: after reading this book, you’ll be one of those who develop the AI agents. So let’s tackle these questions: What has deep learning achieved so far? How significant is it? Where are we headed next? Should you believe the hype?


Redlining gets its name because the practice first involved drawing literal red lines on a map. (Sometimes the areas were shaded red instead, as in the map in figure 2.2.) All of Detroit’s Black neighborhoods fall into red areas on this map because housing discrimination and other forms of structural oppression predated the practice. But denying home loans to the people who lived in these neighborhoods reinforced those existing inequalities and, as decades of research have shown, was directly responsible for making them worse.

"Machine Learning for artists"

ml4a is a collection of free educational resources devoted to machine learning for artists.

It contains an in-progress book, being written by @genekogan, which can be seen in draft form here. Four chapters are complete and others are in varying stages of progress, or are stubs containing links.

The book is complemented by a set of 40+ instructional guides maintained by collaborators, along with interactive demos and figures, and video lectures.

Notes for artistic intervention

  • Beware of reinforcing the hype -- deflate over-inflated claims
  • Track down, look at, make visible, and question the data sets
  • Explore the "errors" the model makes
  • Make the predictive nature of the models more apparent.
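The second point above can start very concretely: the Common Voice TSVs carry metadata columns like gender, age, and accent, so tallying them makes visible who is (and is not) represented in the training data. A minimal sketch, assuming the train.tsv layout used elsewhere in this notebook (the column_counts helper is our own):

```python
import csv
from collections import Counter

def column_counts(tsv_lines, column):
    """Tally one metadata column of a Common Voice TSV; empty cells count as 'unstated'."""
    counts = Counter()
    for row in csv.DictReader(tsv_lines, delimiter="\t"):
        counts[row.get(column) or "unstated"] += 1
    return counts

# with open("commonvoice/cv-corpus-6.1-singleword/nl/train.tsv") as f:
#     print(column_counts(f, "gender"))
```

The share of "unstated" entries is itself telling: a large portion of the dataset's demographic picture is simply unknown.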

The Coded Gaze

Face detection and classification algorithms are also used by US-based law enforcement for surveillance and crime prevention purposes. In “The Perpetual Lineup”, Garvie and colleagues provide an in-depth analysis of the unregulated police use of face recognition and call for rigorous standards of automated facial analysis, racial accuracy testing, and regularly informing the public about the use of such technology (Garvie et al., 2016). Past research has also shown that the accuracies of face recognition systems used by US-based law enforcement are systematically lower for people labeled female, Black, or between the ages of 18-30 than for other demographic cohorts (Klare et al., 2012). The latest gender classification report from the National Institute of Standards and Technology (NIST) also shows that algorithms NIST evaluated performed worse for female-labeled faces than male-labeled faces (Ngan et al., 2015).

Buolamwini, J., Gebru, T. “Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification.” Proceedings of Machine Learning Research 81:1-15, 2018 Conference on Fairness, Accountability, and Transparency


These models are not “pure algorithms”, but in fact are the product of being trained with thousands of examples, images that have been given labels like “happy, neutral, angry, and disgusted”. FER2013 itself is a troubled archive, created by university students for a computer science competition, which stipulated that the images would not be part of an already existing collection. As a result, the researchers used Google image search to perform automated searches to produce the collection. But who has made these subjective judgments? Answering why exactly it is that, among the 30,000 collected images, a photo of actor Samuel L. Jackson appears among the examples of “angry” is complex. When producing an interpretation of a new image, the data model reflects the training data and how and by whom it was created. In this work, we wanted to draw a parallel between contemporary data and surveillance practices and those of the colonial photographic projects Antje was critiquing.


FORUM POST showing precarity of the project


In [ ]:
from urllib.parse import urljoin, quote as urlquote
import os

def get_public_url():
    """ assumes you are inside a subfolder of your public_html folder """
    user = os.environ.get("USER")
    rel_pwd = os.path.relpath(os.getcwd(), os.path.expanduser("~/public_html"))
    return f"{user}/{urlquote(rel_pwd)}/"

public_url = get_public_url()
print (public_url)
In [ ]:
import csv
import json


want = ("één", "twee", "drie", "vier")
out = {}
with open("commonvoice/cv-corpus-6.1-singleword/nl/train.tsv") as fin:
    for row in csv.DictReader(fin, delimiter="\t"):
        s = row['sentence']
        if s in want:
            # print (f"{row['path']} {s} {row['gender']}")
            print (f"{s} {public_url}{row['path']}")
            if s not in out:
                out[s] = []
            out[s].append({"path": row['path'], "sentence": s})

with open("counting_nl.json", "w") as fout:
    print (json.dumps(out, indent=2), file=fout)

Check the output

In [ ]:
# can ffmpeg transcode from a URL ?!
In [ ]:
# ffmpeg reads its input after -i; fill in the URL of a clip to test with
!ffmpeg -y -i CLIP_URL test.mp3
In [ ]:
!sox test.wav test_trim.wav silence 1 0.1 1%
In [ ]:
!sox test.wav test_trim.wav silence 1 0.1 1% -1 0.1 1%
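To unpack the sox invocations above: `silence 1 0.1 1%` removes silence from the start of the file (one period, at least 0.1 seconds, anything below 1% amplitude counts as silence), and appending `-1 0.1 1%` applies the same rule in reverse to trim the end. A small helper (our own wrapper, not part of sox) makes the two variants explicit:

```python
import subprocess

def sox_trim_cmd(src, dst, trim_end=True):
    """Build a sox command that strips leading (and optionally trailing) silence.
    silence 1 0.1 1%  -> trim silence from the start (>=0.1s below the 1% level);
    -1 0.1 1%         -> the same rule applied to the end of the file."""
    cmd = ["sox", src, dst, "silence", "1", "0.1", "1%"]
    if trim_end:
        cmd += ["-1", "0.1", "1%"]
    return cmd

# subprocess.run(sox_trim_cmd("test.wav", "test_trim.wav"), check=True)
```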
In [ ]:
!ffmpeg -i test_trim.wav test_trim.mp3
In [ ]:
with open("counting_nl.json") as f:
    data = json.load(f)
    for key in data:
        print (f"key: {key}")
        for item in data[key][:5]:
            print (item['sentence'])
In [ ]:
import os
from unidecode import unidecode

# assumption: clips sit in the clips/ folder next to train.tsv in the Common Voice download
CLIPS_DIR = "commonvoice/cv-corpus-6.1-singleword/nl/clips/"

ruby_data = {}
with open("counting_nl.json") as f:
    data = json.load(f)
    for key in data:
        print (f"key: {key}")
        ruby_data[unidecode(key)] = []
        for item in data[key][:5]:
            src = CLIPS_DIR + item['path']
            print (src, item['sentence'])
            mp3 = "counting_nl/" + item['path']
            wav = mp3.replace(".mp3", ".wav")
            if not os.path.exists(wav):
                os.makedirs("counting_nl", exist_ok=True)
                os.system(f"ffmpeg -y -i {src} tmp.wav")
                os.system("sox tmp.wav tmp_trim.wav silence 1 0.1 1% -1 0.1 1%")
                os.system(f"ffmpeg -i tmp_trim.wav {mp3}")
                os.system(f"mv tmp_trim.wav {wav}")
                os.system("rm tmp.wav")
            print (f"{public_url}{mp3}")
            ruby_data[unidecode(key)].append(item['path'].replace(".mp3", ".wav"))
            # print (item['sentence'])
with open("counting_nl.ruby.json", "w") as fout:
    print (json.dumps(ruby_data, indent=2), file=fout)
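The resulting counting_nl.ruby.json maps each (unaccented) word to a list of .wav filenames. One way to start "DJing" the dataset is to pick a random recording per word and sequence them in counting order; the pick_sequence helper below is our own sketch, not part of the notebook's toolchain:

```python
import json
import os
import random

def pick_sequence(samples, words):
    """One randomly chosen clip filename per word, skipping words with no clips."""
    return [random.choice(samples[w]) for w in words if samples.get(w)]

# assumes the JSON written by the cell above exists
if os.path.exists("counting_nl.ruby.json"):
    with open("counting_nl.ruby.json") as f:
        print(pick_sequence(json.load(f), ("een", "twee", "drie", "vier")))
```

Re-running it yields a different voice for each number every time, which is the point: the "deconstruction" shuffles the dataset's many speakers into one counting voice.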
In [ ]:
!zip -r counting_nl.zip counting_nl
In [ ]: