This post discusses the importance of interactive machine learning (IML) for developing participatory algorithmic systems for non-expert users and presents a novel, open-source IML tool, Audio Soup, designed to support sample review and feature selection for audio datasets. Code Repository: https://github.com/accraze/audio-soup Demo App: https://audio-soup.herokuapp.com

BACKGROUND

Artificial Intelligence and Machine Learning systems have become ubiquitous in recent years. Countless applications have been deployed across industry, government and academia, all of which aid users in making important decisions or predictions. In Q1 of 2020 alone, an estimated $6.9 billion USD was raised for AI-based businesses (VentureBeat, 2020), which points to the proliferation and importance of these types of systems. Their predictions and decisions can affect broad groups of people, yet their inner workings are often opaque to the majority of the population. Recently, an emerging line of research has suggested the possibility of creating interactive machine learning systems that focus on participation with communities of non-expert users. By lowering the barriers to participation for less-technical stakeholders, these algorithmic systems can become more fair, transparent, contestable and accountable.

Interactive Machine Learning

Interactive Machine Learning seeks to democratize the Machine Learning Development Cycle by lowering the barriers to participation for non-expert users. This is often done through intelligent interface components that keep humans in the loop when building and operating algorithmic systems. Users can be included in every step of the Machine Learning Development Cycle (MLDC) using a variety of methods. Cognitive feedback can be collected from the user through self-reporting, implicit feedback and modeled feedback, using surveys, forms and interactive components (Michael et al., 2020).

Machine Learning Development Cycle

Similar to Software Development, Machine Learning has specialized workflows that are commonly seen across projects. There are a number of stages within the Machine Learning Development Cycle (MLDC) that can benefit from generalized IML interfaces, such as sample review, feature selection, model selection, training, evaluation, and deployment (Dudley & Kristensson, 2018). The two stages most welcoming to non-expert users are sample review and feature selection, as they generally require the most human interaction to begin with. When viewed through an IML lens, there are new ways of thinking about building datasets that could increase transparency and contestability from the start. Feature selection often requires domain expertise and a strong understanding of machine learning, which can raise the barriers to participation. By developing intuitive and adaptable interfaces, non-expert users can explore the feature space of any given domain and experiment with different targets found within their data. Expert users, too, may be unaware of certain features available within their datasets. When diverse stakeholders participate in critical roles during the development of algorithmic systems, there is a noticeable increase in fairness, accountability and transparency (Denton et al., 2020).

Data Curation & Sample Review

The first major task that requires “human-in-the-loop” interaction is dataset curation and sample review. While users may already have been involved in collecting data, there has been a push towards increasing inclusivity by improving annotations and adding contestability features to combat bias. Researchers have made a case for treating datasets the way archives and libraries treat their collections. This means including additional metadata, such as socio-cultural information, in an attempt to reduce bias and embed the stakeholders’ values within the system (Jo & Gebru, 2020). Adding more robust information alongside annotations allows users to understand the ‘who’ and the ‘why’ behind a dataset. Another important approach is making datasets contestable, meaning that stakeholders can refute samples within the dataset if they believe they are incorrect. Research has shown that by including a more diverse set of data curators, algorithmic systems become more transparent and carry less risk of creator bias (Denton et al., 2020).

Feature Selection

The next step in the MLDC workflow is Feature Selection, where facets of the data are selected in order to boost the signal of the targeted outcome. This stage can benefit from IML methods, specifically in aggregating crowd-sourced features. In many systems, a large base of users can be leveraged to find features that explain misclassifications, and research shows that this approach can improve the exploration and prototyping of features across tasks such as deception detection and quality assessment (Cheng & Bernstein, 2015). This appears to be most straightforward in text-based domains, although some research has shown promising results in image-based domains as well (Dudley & Kristensson, 2018). One major challenge for feature selection interfaces is how best to visualize time-series data, such as audio for sound and speech recognition, due to the large feature space and the varying levels of technical skill possessed by users (Ishibashi et al., 2020). This difficulty points towards a need for user-friendly, adaptive interfaces that support varying skill levels, especially within the domain of audio data.

In terms of developing interactive interfaces for sample review and dataset annotation for audio, a recent tool called Edyson presents audio embeddings within a cluster space as a novel approach to annotation (Fallgren et al., 2018). openSMILE, the Munich versatile and fast open-source audio feature extractor, has been successful in efficiently extracting features from large-scale audio datasets (Eyben et al., 2010). Interactive sound recognition GUIs have been explored using a variety of techniques, with visual representations alongside spectrograms providing the most consistent understanding among technical and non-technical users when annotating audio data (Ishibashi et al., 2020).

While much of the aforementioned work on IML audio tools has centered on expert users, research has shown that providing summary representations for both text and time-series data is difficult to scale while still conveying enough meaningful information to non-expert users (Dudley & Kristensson, 2018). For this reason, IML researchers have found that interfaces should also be “adaptive”, hiding expert components from less technical users, which allows a wider audience to participate with IML tools (Yang et al., 2018). Researchers have also pointed out that the dataset curation and annotation process risks having creator bias embedded deep within its data (Jo & Gebru, 2020). Interfaces must be designed so that stakeholders can understand and refute the algorithmic decisions (Lee et al., 2017). Additionally, datasets should be comparable over time and by other types of metadata, such as source or origin (Hohman et al., 2020).

AUDIO SOUP

Building on the previous work in the space of audio tools and Interactive Machine Learning interfaces, Audio Soup attempts to address a number of challenges raised in the recent literature. This open-source tool is designed specifically for the tasks of Sample Review and Feature Selection by non-expert users and is released free of charge under the permissive MIT License for anyone interested in using it.

Architecture

Audio Soup is a browser-based interface written in Python, with a PostgreSQL database. It is currently packaged as a Docker container for maximum portability across operating systems. Command-line utilities are provided to assist in loading datasets into the application. For development and demo purposes, the tool ships with an optional dataset that is a subset of the Google Speech Commands dataset (Warden, 2018).

There are two primary routes that users can navigate: the grid view for Sample Review and the Feature Selection view. Both views use a modal-card layout built with the Bulma CSS framework. The card layout was chosen for its simplicity and its familiarity from other browser-based workflows. The following sections describe these views in greater detail.

Sample Review (Grid View)

The root view of the application is called the “grid view” and displays audio files in a paginated, card-based layout. The samples can be filtered by label or viewed as a whole. The audio waveform images are dynamically generated during page load using a Flask context processor that encodes a PNG image as a Base64 string, which is then injected into the presentation template for display. This is a different approach than the one described by Ishibashi et al. in [6], since the audio file is already available on disk and does not need to be extracted from video. For performance reasons, pagination is limited to 10 cards per page in the initial prototype. The following figure (Figure 1) shows the grid view with modal cards, pagination links and audio waveform images.
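To make the waveform-rendering flow concrete, below is a minimal sketch of one way it could be wired up with a Flask context processor, assuming librosa and matplotlib for loading and plotting; the function and variable names are illustrative and not the actual Audio Soup internals.

```python
# Sketch only: illustrative names, not the real Audio Soup code.
import base64
import io

import librosa
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt
from flask import Flask

app = Flask(__name__)

def waveform_png_base64(path):
    """Render an audio file's waveform to a Base64-encoded PNG string."""
    y, sr = librosa.load(path, sr=None)
    fig, ax = plt.subplots(figsize=(4, 1.5))
    ax.plot(y, linewidth=0.5)
    ax.axis("off")
    buf = io.BytesIO()
    fig.savefig(buf, format="png", bbox_inches="tight")
    plt.close(fig)
    return base64.b64encode(buf.getvalue()).decode("ascii")

@app.context_processor
def waveform_helpers():
    # Expose the helper to Jinja templates, e.g.
    # <img src="data:image/png;base64,{{ waveform_png(sample.path) }}">
    return {"waveform_png": waveform_png_base64}
```

Embedding the image as a data URI keeps the page self-contained, at the cost of re-rendering on each request, which is one reason the prototype caps the page size.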

Each card can be reviewed by clicking the “Review” button at the bottom of the card, which pulls up a modal card that displays the audio waveform, its metadata and an audio player that allows users to listen to the sample. There are also buttons that allow the user to edit the sample’s metadata or jump to the Feature Selection view for that specific audio sample. The following figure (Figure 2) shows the modal card displaying the metadata and audio player.

Feature Selection

The second primary view is the Feature Selection view, accessed from the modal card of any given audio sample within the loaded dataset. The purpose of this view is to visualize the feature space and allow users to see explanations of each specific feature. Users are also given the option to select features and export them as JSON files for the entire dataset. The features are organized into three categories: Spectral, Rhythmic and Deltas. The following figure (Figure 4) shows the available spectral features, which include Mel Spectrogram, Tonal Centroid (Tonnetz) and Spectral Contrast. The spectral features allow users to explore the spectrum of frequencies and pitches across time.
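As a rough illustration of what extracting these spectral features and exporting them as JSON might look like, here is a short sketch built on librosa; the output schema and file names are assumptions for illustration, not the exact Audio Soup export format.

```python
# Sketch only: schema and file names are illustrative.
import json

import librosa

def spectral_features(path):
    """Extract the three spectral features shown in the Feature Selection view."""
    y, sr = librosa.load(path, sr=None)
    return {
        "mel_spectrogram": librosa.feature.melspectrogram(y=y, sr=sr).tolist(),
        "tonnetz": librosa.feature.tonnetz(y=y, sr=sr).tolist(),
        "spectral_contrast": librosa.feature.spectral_contrast(y=y, sr=sr).tolist(),
    }

if __name__ == "__main__":
    features = spectral_features("sample.wav")
    with open("sample_features.json", "w") as f:
        json.dump(features, f)
```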

Similar to the grid view, each feature in the Feature Selection view can be clicked to reveal a modal card with a basic explanation of that feature, along with a link to a corresponding Wikipedia page or relevant research paper. This is an attempt to make the feature space of an audio dataset more approachable for non-expert users who wish to learn more. The following figure (Figure 5) shows the modal card explanation for the Mel Spectrogram.

The next category of available audio features is the “Rhythmic” features, which show the prevalence of certain speeds at each moment in time for a given audio sample. The currently available rhythmic features are the Tempogram and the Fourier Tempogram, two different ways of estimating onset speeds within a sample. The following figure (Figure 6) shows the Feature Selection view with the Rhythmic features selected. Again, clicking on each feature brings up a modal card with more information about the feature and relevant links to Wikipedia pages or white papers.
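For readers curious how these rhythmic features are computed in practice, the sketch below derives both tempogram variants with librosa from an onset-strength envelope; the parameters are library defaults and the file name is a placeholder.

```python
# Sketch only: default parameters, placeholder file name.
import numpy as np
import librosa

y, sr = librosa.load("sample.wav", sr=None)
onset_env = librosa.onset.onset_strength(y=y, sr=sr)

# Autocorrelation tempogram: local autocorrelation of the onset envelope.
tempogram = librosa.feature.tempogram(onset_envelope=onset_env, sr=sr)

# Fourier tempogram: short-time Fourier transform of the onset envelope
# (complex-valued, so the magnitude is taken for visualization).
fourier_tempogram = np.abs(
    librosa.feature.fourier_tempogram(onset_envelope=onset_env, sr=sr)
)
```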

The final category of available audio features is the “Delta” features, which are essentially manipulations of features already available for an audio sample. The MFCC deltas build on the Mel spectrogram representation, examining the first- and second-order derivatives of the MFCCs to find underlying patterns. The stack memory feature gives users the ability to concatenate delayed copies of the tonnetz/chromagram across time. The following figure (Figure 6) shows the Feature Selection view with the Delta features selected.
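The delta-style manipulations described above can be sketched with librosa as follows; the choice of 13 MFCCs and three stacked steps is illustrative rather than what Audio Soup necessarily uses.

```python
# Sketch only: 13 MFCCs and n_steps=3 are illustrative choices.
import librosa

y, sr = librosa.load("sample.wav", sr=None)

# MFCC deltas: first- and second-order derivatives of the MFCC matrix.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
mfcc_delta = librosa.feature.delta(mfcc)
mfcc_delta2 = librosa.feature.delta(mfcc, order=2)

# Stack memory: concatenate delayed copies of a chroma representation across time.
chroma = librosa.feature.chroma_stft(y=y, sr=sr)
chroma_stacked = librosa.feature.stack_memory(chroma, n_steps=3)
```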

NEXT STEPS

Currently, the v0.1.0 prototype is published as a container on DockerHub and contains all of the features and views mentioned in the previous two sections. There are a number of features which should be added in forthcoming releases, most notably the ability to compare features across different audio samples. Work has started on the comparison view, although its performance and UX need a considerable amount of work before it is released to the public.

The Feature Selection export module also needs to handle file formats beyond JSON. The export options should be expanded to encompass other industry-standard formats like CSV, YAML and plain text. These were left out of the v0.1.0 release in the interest of time; however, there are numerous cases where JSON may not be sufficient. Additionally, the “Deltas” features should be expanded into something like a “Feature Manipulation” or “Feature Augmentation” filter, which would allow users to perform different types of transformations or filters across all the files in a dataset. A sandbox would be helpful for users to experiment with different manipulation types prior to applying the changes across the dataset. Another planned feature is expanding the annotations section in the Sample Review grid view. For now the text property is a free-text field, although semantic text representation could be added as a way to help users create more robust annotations, as mentioned by Ishibashi et al. in [6].

Lastly, a round of user testing should be done with both non-expert users and users who have DSP and Machine Learning expertise, in order to gauge how effective the application is at lowering the barriers to participation. It would be interesting to hear what each group of users thinks are necessary features and where more time should be spent in future development. In the spirit of open-source and free software, all of these ideas for new development have been added as “Issues” on GitHub and are open to the community to discuss. The issue tracker can be found at: https://github.com/accraze/audio-soup/issues

CONCLUSION

It is still early days for Interactive Machine Learning systems, although a number of unique studies and tools have emerged in recent years. Building interfaces for varying levels of technical skill has proven to be difficult, although researchers have proposed developing interfaces that can assist non-expert users in participating in the Machine Learning Development Cycle. Image and text-based domains seem to work best with current IML tooling, while time-series data, like audio, remains more challenging to visualize accurately for non-experts within an ML context. In this paper we have presented a novel IML tool for non-experts to participate in the Sample Review and Feature Selection process for audio data. The current prototype (v0.1.0) allows for free-text annotations in the file metadata as an attempt to avoid creator bias, as discussed by Jo & Gebru in [3]. It displays basic explanations for each feature that can be extracted from a given audio file, while attempting to strike a balance that avoids additional cognitive overhead, as mentioned by Pu & Chen in [13]. It is released freely under the MIT license for use by researchers and enthusiasts alike.

REFERENCES

  1. Dudley, J. J., & Kristensson, P. O. (2018). A review of user interface design for interactive machine learning. ACM Transactions on Interactive Intelligent Systems, 8(2), 1–37. https://doi.org/10.1145/3185517
  2. Michael, C. J., Acklin, D., & Scheuerman, J. (2020). On interactive machine learning and the potential of cognitive feedback. ArXiv:2003.10365 [Cs]. http://arxiv.org/abs/2003.10365
  3. Jo, E. S., & Gebru, T. (2020). Lessons from archives: Strategies for collecting sociocultural data in machine learning. Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, 306–316. https://doi.org/10.1145/3351095.3372829
  4. Denton, E., Hanna, A., Amironesei, R., Smart, A., Nicole, H., & Scheuerman, M. K. (2020). Bringing the people back in: Contesting benchmark machine learning datasets. ArXiv:2007.07399 [Cs]. http://arxiv.org/abs/2007.07399
  5. Cheng, J., & Bernstein, M. S. (2015). Flock: Hybrid crowd-machine learning classifiers. Proceedings of the 18th ACM Conference on Computer Supported Cooperative Work & Social Computing (CSCW ’15), 600–611. https://doi.org/10.1145/2675133.2675214
  6. Ishibashi, T., Nakao, Y., & Sugano, Y. (2020). Investigating audio data visualization for interactive sound recognition. Proceedings of the 25th International Conference on Intelligent User Interfaces, 67–77. https://doi.org/10.1145/3377325.3377483
  7. Fallgren, P., Malisz, Z., & Edlund, J. (2018, May). Bringing order to chaos: A non-sequential approach for browsing large sets of found audio data. Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018). LREC 2018, Miyazaki, Japan. https://www.aclweb.org/anthology/L18-1680
  8. Eyben, F., Wöllmer, M., & Schuller, B. (2010). openSMILE: The Munich versatile and fast open-source audio feature extractor. Proceedings of the 18th ACM International Conference on Multimedia, 1459–1462.
  9. Yang, Q., Suh, J., Chen, N.-C., & Ramos, G. (2018). Grounding interactive machine learning tool design in how non-experts actually build models. Proceedings of the 2018 on Designing Interactive Systems Conference 2018 - DIS ’18, 573–584. https://doi.org/10.1145/3196709.3196729
  10. Lee, M. K., Kim, J. T., & Lizarondo, L. (2017). A human-centered approach to algorithmic services: Considerations for fair and motivating smart community service management that allocates donations to non-profit organizations. Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems, 3365–3376. https://doi.org/10.1145/3025453.3025884
  11. Hohman, F., Wongsuphasawat, K., Kery, M. B., & Patel, K. (2020). Understanding and visualizing data iteration in machine learning. Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems, 1–13. https://doi.org/10.1145/3313831.3376177
  12. Warden, P. (2018). Speech commands: A dataset for limited-vocabulary speech recognition. ArXiv:1804.03209 [Cs]. http://arxiv.org/abs/1804.03209
  13. Pu, P., & Chen, L. (2006). Trust building with explanation interfaces. Proceedings of the 11th International Conference on Intelligent User Interfaces, 93–100. https://doi.org/10.1145/1111449.1111475
  14. AI startups raised $6.9 billion in Q1 2020, a record-setting pace before coronavirus. (2020, April 14). VentureBeat. https://venturebeat.com/2020/04/14/ai-startups-raised-6-9-billion-in-q1-2020-a-record-setting-pace-before-coronavirus/