NEURAL-DRIVEN MULTI-BAND PROCESSING FOR AUTOMATIC EQUALIZATION AND STYLE TRANSFER
Conference paper, Proceedings of the International Conference on Digital Audio Effects, DAFx, 2025
We present the Neural-Driven Multi-Band Processor (NDMP), a differentiable audio processing framework that augments a static six-band Parametric Equalizer (PEQ) with per-band dynamic range compression. We optimize this processor using neural inference for two tasks: Automatic Equalization (AutoEQ), which estimates tonal and dynamic corrections without a reference, and Production Style Transfer (NDMP-ST), which adapts the processing of an input signal to match the tonal and dynamic characteristics of a reference. We train NDMP using a self-supervised strategy in which the model learns to recover a clean signal from inputs degraded with randomly sampled NDMP parameters and gain adjustments. This setup eliminates the need for paired input-target data and enables end-to-end training with audio-domain loss functions. At inference, AutoEQ enhances previously unseen inputs in a blind setting, while NDMP-ST performs style transfer by predicting task-specific processing parameters. We evaluate our approach on the MUSDB18 dataset using both objective metrics (e.g., SI-SDR, PESQ, STFT loss) and a listening test. Our results show that NDMP consistently outperforms traditional PEQ and a single-band PEQ+DRC baseline, offering a robust neural framework for audio enhancement that combines learned spectral and dynamic control.
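The self-supervised recipe above (degrade a clean signal with randomly sampled processor parameters, then learn to undo the degradation) can be sketched in a few lines. This is a toy illustration in plain Python: the band count comes from the abstract, but the gain range and the energy-domain simplification are assumptions, not the paper's actual differentiable PEQ.

```python
import random

NUM_BANDS = 6  # six-band PEQ, as in the abstract

def apply_band_gains(band_energies, gains_db):
    """Scale each band's energy by a gain given in dB."""
    return [e * 10 ** (g / 20) for e, g in zip(band_energies, gains_db)]

def sample_degradation(rng, low_db=-12.0, high_db=12.0):
    """Randomly sample per-band gains, standing in for the random
    NDMP parameters used to corrupt clean training inputs."""
    return [rng.uniform(low_db, high_db) for _ in range(NUM_BANDS)]

rng = random.Random(0)
clean = [1.0] * NUM_BANDS
gains = sample_degradation(rng)
degraded = apply_band_gains(clean, gains)

# The ideal corrective parameters simply negate the degradation gains;
# the network is trained to predict such corrections from audio alone.
restored = apply_band_gains(degraded, [-g for g in gains])
```

Because the degradation parameters are known, clean/degraded pairs come for free, which is what removes the need for paired input-target data.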
Diff-DEQ: Differentiable Dynamic Equalization for Studio-Quality Speech Processing
Sarkar P., Lindborg P.
Conference paper, European Signal Processing Conference, 2025
We present Differentiable Dynamic Equalization (Diff-DEQ), a fully differentiable deep learning framework for speech equalization and enhancement aimed at studio-quality audio post-production. Unlike fixed-rule equalization methods, it adapts spectral components dynamically, responding to input signal variations to achieve precise, content-aware spectral shaping. The model combines a FiLM-modulated Temporal Convolutional Network (TCN) and a Bidirectional Gated Recurrent Unit (BiGRU) to predict per-band equalization parameters, with audio feature-based conditioning for improved adaptability. We train the model in a self-supervised manner that eliminates the need for paired input-target data. We evaluate Diff-DEQ against parametric equalization (PEQ) using objective metrics across the LibriTTS, DAPS, and VCTK datasets, and use non-intrusive speech quality assessment for subjective evaluation. Our results show that Diff-DEQ enhances speech intelligibility and perceived quality, making it well suited for audio post-production.
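The FiLM conditioning mentioned above is just a feature-wise affine transform whose scale and shift are predicted from conditioning information. A minimal plain-Python sketch follows; in the paper, gamma and beta modulate TCN activations and are learned from audio features, whereas here they are passed in directly.

```python
def film(features, gamma, beta):
    """Feature-wise Linear Modulation: out_i = gamma_i * x_i + beta_i."""
    return [g * x + b for x, g, b in zip(features, gamma, beta)]

# Conditioning that leaves channel 0 untouched and rescales channel 1.
modulated = film([1.0, 2.0], [1.0, 3.0], [0.0, 0.5])
```

The same two vectors per layer are all the conditioning pathway has to produce, which is why FiLM is a cheap way to make a fixed backbone content-aware.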
1D-Touch: NLP-Assisted Coarse Text Selection via a Semi-Direct Gesture
Jiang P., Feng L., Sun F., Sarkar P., Xia H., Liu C.
Article, Proceedings of the ACM on Human-Computer Interaction, 2023
Existing text selection techniques on touchscreens focus on improving control for moving carets. Coarse-grained text selection at the word and phrase levels has not received much support beyond word-snapping and entity recognition. We introduce 1D-Touch, a novel text selection method that complements caret-based sub-word selection by facilitating the selection of semantic units at the word level and above. This method employs a simple vertical slide gesture to expand and contract a selection area from a word. The expansion can be by words or by semantic chunks ranging from sub-phrases to sentences. This technique shifts the concept of text selection from defining a range by locating the first and last words towards a dynamic process of expanding and contracting a textual semantic entity. To understand the effects of our approach, we prototyped and tested two variants: WordTouch, which offers a straightforward word-by-word expansion, and ChunkTouch, which leverages NLP to chunk text into syntactic units, allowing the selection to grow by semantically meaningful units in response to the sliding gesture. Our evaluation, focused on the coarse-grained selection tasks handled by 1D-Touch, shows a 20% improvement over the default word-snapping selection method on Android.
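A ChunkTouch-style expansion can be sketched as mapping slide steps to how many chunks around the anchor are selected. The chunk boundaries below are a hard-coded toy; the paper derives syntactic chunks with NLP.

```python
def expand_selection(chunks, anchor_idx, steps):
    """Grow a selection outward from the anchored chunk, one chunk per
    vertical slide step, clamped to the bounds of the text."""
    lo = max(0, anchor_idx - steps)
    hi = min(len(chunks) - 1, anchor_idx + steps)
    return " ".join(chunks[lo:hi + 1])

chunks = ["The quick brown fox", "jumps", "over the lazy dog"]
```

Sliding further increases `steps`, so the selection grows by whole semantic units rather than caret positions; sliding back contracts it the same way.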
Data-driven pause prediction for synthesis of storytelling style speech based on discourse modes
Sarkar P., Rao K.S.
Conference paper, 2015 IEEE International Conference on Electronics, Computing and Communication Technologies, CONECCT 2015, 2016
In storytelling style, a storyteller generally uses prosodic variations and subtle speech nuances to aid listeners' comprehension. This is achieved by emphasizing prominent words, using various emotions, mimicking voices and providing appropriate pauses. This work is part of building Story Text-to-Speech (TTS) [1] synthesis systems in Indian languages, which aim to synthesize storytelling-style speech from a neutral TTS. The neutral speech is converted to storytelling style by modifying specific prosodic parameters (i.e., duration, pitch, tempo, intensity and pauses). The main contribution of this paper is to model the pause patterns present in storytelling-style speech based on the modes of discourse (narrative, descriptive and dialogue) to capture story-semantic information. Analysis of pause patterns is carried out on children's stories in Hindi. We analyzed the pause patterns and classified pauses into three categories for each mode of discourse: short, medium and long. A three-stage data-driven method is proposed to predict the position and duration of the pauses. We conducted objective tests to evaluate the performance of the proposed method at each stage. Subjective evaluation is also carried out on the final output of the Hindi Story-TTS system, and it indicates that subjects perceived an improvement in speech quality in terms of storytelling style.
Automatic pitch accent contour transcription for Indian languages
Reddy M.G., Sen P., Manjunath K.E., Dutta A., Haque A., Sarkar P., Rao K.S.
Conference paper, IEEE International Conference on Computer Communication and Control, IC4 2015, 2016
In this paper, an automatic method to transcribe the pitch accent contour from the speech signal is presented. Pitch contour transcription refers to labeling the temporal variations of the pitch contour of a speech signal with a finite number of discrete labels. The pitch contour is derived from the zero frequency filtered (ZFF) speech signal. A non-linear smoothing technique is used to remove spurious pitch values from the contour, and an intonation-like contour is obtained by removing the trend in the pitch contour. The locations of tonal variations in the intonation phrases are identified and assigned appropriate tone labels. The pitch contour transcription is then derived from the tone labels and the corresponding timing information in the pitch contour. The automatic pitch contour transcription system is evaluated using read, extempore and conversational modes of speech from 11 Indian languages. For each mode of speech, speaker-wise subjective evaluation is carried out across the 11 languages to validate the correctness of the proposed method.
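Two of the steps above, non-linear smoothing of the pitch contour and labeling local tonal movement, can be sketched with stdlib tools. The median window size and the rise/fall threshold below are illustrative assumptions, not the paper's settings.

```python
from statistics import median

def median_smooth(pitch, win=3):
    """Non-linear (median) smoothing to suppress spurious pitch values."""
    half = win // 2
    return [median(pitch[max(0, i - half): i + half + 1])
            for i in range(len(pitch))]

def tone_labels(contour, eps=1.0):
    """Label each local pitch movement as rise (R), fall (F) or level (L)."""
    labels = []
    for prev, cur in zip(contour, contour[1:]):
        diff = cur - prev
        labels.append("R" if diff > eps else "F" if diff < -eps else "L")
    return labels
```

A median filter suppresses isolated pitch-tracking spikes without smearing genuine rises and falls, which is why a non-linear smoother is preferred here over simple averaging.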
Analysis and modeling pauses for synthesis of storytelling speech based on discourse modes
Sarkar P., Rao K.S.
Conference paper, 2015 8th International Conference on Contemporary Computing, IC3 2015, 2015
In Text-to-Speech (TTS) synthesis systems, pause prediction plays a vital role in synthesizing natural and expressive speech. In storytelling style, pauses introduce suspense and climax by emphasizing the prominent or emotion-salient words in a story. The objective of this work is to analyze and model pause patterns to capture story-semantic information, as a stepping stone towards developing a Story TTS based on modes of discourse. We analyze pauses in Hindi children's stories for each mode of discourse: narrative, descriptive and dialogue. After grouping the sentences into modes, we analyse the pause patterns to capture story-semantic information. A three-stage data-driven method is proposed to predict the location and duration of pauses for each mode. Both objective and subjective tests are conducted to evaluate the performance of the proposed method. The subjective evaluation indicates that subjects appreciated the quality of speech synthesized using the proposed model.
Data-driven pause prediction for speech synthesis in storytelling style speech
Sarkar P., Sreenivasa Rao K.
Conference paper, 2015 21st National Conference on Communications, NCC 2015, 2015
In storyteller speech, pauses play a significant role in introducing suspense and climax. Pauses are used to emphasize keywords and emotion-salient words and to separate phrases in an utterance. The objective of this work is to predict the position and duration of pauses in speech synthesized by a text-to-speech system. We analyzed the pause patterns in storyteller speech and classified pauses into three categories: short, medium and long. A data-driven, three-stage pause prediction model is proposed. In the first stage, a model is trained to identify pause positions within an utterance using a set of word-level features. In the second stage, the pauses are classified into the three categories using a set of syllable-level features. In the final stage, a regression predictor is trained to predict the pause duration for each category. We conducted both objective and subjective tests to evaluate the proposed method. The subjective evaluation showed that subjects perceived a noticeable difference in the speech synthesized using the proposed method.
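The three-stage pipeline reads naturally as a cascade of three predictors. The sketch below wires the stages together with toy rule-based stand-ins: the paper trains models on word-level and syllable-level features, and the punctuation rules and durations here are invented for illustration.

```python
def stage1_pause_positions(words):
    """Stage 1: decide after which words a pause occurs
    (toy rule: pause after punctuation-bearing words)."""
    return [i for i, w in enumerate(words) if w and w[-1] in ",.!?"]

def stage2_pause_category(word):
    """Stage 2: classify the pause as short, medium or long."""
    return "long" if word.endswith((".", "!", "?")) else "short"

def stage3_pause_duration(category):
    """Stage 3: predict a duration in seconds for the category."""
    return {"short": 0.2, "medium": 0.4, "long": 0.7}[category]

def predict_pauses(text):
    """Cascade the three stages: position -> category -> duration."""
    words = text.split()
    result = []
    for i in stage1_pause_positions(words):
        cat = stage2_pause_category(words[i])
        result.append((i, cat, stage3_pause_duration(cat)))
    return result
```

Keeping the stages separate means each can be trained and evaluated on its own, which matches the stage-wise objective tests reported in the abstract.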
Conversion of neutral speech to storytelling style speech
Verma R., Sarkar P., Rao K.S.
Conference paper, ICAPR 2015 - 2015 8th International Conference on Advances in Pattern Recognition, 2015
Speech is the most basic and widely used method of communication, and there is a growing need for expressive speech synthesis, especially when humans want to communicate with robots and computers. In this paper, prosody rule-sets are designed to convert neutral speech to storytelling-style speech for Hindi. To generate storyteller speech from neutral speech, modifications to various prosodic parameters such as pitch, intensity, duration, tempo and pauses are considered. Rules are developed separately for each of these prosodic parameters and for story-specific emotions (sad, anger, fear, surprise and neutral). These rules are designed by analyzing stories collected from a professional storyteller. Modifications are made at both the phrase and sentence level, and the rules are derived for both male and female speakers. Subjective tests are conducted to evaluate the quality of the generated storytelling-style speech, and the influence of speaker characteristics on neutral-to-story speech conversion is analysed.
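An emotion-specific prosody rule-set of this kind can be represented as multiplicative modification factors per parameter. The factor values below are invented for illustration; the papers derive their rules from analysis of a professional storyteller's recordings.

```python
# Hypothetical rule-set: scale factors per prosodic parameter and emotion.
RULES = {
    "neutral":  {"pitch": 1.00, "intensity": 1.00, "duration": 1.00, "tempo": 1.00},
    "sad":      {"pitch": 0.90, "intensity": 0.85, "duration": 1.15, "tempo": 0.90},
    "surprise": {"pitch": 1.25, "intensity": 1.10, "duration": 0.95, "tempo": 1.05},
}

def apply_rules(prosody, emotion):
    """Scale each prosodic parameter by the rule factor for the emotion."""
    rule = RULES[emotion]
    return {name: value * rule[name] for name, value in prosody.items()}

neutral = {"pitch": 200.0, "intensity": 1.0, "duration": 1.0, "tempo": 1.0}
sad = apply_rules(neutral, "sad")
```

Separate tables per emotion (and, as in the paper, per phrase vs. sentence level and per speaker gender) keep the conversion transparent and easy to tune by listening tests.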
Modeling Pauses for Synthesis of Storytelling Style Speech Using Unsupervised Word Features
Sarkar P., Rao K.S.
Conference paper, Procedia Computer Science, 2015
In storytelling-style speech, pauses or phrase breaks play a significant role in introducing suspense and climax in the story. Pauses are often used by a storyteller to capture the audience's attention by emphasizing keywords, focusing on emotion-salient words, and separating key phrases in an utterance. The goal of the work presented in this paper is to predict the location of pauses in an utterance synthesized by a Story Text-To-Speech (TTS) system, using unsupervised word-level features. Traditional methods for predicting pauses use linguistic features such as Parts-of-Speech (POS) tags, chunking information or terminal syllables. These methods presuppose the availability of linguistic knowledge from an automatic tagger or a manually annotated corpus. However, such resources are not readily available for Indian languages, and manually annotating text with this linguistic information is tedious and time-consuming. Moreover, these features do not capture the co-occurrence statistics of words. Hence, we propose a framework for integrating the Story TTS with the proposed pause prediction module, in which an unlabeled text corpus is used to extract continuous-valued word-level features to model the pause patterns in storytelling speech. A set of story-specific (SS) features is introduced to capture story-semantic information based on pause patterns. Various combinations of pause prediction systems are proposed: B, POS, U, POS+SS and U+SS. These systems are evaluated objectively using F1 score.
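Continuous-valued word features built from co-occurrence statistics can be sketched as normalised context-count vectors over an unlabeled corpus. The window size and normalisation below are assumptions, not the paper's exact recipe.

```python
from collections import Counter

def cooccurrence_features(sentences, window=1):
    """Map each word to a normalised vector of co-occurrence counts
    with every vocabulary word, within a symmetric context window."""
    vocab = sorted({w for s in sentences for w in s})
    index = {w: i for i, w in enumerate(vocab)}
    counts = {w: Counter() for w in vocab}
    for sent in sentences:
        for i, w in enumerate(sent):
            lo, hi = max(0, i - window), min(len(sent), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[w][sent[j]] += 1
    features = {}
    for w in vocab:
        total = sum(counts[w].values()) or 1
        vec = [0.0] * len(vocab)
        for ctx, n in counts[w].items():
            vec[index[ctx]] = n / total
        features[w] = vec
    return vocab, features
```

Because such vectors need only raw text, they sidestep the POS taggers and annotated corpora that are scarce for Indian languages.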
Designing prosody rule-set for converting neutral TTS speech to storytelling style speech for Indian languages: Bengali, Hindi and Telugu
Sarkar P., Haque A., Dutta A.K., Gurunath Reddy M., Harikrishna D.M., Dhara P., Verma R., Narendra N.P., Sunil Kr S.B., Yadav J., Rao K.S.
Conference paper, 2014 7th International Conference on Contemporary Computing, IC3 2014, 2014
This paper presents the design of prosody rule-sets for transforming neutral speech synthesized by a Text-to-Speech (TTS) system into storytelling-style speech. The objective of this work is to synthesize storyteller speech from a neutral TTS system for a given story text as input; here, neutral TTS refers to a TTS system developed in the Festival framework with a neutral speech corpus. To generate storyteller speech from neutral TTS, we propose modifications to various prosodic parameters of the neutral synthesized speech: (i) pitch contour, (ii) duration patterns, (iii) intensity patterns, (iv) pause patterns and (v) tempo. We have designed individual rule-sets for these prosodic parameters, separately for three Indian languages: Bengali, Hindi and Telugu. The rule-sets are designed by carefully analyzing the perceptual differences between synthesized neutral speech utterances and their respective natural (original) spoken utterances, narrated by a storyteller. The designed prosody rule-sets are evaluated using subjective listening tests. The results of the perceptual evaluation indicate that the designed prosody rule-sets play a significant role in achieving story-specific style during conversion from neutral to storytelling-style speech.