
II. Dataset

a) Collection of Dataset

Data is central to every machine learning and deep learning project. For this research, we required audio recordings of people speaking sentences that we chose. These sentences captured, to a reasonable extent, the range of accent variation in the spoken Kashmiri language. In total, 20 sentences were chosen for the research, and people were recorded speaking these sentences in their native accents of the Kashmiri language. The data was collected from 5 districts of Kashmir. All files were saved with the 'ogg' extension and converted to 'wav' format during preprocessing (a conversion sketch is given at the end of Section III). We obtained almost 100 voice samples from each area, giving a total of 500 voice samples of these sentences spoken by different people.

b) Making of the Dataset

From these audio files we extracted MFCCs and Mel spectrograms, so our final dataset consisted of images of MFCCs and Mel spectrograms. Since deep learning models require a huge amount of data, we had to augment the data to increase the size of our dataset. Many effective augmentation techniques exist for images and audio, but because our images depict features rather than natural scenes, standard image augmentations such as distortion and rotation could not be used [reference]. Instead, a special kind of augmentation known as SpecAugment [20], which produces augmented spectrograms, was applied. It performed two operations on the images of the Mel spectrograms: 1) frequency masking, where a band of frequencies is masked out, and 2) time masking, where a span of time steps is masked out (sketched at the end of Section III). Even after augmenting the Mel spectrogram images, the data was not enough, so we also performed augmentation on the audio files themselves. The audio files were augmented by increasing their speed, pitch, and amplitude, introducing some variability into the initial dataset (also sketched at the end of Section III). After these augmentations, we had a dataset large enough for deep learning.

III. Proposed System

a) Architecture

We used a CNN-based architecture with the ReLU activation function for internal nodes and a softmax function to output the probability distribution over the output classes. CNN models [19] show state-of-the-art performance on image data. Since our approach was to extract features from the audio data, plot them as images, and feed those images to the model, we chose a model based on the CNN architecture. Our model has six convolution layers and six max-pooling layers, with five dense layers and a flatten layer (a sketch follows at the end of this section).

[image of model]

Accuracy and loss varied with the feature used and the learning rate of the model. We chose different learning rates for the different features that were input to the model; our models were trained with learning rates between 0.001 and 0.0001.

b) Features Used

Many features have been used in audio-processing research. To keep our research simple, we decided on two features: MFCCs and Mel spectrograms. MFCCs have been found to perform well for audio classification [21], and Mel spectrograms and spectrograms have also shown such performance in many cases [19]. Different numbers of MFCC coefficients can be used; most often 13 are taken, and the choice depends on the problem at hand. We experimented with various numbers of coefficients and finally decided on 13, 24, and 36 coefficients.
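As described in Section II(a), the 'ogg' recordings were converted to 'wav' during preprocessing. A minimal sketch of that step, assuming the librosa and soundfile libraries and a hypothetical file name:

```python
import librosa
import soundfile as sf

# Load the .ogg recording, keeping its original sampling rate.
y, sr = librosa.load("speaker01.ogg", sr=None)

# Write the same signal back out in .wav format.
sf.write("speaker01.wav", y, sr)
```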
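The SpecAugment-style masking of Section II(b) can be sketched with plain NumPy. The mask widths and the use of the spectrogram minimum as the fill value are our assumptions, not values taken from the paper:

```python
import numpy as np

def spec_augment(mel, freq_mask=8, time_mask=16, rng=None):
    """Mask one random band of mel bins (frequency masking) and one
    random span of frames (time masking) in a (n_mels, n_frames) array."""
    rng = rng or np.random.default_rng()
    mel = mel.copy()
    n_mels, n_frames = mel.shape
    f0 = rng.integers(0, n_mels - freq_mask)    # frequency masking
    mel[f0:f0 + freq_mask, :] = mel.min()
    t0 = rng.integers(0, n_frames - time_mask)  # time masking
    mel[:, t0:t0 + time_mask] = mel.min()
    return mel
```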
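The audio-level augmentation described in Section II(b) (speed, pitch, and amplitude changes) maps onto librosa's effects module; the specific rate, semitone step, and gain below are illustrative assumptions:

```python
import numpy as np
import librosa

y, sr = librosa.load("speaker01.wav", sr=None)

faster = librosa.effects.time_stretch(y, rate=1.1)           # ~10% faster playback
shifted = librosa.effects.pitch_shift(y, sr=sr, n_steps=2)   # shift up two semitones
louder = np.clip(y * 1.5, -1.0, 1.0)                         # higher amplitude, clipped to [-1, 1]
```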
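A minimal Keras sketch of the topology described in Section III(a): six convolution/max-pooling pairs, a flatten layer, and five dense layers ending in a 5-way softmax (one class per district). The filter counts, kernel sizes, input shape, and the Adam optimizer are our assumptions; the paper specifies only the layer counts, the activations, and the learning-rate range:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

model = models.Sequential()
model.add(layers.Input(shape=(128, 128, 3)))        # assumed image size; 3 color channels
for filters in (16, 32, 64, 64, 128, 128):          # assumed filter progression
    model.add(layers.Conv2D(filters, (3, 3), padding="same", activation="relu"))
    model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Flatten())
for units in (512, 256, 128, 64):                   # four hidden dense layers
    model.add(layers.Dense(units, activation="relu"))
model.add(layers.Dense(5, activation="softmax"))    # fifth dense layer: one unit per accent class

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),  # trained between 0.001 and 0.0001
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)
```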
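Both features of Section III(b) can be extracted with the librosa library [22]; a sketch of computing MFCCs (with the coefficient counts used here) and a Mel spectrogram, and saving the latter as an image, under assumed file names:

```python
import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt

y, sr = librosa.load("speaker01.wav", sr=None)

# MFCCs with 13, 24, or 36 coefficients, as experimented with in the paper.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

# Mel spectrogram, converted to decibels before plotting.
mel_db = librosa.power_to_db(librosa.feature.melspectrogram(y=y, sr=sr), ref=np.max)

fig, ax = plt.subplots()
librosa.display.specshow(mel_db, sr=sr, x_axis="time", y_axis="mel", ax=ax)
fig.savefig("speaker01_mel.png")
plt.close(fig)
```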
These features were extracted, plotted as images, and those images were input to our model.

Mel Spectrograms - A Mel spectrogram is a spectrogram whose frequency axis has been converted to the Mel scale; plotting the spectrogram of an audio file on the Mel scale gives the Mel spectrogram. These spectrograms were plotted as images, in the same way as the MFCCs, and given as input to the model. All these feature-extraction operations were done using the librosa library [22], which makes working with audio very easy.

IV. Experimental Setup and Results

Different experiments were performed on different features, and different learning rates were set during the training of the models. The features were stored in two ways. In the first, images of the features were generated. In the second, no images were generated; the features were extracted, a dimension was added, and the result was stored in JSON format (a sketch of this mode is given below, after Experiment 1). Below, we show the results of the various experiments.

a) Experiment 1

This experiment was done using images of Mel spectrograms and MFCCs for the CNN having 3 color channels. The images were generated from the audio files, saved, and later loaded back into the model. The models were trained on those images and evaluated on the validation and testing sets.

i. Mel Spectrograms

The figure below shows the metrics graphically, and we can conclude from the graph that the model shows state-of-the-art results on our data.

[image of metrics]
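A sketch of the second storage mode described at the start of Section IV (features extracted, a dimension added, and the result stored as JSON); the file name, dictionary layout, and label value are our assumptions:

```python
import json
import numpy as np
import librosa

y, sr = librosa.load("speaker01.wav", sr=None)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
mfcc = mfcc[..., np.newaxis]                  # add a dimension: (n_mfcc, n_frames, 1)

# JSON cannot store ndarrays directly, so convert to nested lists.
record = {"mfcc": mfcc.tolist(), "label": 0}  # hypothetical label for the speaker's district
with open("features.json", "w") as f:
    json.dump(record, f)
```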

RkJQdWJsaXNoZXIy NTg4NDg=