Your problem calls for time-frequency analysis: https://en.wikipedia.org/wiki/Time%E2%80%93frequency_analysis

Generate some waterfall plots (spectrograms) of time vs. frequency using windowed Fourier transforms. Inspect those, and/or use them as input to your learning approaches. Depending on the approach you choose, the waterfall plots can be analyzed like images, with the caveat that the rows and columns represent entirely different physical quantities (frequency and time). That's the main place where scikit-image could potentially assist.

This is its own sub-field of digital signal processing. Now that you know the keyword to search for, you can peruse a large body of literature to assist with your project.

Josh

On Friday, May 22, 2015 at 2:00:58 PM UTC-5, user783746 wrote:
I am writing a program to classify recorded phone-call audio files (wav) as either containing at least some human voice or non-voice (only DTMF, dial tones, ringtones, noise). I tried implementing a simple VAD (voice activity detector) using ZCR (zero-crossing rate) and energy, but these parameters confuse DTMF and dial-tone files with voice.
I also tried implementing a machine-learning-based approach using an SVM (Support Vector Machine) and MFCC coefficients. The results were worse than with the previous approach.
I need someone to advise me a little on this domain; I have no previous experience in machine learning or AI. I am willing to put a good amount of time into this domain.
I am comfortable working in MATLAB, scipy, numpy, scikit-learn, python.
Thank you
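
For reference, here is a minimal sketch of the waterfall-plot idea from the reply at the top of this thread: a Hann-windowed short-time Fourier transform written with NumPy only. The function name `waterfall`, the frame length, the hop size, and the synthetic DTMF test tone are all assumptions made for illustration, not anything from the original posts; a real recording would need to be loaded from its wav file first.

```python
import numpy as np

def waterfall(signal, fs, frame_len=256, hop=128):
    """Hypothetical helper: Hann-windowed short-time FFT, power in dB.

    Returns (freqs, times, power_db) with power_db shaped
    (n_frequencies, n_frames), i.e. rows are frequency, columns are time.
    """
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    # Slice the signal into overlapping frames and taper each one.
    frames = np.stack([signal[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    spectrum = np.fft.rfft(frames, axis=1)
    # Small constant avoids log(0) on silent frames.
    power_db = 10 * np.log10(np.abs(spectrum) ** 2 + 1e-12)
    freqs = np.fft.rfftfreq(frame_len, d=1.0 / fs)
    times = (np.arange(n_frames) * hop + frame_len / 2) / fs
    return freqs, times, power_db.T

# Synthetic stand-in for a recording (assumption): one second of the
# DTMF digit "1", which is the sum of a 697 Hz and a 1209 Hz tone.
fs = 8000
t = np.arange(fs) / fs
dtmf = np.sin(2 * np.pi * 697 * t) + np.sin(2 * np.pi * 1209 * t)

freqs, times, power_db = waterfall(dtmf, fs)

# Average over time, then find the strongest bin below and above 1 kHz
# (the DTMF low-group and high-group tones sit on either side of it).
mean_spectrum = power_db.mean(axis=1)
low = freqs < 1000
low_peak = freqs[low][np.argmax(mean_spectrum[low])]
high_peak = freqs[~low][np.argmax(mean_spectrum[~low])]
print("strongest low-band bin (Hz):", low_peak)
print("strongest high-band bin (Hz):", high_peak)
```

A steady tone pair like this shows up as two horizontal lines in the waterfall, whereas speech tends to produce energy that moves around in both time and frequency, which is the kind of difference the plots are meant to expose. `scipy.signal.spectrogram` computes a similar array directly, and `power_db` can be passed to an image-style plot or flattened into features for a classifier.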