Fundamental Frequency Determination:
Fundamental frequency (f0) estimation has long been a difficult problem in audio signal processing. Many context-specific estimators have been developed, and many of them work well in their intended context, but it has proved difficult to develop a "context-free" f0 estimator. Estimators developed for a particular application, such as musical note detection or speech analysis, are well understood, but they depend on the domain of the data: a detector designed for one domain is less accurate when applied to a different domain. As a result, many f0 estimators are currently available, but few are appropriate for more than one domain.
Choosing an f0 estimator for speech/song discrimination is therefore a difficult task, because detectors that work well for music, and hence for song, work less well for speech, and vice versa. Three possible solutions to this problem are: find an existing detector that is reasonably good for both speech and song; build a new detector that works very well for both speech and song; or use two f0 estimators, one suited to speech and one suited to song, and compare the results. The last option has two advantages: the f0 estimation is more reliable, and the differences between the two f0 estimates can themselves be used as a classification feature between speech and song. f0 estimators developed for speech and for instrumental music were found, but none developed specifically for vocal music. For this reason, it was decided to evaluate a set of f0 estimators and choose one that was reasonably accurate for both speech and song.
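Although the two-estimator option was not the one adopted here, the idea of using the disagreement between two f0 estimates as a feature can be illustrated with a minimal sketch. The two estimators below (an autocorrelation peak picker and a spectral peak picker) are generic stand-ins chosen for brevity, not the estimators evaluated in this work, and the function names, the 60 to 1000 Hz search range, and the semitone distance measure are all assumptions made for the example.

```python
import numpy as np

def f0_autocorr(frame, sr, fmin=60.0, fmax=1000.0):
    """Stand-in estimator 1: lag of the largest autocorrelation value
    within the lag range corresponding to [fmin, fmax]."""
    frame = frame - np.mean(frame)
    # Keep only non-negative lags of the full autocorrelation.
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lag_min = int(sr / fmax)
    lag_max = min(int(sr / fmin), len(ac) - 1)
    lag = lag_min + np.argmax(ac[lag_min:lag_max + 1])
    return sr / lag

def f0_spectral_peak(frame, sr, fmin=60.0, fmax=1000.0):
    """Stand-in estimator 2: frequency of the largest magnitude-spectrum
    bin within [fmin, fmax] (resolution limited to sr / len(frame))."""
    windowed = frame * np.hanning(len(frame))
    spectrum = np.abs(np.fft.rfft(windowed))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    band = (freqs >= fmin) & (freqs <= fmax)
    return freqs[band][np.argmax(spectrum[band])]

def f0_disagreement_semitones(frame, sr):
    """Hypothetical feature: absolute distance, in semitones, between
    the two estimates for the same frame."""
    a = f0_autocorr(frame, sr)
    b = f0_spectral_peak(frame, sr)
    return abs(12.0 * np.log2(a / b))

if __name__ == "__main__":
    sr = 16000
    t = np.arange(2048) / sr
    frame = np.sin(2 * np.pi * 200.0 * t)  # synthetic 200 Hz tone
    print(f0_autocorr(frame, sr), f0_spectral_peak(frame, sr))
    print(f0_disagreement_semitones(frame, sr))
```

On a clean synthetic tone the two estimates should nearly agree, so the disagreement is small. Whether this quantity actually separates speech from song would have to be tested on real recordings; the sketch only shows how such a feature could be computed per frame.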