This paper reports the progress of our automatic caption generation project. The VCML project has developed a video caption markup language (VCML) and its player (VCML Player) to reduce the labor and cost of producing captions. The Voice-Pause method\cite{Suzuki-MMSP2001}, which was originally developed to align voice intervals with their corresponding written text, is improved so that it can align sound data containing both voice and music intervals. The results of an alignment experiment show that the improved method, the Voice-Music-Pause method, can effectively align voice, music, and pause intervals.