The speech data has already been segmented.
Files range from 20060314-1-003390 to 20060314-1-008300.
The total number of sentences is 492.
The word accuracy is 79.35%.
------------------------ Overall Results --------------------------
SENT: %Correct=14.43 [H=71, S=421, N=492]
WORD: %Corr=84.74, Acc=79.35 [H=7184, D=410, S=884, I=457, N=8478]
======================================================
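The scoring line above can be sanity-checked from its own counts, assuming the standard HTK HResults conventions: %Corr = H/N and Acc = (H − I)/N, where H, D, S, I are hits, deletions, substitutions, and insertions, and N = H + D + S is the number of reference words.

```python
# Sanity check of the HTK-style scoring line (assumed HResults conventions:
# %Corr = H / N, Acc = (H - I) / N, with N = H + D + S reference words).
H, D, S, I, N = 7184, 410, 884, 457, 8478
assert H + D + S == N          # the counts are internally consistent
corr = round(100.0 * H / N, 2)
acc = round(100.0 * (H - I) / N, 2)
print(corr, acc)  # 84.74 79.35
```

The numbers reproduce the reported %Corr=84.74 and Acc=79.35 exactly.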
Monday, December 21, 2009
Saturday, December 19, 2009
Shot Boundary Detection
The shot boundary detection program is done.
It currently implements only the Kirsch mask part, without human detection.
I implemented the frame-similarity formula from the reference paper.
The FCM method is used to decide the threshold.
However, the preliminary result is not very good.
The three lines represent the three FCM cluster centers.
We use the value of the median cluster center (red line) as the threshold.
The X axis is the similarity between two successive frames, and the Y axis is the frame index.
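The thresholding step can be sketched with a minimal 1-D fuzzy c-means. This is a simplified sketch, not the actual implementation, and the similarity values below are toy data standing in for the real frame-similarity curve.

```python
# Minimal 1-D fuzzy c-means (FCM) for threshold selection; toy data only.

def fcm_1d(values, c=3, m=2.0, iters=100):
    """Return the c FCM cluster centers of 1-D data, sorted ascending."""
    srt = sorted(values)
    # deterministic init: spread the centers across the data range
    centers = [srt[i * (len(srt) - 1) // (c - 1)] for i in range(c)]
    for _ in range(iters):
        num = [0.0] * c   # membership-weighted sums
        den = [0.0] * c
        for x in values:
            d = [abs(x - ck) + 1e-12 for ck in centers]
            for j in range(c):
                # standard FCM membership of x in cluster j
                u = 1.0 / sum((d[j] / d[k]) ** (2.0 / (m - 1.0)) for k in range(c))
                num[j] += (u ** m) * x
                den[j] += u ** m
        centers = [num[j] / den[j] for j in range(c)]
    return sorted(centers)

# Toy similarity curve: high within a slide, dips at slide changes.
sims = [0.95, 0.96, 0.94, 0.97, 0.55, 0.93, 0.96, 0.20, 0.95, 0.94]
low, mid, high = fcm_1d(sims)
threshold = mid   # median cluster center, as described above
boundaries = [i for i, s in enumerate(sims) if s < threshold]
```

Frames whose similarity to the previous frame falls below the median cluster center are declared shot boundaries.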
From observation, I suspect that noise frames at the end of the lecture video may skew the FCM threshold computation.
Therefore, I'll try removing the noise frames at the end of the video data.
The following is the result after removing the noise frames (starting from frame 2192).
Thursday, December 10, 2009
Project Proposal (updated)
Problem - Low-Quality Lecture Video Segmentation
Since a large amount of lecture video data is hard for students to browse, we want to structure these videos in a meaningful way: by slide structure.
This lets users view the content they are interested in more efficiently.
Reference
"Structuring and Analyzing Low-Quality Lecture Videos", ICASSP 2009
Method
From observation, we find that the scenes of a lecture video are mainly ordered by slides.
However, shot boundary detection is more difficult here than for general video data, because the only variation in lecture video is the slide content.
Below, we take advantage of both video and speech information to improve performance.
1. Video Part
1.1) Human detection:
Remove the noise caused by the lecturer's movements.
1.2) Edge detection (by Kirsch mask):
A color-histogram-based method can hardly work here, because the backgrounds of slides are nearly identical.
1.3) Compute frame similarities using edge information:
If there is a large difference between two successive frames, we treat it as a shot boundary.
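Steps 1.2 and 1.3 can be sketched as follows. This is a hypothetical, simplified version: the eight Kirsch compass masks give a per-pixel edge magnitude, and two frames are compared by the overlap of their binary edge maps. The paper's exact similarity formula may differ.

```python
# Hypothetical sketch of Kirsch-based edge maps and frame similarity.

def kirsch_kernels():
    # The eight Kirsch compass masks: rotate the border ring of the north
    # mask [[5,5,5],[-3,0,-3],[-3,-3,-3]] in 45-degree steps.
    ring_pos = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0), (1, 0)]
    base = [5, 5, 5, -3, -3, -3, -3, -3]
    kernels = []
    for r in range(8):
        ring = base[-r:] + base[:-r]
        k = [[0] * 3 for _ in range(3)]
        for (i, j), v in zip(ring_pos, ring):
            k[i][j] = v
        kernels.append(k)
    return kernels

def edge_map(frame, thresh=255):
    """Binary edge map: max Kirsch response per interior pixel, thresholded."""
    h, w = len(frame), len(frame[0])
    ks = kirsch_kernels()
    return [[max(sum(k[i][j] * frame[r - 1 + i][c - 1 + j]
                     for i in range(3) for j in range(3)) for k in ks) > thresh
             for c in range(1, w - 1)]
            for r in range(1, h - 1)]

def frame_similarity(a, b):
    """Edge-map overlap (intersection over union) of two grayscale frames."""
    ea, eb = edge_map(a), edge_map(b)
    inter = sum(x and y for ra, rb in zip(ea, eb) for x, y in zip(ra, rb))
    union = sum(x or y for ra, rb in zip(ea, eb) for x, y in zip(ra, rb))
    return 1.0 if union == 0 else inter / union
```

Identical frames score 1.0; a frame with edges compared against a blank frame scores 0.0, so a large drop in similarity signals a candidate shot boundary.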
2. Speech Part
2.1) Automatic speech recognition:
Since what the teacher says is related to the slide content, we use the ASR output together with the original slide file to improve performance.
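One simple way to exploit this relation is a bag-of-words overlap between a window of ASR output and each slide's text. This is a hypothetical sketch; the slide texts and the ASR window below are made-up placeholders, not data from the project.

```python
# Hypothetical sketch: match an ASR word window to the most similar slide
# by bag-of-words overlap (slide texts and transcript are placeholders).

def best_slide(asr_words, slides):
    """Return the index of the slide sharing the most words with the ASR window."""
    asr = set(w.lower() for w in asr_words)
    scores = [len(asr & set(s.lower().split())) for s in slides]
    return scores.index(max(scores))

slides = ["introduction to hidden markov model",
          "viterbi algorithm dynamic programming",
          "language model n-gram smoothing"]
asr_window = "the viterbi algorithm uses dynamic programming".split()
print(best_slide(asr_window, slides))  # 1
```

A slide-change hypothesis from the video part can then be confirmed or rejected by whether the best-matching slide changes at the same time.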
Dataset
Corpus: NTU Digital Speech Processing lecture video (about 45 hr)
Training set (for acoustic model training): 12 hr (about 16 min from each hour)
Test set: about 37 min
Evaluation
Ground truth: manual alignment with the slides.
We will evaluate performance by precision and recall.
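The evaluation can be sketched as follows, under one assumption not stated in the proposal: a detected boundary counts as correct if it lies within a small tolerance (here `tol` frames) of a ground-truth boundary.

```python
# Boundary-detection evaluation sketch; the tolerance value is an assumption.

def precision_recall(detected, truth, tol=5):
    """Precision and recall of detected boundaries against ground truth."""
    tp_det = sum(any(abs(d - t) <= tol for t in truth) for d in detected)
    tp_ref = sum(any(abs(d - t) <= tol for d in detected) for t in truth)
    precision = tp_det / len(detected) if detected else 0.0
    recall = tp_ref / len(truth) if truth else 0.0
    return precision, recall

p, r = precision_recall(detected=[100, 203, 350], truth=[100, 200, 400])
print(round(p, 2), round(r, 2))  # 0.67 0.67
```

In the toy run, two of three detections fall near true boundaries and two of three true boundaries are found, giving precision = recall = 2/3.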