Showing posts with label Math. Show all posts

Thursday, May 21, 2009

Wu Jun's Beauty of Mathematics

Beauty of Mathematics

http://jun.wu.googlepages.com/beautyofmathematics

数学之美

(Written in Chinese)

I am writing a series of essays introducing the applications of math in natural language processing, speech recognition, web search, etc., for non-technical readers. Here are the links:

0. Page Rank ( 网页排名算法 )

1. Language Models (统计语言模型)

2. Chinese word segmentation  (谈谈中文分词)

3. Hidden Markov Model and its application in natural language processing (隐含马尔可夫模型)

4. Entropy - the measurement of information (怎样度量信息?)

5. Boolean algebra and search engine index (简单之美:布尔代数和搜索引擎的索引)

6. Graph theory and web crawler (图论和网络爬虫 Web Crawlers)

7. Information theory and its applications in NLP  (信息论在信息处理中的应用)

8. Fred Jelinek and modern speech and language processing (贾里尼克的故事和现代语言处理)

9. How to measure the similarity between queries and web pages (如何确定网页和查询的相关性)

10. Finite state machine and local search (有限状态机和地址识别)

11. Amit Singhal: AK-47 Maker in Google (Google 阿卡 47 的制造者阿米特.辛格博士)

12. The Law of Cosines and news classification (余弦定理和新闻的分类)

13.  Fingerprint of information and its applications (信息指纹及其应用)

14. The importance of precise mathematical modeling (谈谈数学模型的重要性)

15. Perfectionism and simplicity: several leading figures in natural language processing (繁与简 自然语言处理的几位精英)

16. Don't put all of your eggs in one basket - the Maximum Entropy Principle, part A (不要把所有的鸡蛋放在一个篮子里 -- 谈谈最大熵模型(A))

17. Don't put all of your eggs in one basket - the Maximum Entropy Principle, part B (不要把所有的鸡蛋放在一个篮子里 -- 谈谈最大熵模型(B))

18. All that glitters is not gold - search engine anti-spam (闪光的不一定是金子 谈谈搜索引擎作弊问题)

19. Matrix operations and text classification (矩阵运算和文本处理中的分类问题)

20. The godfather of NLP - Mitch Marcus (自然语言处理的教父 马库斯)

21. An extension of HMM: Bayesian networks (马尔可夫链的扩展 贝叶斯网络)

22. The principle of cryptography 由电视剧《暗算》所想到的 — 谈谈密码学的数学原理

23. How many keys do we need to type to input a Chinese character - Shannon's first theorem (输入一个汉字需要敲多少个键 -- 谈谈香农第一定律)

Jun Wu's (吴军) home page in Chinese

Jun Wu's (吴军) home page in English

Sunday, May 17, 2009

Hit rate and False alarm rate

 


From: http://www.ecmwf.int/products/forecasts/guide/Hit_rate_and_False_alarm_rate.html


Verification measures
like the RMSE and the ACC will value equally the case of an event
being forecast, but not observed, as an event being observed but
not forecast. But in real life the failure to forecast a storm that
occurred will normally have more dramatic consequences than
forecasting a storm that did not occur. To assess the forecast
skill under these conditions another type of verification must be
used.



For any threshold (like
frost/no frost, rain/dry or gale/no gale) the forecast is
simplified to a yes/no statement (categorical forecast). The
observation itself is put in one of two categories (event
observed/not observed). Let H denote "hits", i.e. all correct
yes-forecasts (the event is predicted to occur and it does occur),
F false alarms, i.e. all incorrect yes-forecasts, M missed
forecasts, i.e. all incorrect no-forecasts (the event was forecast
not to occur but did occur), and Z all correct no-forecasts. Assume
altogether N forecasts of this type with H+F+M+Z=N. A perfect
forecast sample is when F and M are zero. A large number of
verification scores are computed from these four values.





A forecast/verification table

forecast\obs    observed    not obs
forecast        H           F
not forecast    M           Z
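As a rough illustration of how the four counts in the table can be tallied, here is a minimal Python sketch. The function name contingency_counts and the ten-day sample are made up for this example and are not part of the ECMWF guide.

```python
def contingency_counts(forecast, observed):
    """Return (H, F, M, Z) for paired yes/no forecasts and observations."""
    if len(forecast) != len(observed):
        raise ValueError("forecast and observed must have the same length")
    H = F = M = Z = 0
    for f, o in zip(forecast, observed):
        if f and o:
            H += 1  # hit: event forecast and observed
        elif f and not o:
            F += 1  # false alarm: event forecast but not observed
        elif not f and o:
            M += 1  # miss: event observed but not forecast
        else:
            Z += 1  # correct no-forecast
    return H, F, M, Z


# Hypothetical ten-day sample: True means "frost forecast" / "frost observed".
forecast = [True, True, False, False, True, False, False, True, False, False]
observed = [True, False, False, True, True, False, False, False, False, False]

H, F, M, Z = contingency_counts(forecast, observed)
print(H, F, M, Z)  # 2 2 1 5
```

The four counts always add up to the number of forecast/observation pairs, N.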




The frequency bias
BIAS=(H+F)/(H+M), ratio of the yes forecast frequency to the yes
observation frequency.



The proportion of
correct PC=(H+Z)/N, gives the fraction of all the forecasts that
were correct. Usually it is very misleading because it credits
correct "yes" and "no" forecasts equally and it is strongly
influenced by the more common category (typically the "no"
event).
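A small worked example may help show why PC can mislead. The numbers below are hypothetical: an event that occurs in 5 of 100 cases, and a forecaster who always says "no".

```python
def proportion_correct(H, F, M, Z):
    """PC = (H + Z) / N, the fraction of all forecasts that were correct."""
    return (H + Z) / (H + F + M + Z)


# Hypothetical sample: the event occurred 5 times in 100 cases and
# was never forecast, so there are no hits and no false alarms.
H, F, M, Z = 0, 0, 5, 95

pc = proportion_correct(H, F, M, Z)
print(pc)  # 0.95
```

Every event was missed, yet 95% of the forecasts count as correct, because the common "no" category dominates the score.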



The probability of
detection POD=H/(H+M), also known as Hit Rate (HR), measures the
fraction of observed events that were correctly forecast.



The false alarm ratio
FAR=F/(H+F), gives the fraction of forecast events that were
observed to be non events.



The probability of
false detection POFD=F/(Z+F), also known as the false alarm rate,
is the measure of false alarms given the event did not occur. POFD is
generally associated with the evaluation of probabilistic forecasts
by combining it with POD into the Relative Operating Characteristic
diagram (ROC).
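One common way to build the points of a ROC diagram is to turn a probabilistic forecast into a yes/no forecast at several probability thresholds and compute POD and POFD at each one. The sketch below assumes that approach; the function name roc_points, the probabilities and the thresholds are all invented for illustration.

```python
def roc_points(probabilities, observed, thresholds):
    """Return a list of (threshold, POFD, POD), one entry per threshold."""
    points = []
    for t in thresholds:
        H = F = M = Z = 0
        for p, o in zip(probabilities, observed):
            yes = p >= t  # categorical forecast at this threshold
            if yes and o:
                H += 1
            elif yes and not o:
                F += 1
            elif not yes and o:
                M += 1
            else:
                Z += 1
        # Guard against samples with no events or no non-events.
        pod = H / (H + M) if (H + M) else float("nan")
        pofd = F / (F + Z) if (F + Z) else float("nan")
        points.append((t, pofd, pod))
    return points


# Hypothetical forecast probabilities and whether the event occurred.
probabilities = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
observed = [True, True, False, True, False, False, True, False]

for t, pofd, pod in roc_points(probabilities, observed, [0.25, 0.5, 0.75]):
    print(t, pofd, pod)
```

Plotting POD against POFD for each threshold gives the ROC curve; points further above the diagonal POD=POFD indicate better discrimination.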



A very simple measure of
success of categorical forecasts is the difference POD-POFD, which is
known as the Hanssen-Kuipers or True Skill Score. Among other
properties, it can be easily generalised for the verification of
probabilistic forecasts (see 7.4 below).
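The scores defined above can be collected in one small Python function. This is only a sketch: the function name and the sample counts are hypothetical, and the True Skill Score is computed under the usual Hanssen-Kuipers definition, POD-POFD = H/(H+M) - F/(F+Z).

```python
def verification_scores(H, F, M, Z):
    """Compute categorical verification scores from the four table counts.

    Assumes H+M, H+F and F+Z are all non-zero; otherwise the
    corresponding ratio is undefined and a ZeroDivisionError is raised.
    """
    N = H + F + M + Z
    pod = H / (H + M)    # probability of detection (hit rate)
    pofd = F / (F + Z)   # probability of false detection (false alarm rate)
    return {
        "BIAS": (H + F) / (H + M),  # frequency bias
        "PC": (H + Z) / N,          # proportion correct
        "POD": pod,
        "FAR": F / (H + F),         # false alarm ratio
        "POFD": pofd,
        "TSS": pod - pofd,          # True Skill Score, assumed POD - POFD
    }


# Hypothetical counts for 100 forecasts.
scores = verification_scores(H=30, F=10, M=20, Z=40)
for name, value in scores.items():
    print(name, round(value, 3))
```

Note that FAR (the false alarm ratio, a fraction of the yes-forecasts) and POFD (the false alarm rate, a fraction of the non-events) are different quantities even though their names are similar.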

Monday, May 11, 2009

Using FFT in MATLAB

From: http://www.mathworks.com/products/matlab/demos.html?file=/products/demos/shipping/matlab/sunspots.html

This demonstration uses the FFT function to analyze the variations in sunspot activity over the last 300 years.

Sunspot activity is cyclical, reaching a maximum about every 11 years. Let's confirm that. Here is a plot of a quantity called the Zurich sunspot relative number, which measures both number and size of sunspots. Astronomers have tabulated this number for almost 300 years. Search Google for the data.

load sunspot.dat
year=sunspot(:,1);
relNums=sunspot(:,2);
plot(year,relNums)
title('Sunspot Data')



Here is a closer look at the first 50 years.



plot(year(1:50),relNums(1:50),'b.-');



The fundamental tool of signal processing is the FFT, or fast Fourier transform. To take the FFT of the sunspot data, type the following. The first component of Y, Y(1), is simply the sum of the data, and can be removed.



Y = fft(relNums);
Y(1)=[];
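To see why the first Fourier coefficient is just the sum of the data, here is a small Python sketch (not MATLAB, and not part of the original demo) that uses a direct O(n^2) discrete Fourier transform instead of an FFT. The data values are arbitrary, and Python's Y[0] plays the role of MATLAB's Y(1).

```python
import cmath


def dft(x):
    """Direct discrete Fourier transform, same convention as MATLAB's fft."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


# Arbitrary illustrative values, not the sunspot data.
data = [3.0, 7.0, 12.0, 9.0, 4.0, 1.0, 6.0, 10.0]

Y = dft(data)

# For k = 0 every exponential factor equals 1, so Y[0] is the plain sum.
print(Y[0].real, sum(data))
```

Because this zero-frequency term only reflects the overall level of the data, dropping it keeps it from dwarfing the periodic components in later plots.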


A graph of the distribution of the Fourier coefficients (given by Y) in the complex plane is pretty, but difficult to interpret. We need a more useful way of examining the data in Y.



plot(Y,'ro')
title('Fourier Coefficients in the Complex Plane');
xlabel('Real Axis');
ylabel('Imaginary Axis');



The complex magnitude squared of Y is called the power, and a plot of power versus frequency is a "periodogram".



n=length(Y);
power = abs(Y(1:floor(n/2))).^2;
nyquist = 1/2;
freq = (1:n/2)/(n/2)*nyquist;
plot(freq,power)
xlabel('cycles/year')
title('Periodogram')



The scale in cycles/year is somewhat inconvenient. Before switching to years/cycle to estimate the length of one cycle, here is a closer look at the first 40 frequency components.



plot(freq(1:40),power(1:40))
xlabel('cycles/year')



Now we plot power versus period for convenience (where period=1./freq). As expected, there is a very prominent cycle with a length of about 11 years.



period=1./freq;
plot(period,power);
axis([0 40 0 2e+7]);
ylabel('Power');
xlabel('Period (Years/Cycle)');



Finally, we can fix the cycle length a little more precisely by picking out the strongest frequency. The red dot locates this point.



hold on;
index=find(power==max(power));
mainPeriodStr=num2str(period(index));
plot(period(index),power(index),'r.', 'MarkerSize',25);
text(period(index)+2,power(index),['Period = ',mainPeriodStr]);
hold off;
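The same peak-picking idea can be checked on a signal whose period is known in advance. The Python sketch below is an illustration rather than a port of the demo: it uses a synthetic sine wave with an 11-sample period instead of sunspot.dat, a direct DFT instead of an FFT, and variable names chosen for this example.

```python
import cmath
import math


def dft(x):
    """Direct discrete Fourier transform, same convention as MATLAB's fft."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


# Synthetic "yearly" series: ten full cycles of an 11-year period.
N = 110
signal = [50 + 40 * math.sin(2 * math.pi * t / 11) for t in range(N)]

Y = dft(signal)

# Keep bins 1..N/2 (skip the sum in Y[0]); bin k corresponds to k/N cycles/year.
half = N // 2
power = [abs(Y[k]) ** 2 for k in range(1, half + 1)]
period = [N / k for k in range(1, half + 1)]  # years/cycle

# Index of the strongest component, like find(power==max(power)) in the demo.
index = max(range(len(power)), key=power.__getitem__)
main_period = period[index]
print(main_period)  # 11.0
```

With real data the peak rarely falls exactly on one frequency bin, so the estimated period is only as fine as the spacing between bins allows.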

Wednesday, March 18, 2009

Using Maple in command line

In command-line Maple, type the following command to execute the input file:

read "D:/model.mpl";

We can also use the following command to redirect the output to a text file:

[Installation Folder]\Maple 12\bin.win\cmaple.exe "D:/model.mpl" > res.txt