## Tue, 21 Apr 2009

## Statistics Problem

Is there any statistician reading this blog? Can you recommend a statistics community (web forum, mailing list, anything) where I could ask questions about a problem I am currently trying to solve? For those with an IS MU login, the description will also be available for some time in the discussion forum of the Faculty of Science. The problem is this:

I have a random process in which an "atom" survives to time *t*
with probability *exp(-a*t)* for some constant *a* (think radioactive decay,
but the real problem is something different). The task is to compute
the constant *a* from the observed data.

The measurements I have are in the form of a set of pairs
*(t_i, + or -)*, with the following meaning:
at time 0, take a brand new "*i*-th atom", verify that it has not decayed,
wait for the time *t_i*, and look at it again. If the atom has not decayed yet,
add a *(t_i, +)* pair to the set of measurements; otherwise, add *(t_i, -)*.
Continue with the next new atom and the next time *t_{i+1}*.
Note, however, that the times *t_i* are given to me from the outside:
I cannot choose them, and they need not follow any particular distribution
(e.g. being uniformly distributed between zero and some large number).
Also, the number of measurements is quite small (several hundred at most).

You can download a Perl script for generating the test data, the test data (100 rows),
and the large test data (10,000
rows)
generated by this script. Can you somehow compute which constant *a*
was used when generating each of these data sets? If so, how can it be
done? And how can I estimate how accurately the *exp(-a*t)* curve fits
the real data?
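For readers who cannot download the linked script, the measurement process described above can be simulated with a short sketch like the following (this is not the original Perl script; the function name and interface are my own invention):

```python
import math
import random

def generate_measurements(a, times):
    """Simulate the process described in the post: for each externally
    given time t, a fresh atom either survives to time t (probability
    exp(-a*t), recorded as '+') or decays before t (recorded as '-')."""
    data = []
    for t in times:
        survived = random.random() < math.exp(-a * t)
        data.append((t, '+' if survived else '-'))
    return data
```

For example, with *a* = 0.5 and many repetitions of the same time *t* = 1, roughly *exp(-0.5)*, i.e. about 61 %, of the recorded pairs should carry a `+`.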

## 4 replies for this story:

### J.C. wrote:

I haven't studied your problem to the extent that I fully understand it, so just an idea: how about applying the logarithm to the data (the y-values)? This could make things a bit easier, as it should transform data with some obscure exponential dependency into a linear-like dependency.

### Yenya wrote: Re: logarithms

Yes, log(y) transforms the exponential regression into a linear one. However, log of what? I have + or - (i.e. 1 or 0) values. So which points should I apply the linear regression to? 0 and -infinity do not sound feasible to me (remember, the data is relatively sparse, and even after discretising the time axis into intervals, some intervals contain only - values).

### Luinar wrote: Solving

Sorry that I won't bother with English, but in Czech it will be much faster to explain the solution. For each time t_i you can compute the probability of getting + or -:

P(t_i, +) = exp(-alpha t_i), P(t_i, -) = 1 - exp(-alpha t_i)

That is, for each particle in the result you can now compute its probability as a function of the parameter alpha. And since the particles are independent, the probability of obtaining exactly this data set is the product of the probabilities of the individual particles. Example: for the results 1 +, 2 -, 3 -, 4 +, 5 -, the probability of this outcome is:

exp(-alpha) [1-exp(-2*alpha)] [1-exp(-3*alpha)] exp(-4*alpha) [1-exp(-5*alpha)]

You are then looking for the alpha that maximizes this probability. After some small manipulations and playing around with it (done for compactness; it is not the only possible form), you get an analytical formula (in LaTeX notation):

\sum\limits_{i=1}^N t_i \frac{ \mathrm e^{-\alpha t_i} - h_i }{ 1 - \mathrm e^{-\alpha t_i}} = 0

where N is the number of atoms, t_i are the times of the individual observations, and h_i are the results, written as 0 for - and 1 for +. This equation has to be solved numerically; as a hint, it is useful to notice that the sum as a whole is a decreasing function of alpha, so it can be solved nicely by interval bisection or some more advanced gradient method. If you have further questions, send them to my e-mail ("my nickname" at Seznam; I do not trust the input box here, since I do not know whether it would be displayed publicly, in an ideal form for spambots).
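The bisection approach sketched in this comment can be written down in a few lines of Python (a minimal sketch; the function names and bracketing bounds are my own choices, and it assumes the data contains both + and - outcomes, encoded as h_i = 1 and h_i = 0):

```python
import math

def score(alpha, data):
    """Left-hand side of the likelihood equation above:
    sum_i t_i * (exp(-alpha*t_i) - h_i) / (1 - exp(-alpha*t_i)),
    where data is a list of (t_i, h_i) pairs with h_i = 1 for '+'
    and h_i = 0 for '-'.  The sum is a decreasing function of alpha."""
    s = 0.0
    for t, h in data:
        e = math.exp(-alpha * t)
        s += t * (e - h) / (1.0 - e)
    return s

def estimate_a(data, lo=1e-9, hi=1.0, iters=200):
    """Bisection on the monotonically decreasing score function.
    Assumes both '+' and '-' outcomes occur, so a root exists."""
    while score(hi, data) > 0.0:   # grow the bracket until the sign flips
        hi *= 2.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if score(mid, data) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

As a sanity check: if all measurements share the same time t, the equation reduces to exp(-alpha*t) = (fraction of +), so e.g. 607 plus and 393 minus outcomes at t = 1 should give an estimate of -ln(0.607).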

### Yenya wrote: Re: Solving

Your solution is right. I am sorry, I forgot to update this blog entry, but I already had this solution: a friend of my colleague solved it a day or two after the blog post was published. But thanks anyway, I hope it was a nice mental exercise for you :-) Now the other problem is the last sentence of the blog post: when the data is a bit noisy, I would like to estimate how well the data fits the exponential curve. Of course, some estimate can be derived from the likelihood in your comment, but its value depends heavily on both the number of measurements and the t_i values. I would like to have something that could be used to compare even data sets with different numbers of measurements and/or different distributions of the t_i values.