Consider a fictional, hypothetical illness; let us call it "monthitis". Suppose there is thought to be a serious reason why people born in a particular month might be more likely to get the condition. A study is to be done, but we have no advance knowledge of which month that would be. For simplicity, a month is to be modelled as if every month were the same length; that is, the fact that a month may be 28, 29, 30 or 31 days long can be disregarded in an initial model, to avoid unnecessarily high precision.
(Obviously, if you can program a computer to work out a better solution then by all means go ahead. I have used a binomial model with a normal approximation, without a continuity correction, to get an estimate. Really a multinomial model is best here, but I cannot do that without a computer package to help, because the calculations are likely to be very complicated and error-prone, unless someone has a bright idea for an easy way of doing this analytically by hand in a simplified fashion.)
A normal approximation to a binomial model with p = 1/12, q = 11/12 and n = 1000 (sample size 1000) will be used, so that the calculation can be done with a calculator and a good table of normal Z values.
Correct me if I am wrong, but this is what I think:
mean = np = 83.333...
variance = npq = 76.388...
Therefore we want the model N(83.333, 76.388...) [N = Normal distribution with mean then variance]
The standard deviation is 8.74 (approx.)
Now how many standard deviations do we need for a robust piece of evidence that allows the null hypothesis to be conventionally
rejected at the 1% level of significance? [You may be able to phrase that in better statistical terms]
Should we divide the significance probability SP = 0.01 by 12 to allow for the month not being stated in advance?
This intuitively seems right to me, but is it statistically correct to divide by 12?
A value of SP = 0.005 (halving the SP to account for a two-tailed test) gave me 2.576 standard deviations.
However, dividing that by 12 gives SP = 0.0004166...,
which gave me a figure of 3.34 standard deviations (i.e. 29.19, which added to the mean of 83.333 gives 112.5..., or let us say 113 people).
So if we have 113 people with monthitis in the most commonly encountered month for people with this condition, is this the point at
which statistical significance can be argued to be reasonable at the 1% level or have I missed something out ?
Intuitively I would expect the figure to be higher. (eg. 160 approx.) [Perhaps this is a reflection of the unwise notion of not stating the month in advance with a reason for the choice.]
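The divide-by-12 idea can be read as a Bonferroni (union bound) correction: under the uniform null hypothesis each monthly count is marginally Binomial(1000, 1/12), so the probability that the largest of the 12 counts reaches some threshold is at most 12 times the probability that a single such count does, and making the single-month tail 0.01/12 (or 0.01/24 for the two-tailed version) keeps the overall level at or below 1%, if somewhat conservatively. A minimal sketch of the threshold calculation above, in plain Java with the z values hard-coded from tables (the class and variable names are only illustrative):

```java
// Normal approximation to Binomial(1000, 1/12), no continuity correction.
// z values from tables: 2.576 for a 0.5% tail, 3.34 for a 0.01/24 tail.
public class MonthitisThreshold {
    public static void main(String[] args) {
        double n = 1000.0, p = 1.0 / 12.0, q = 1.0 - p;
        double mean = n * p;                 // 83.33...
        double sd = Math.sqrt(n * p * q);    // about 8.74
        double zTwoTailed = 2.576;           // SP = 0.01, two-tailed, no adjustment
        double zAdjusted = 3.34;             // SP = 0.01, two-tailed, divided by 12
        System.out.println("mean = " + mean + ", sd = " + sd);
        System.out.println("unadjusted threshold   ~ " + (int) Math.ceil(mean + zTwoTailed * sd));
        System.out.println("divide-by-12 threshold ~ " + (int) Math.ceil(mean + zAdjusted * sd));
    }
}
```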
Question 1:
Can anyone do this with a more exact model using a computer assuming 28 days in February and the correct number of days for all
other months, using a multinomial model (or Loglinear generalized linear model)?
Obviously I ask just out of interest here (I am not studying anything) and as a challenge/exercise/discussion.
In reality, of course, a proper scientific theory would be needed to justify the initial hypothesis. In practice it would probably mean that an illness could be more likely to be caught in a particular season, such as winter or summer (the fact that I mentioned birth is really irrelevant in terms of the statistical testing, so a similar principle can be applied to a question involving the start date of the illness).
A better question might therefore be:
Question 2
If a statistical test were done to establish whether the common cold is more likely to be caught in the winter (i.e. from the 1st of December to the 28th of February) in a country whose cold season falls at that time of year according to climate statistics, what threshold would the number of winter cases, out of 1000 cases of a cold caught in a suitably designed study, need to cross for a hypothesis test to confirm the effect relative to the number caught at other times of year?
(A method would be nice for that, ideally. Personally I would start by calculating the number of days from the 1st of December to the 28th of February, divide by 365, and use this as p. Then use the binomial model, with a normal approximation if no computer is available but a pocket calculator and a normal distribution table are. Use SP = 0.01; I wonder whether any adjustment is necessary here. It is not really two-tailed. It would be a good idea to check whether a continuity correction makes a significant difference, although I don't think the difference is going to matter much in this case.)
The second question is better because a natural, climate-related reason has been given, together with the potential for a plausible theory. The first question is not good practice in science because no good reason has been given in advance for people born in one particular month being more likely to get the condition later in life. It is very unusual for an illness to be most likely in just one particular month without the neighbouring month or months also being more likely to a lesser extent (whether it is birth or catching the illness that is being considered).
There was a suggestion I heard about from a serious study* that, because the education system starts the academic year in September, a psychological or even physical bias towards better performance had been noticed for those born near the start of the academic year (September) relative to the end (August). [*IFS, according to the BBC website in Oct 2007.]
This is very different from question 1, of course, because I have given a scientific basis for the hypothesis (also, the response variable in the academic-performance example above is numeric, allowing more detail, rather than true or false as with my made-up monthitis). You would expect a similar but weaker effect for people born in October relative to those born in September, a sudden jump between September and August, then a minor improvement in July. I suppose with that one you could define the start of the year to be a certain day (with careful checks on the official start dates in the country concerned) and do a test based upon the number of days into the academic year rather than use months. Exactly what statistical test would be used there I do not know (correlation, or more general linear statistical modelling?). Obviously a correlation would not prove a causal relationship (the correlation is probably quite weak, and even a strong one would not settle the matter completely; the sample size in the real study was probably massive, which is still not proof, of course, but it makes the evidence stronger). No statistics can ever really prove a causal relationship, but on the other hand what other method would we use for a social-science study?
Last edited by SteveB (2013-07-06 07:13:12)
Hi;
That is too much of a question.
General questions such as those are already covered in the theory of the distribution I intend to use.
Please provide some data. It should consist of one column, labeled months ( Jan - Dec ) and the next column having the number connected with that month. Then I can solve the problem.
Question 1 was an entirely fictional example, so I have made up some data.
Okay, here goes. Off the top of my head, suppose for instance that it was:
January: 114 people born in this month suffer from monthitis
February: 77
March: 78
April: 77
May: 74
June: 82
July: 72
August: 77
September: 71
October: 89
November: 93
December: 96
Right, so the question is: given that no statement has been made in advance about January having a higher frequency
of cases than any other month, and that a two-tailed test is therefore needed, how statistically significant is the above result if it were
a study by a research group using a conventional hypothesis test?
Obviously my question is entirely made up including all of the data.
The sample size is n = 1000
All numbers for each month are confirmed cases of the fictional medical condition or illness.
Last edited by SteveB (2013-07-07 19:36:57)
That will work fine:
I will use a criterion of 5%.
Theory: there is no correlation between months and the number of illnesses.
The test gives x^2 = 20.936 on 11 degrees of freedom, with p = 0.03404.
By conventional criteria, this difference is considered to be statistically significant.
The p value answers this question: if the theory that generated the expected values were correct, what is the probability of observing such a large discrepancy (or larger) between observed and expected values? A small p value is evidence that the data are not sampled from the distribution you expected.
I would reject the theory that months and illness are not correlated, since there is only about a 3% chance that the above result happened by chance.
If your test wanted 1%, then we cannot rule out that the above data arose by chance.
That will work fine:
I will use a criterion of 5%.
Theory: there is no correlation between months and the number of illnesses.
x^2 = 20.936
Was the "x squared" bit based upon a statistical test? Was it a Chi-Squared Test? Are you sure that this is the correct test?
Or was this based upon linear regression?
There was ONLY SUPPOSED to be EXACTLY ONE illness concerned, which ALL the people were known to have.
When you say "illnesses", was that a typing error, or did you model it as if you were counting how many illnesses in total
were reported to have occurred overall in that month? At the time of writing I have not been able to check how you could
model this. Perhaps linear regression could be used if that is the assumption, but that would be your question, not mine.
You are probably much more experienced than me in this, better qualified, and have better software and a newer computer,
so don't get me wrong, I am not arguing with you; it's just that you have used a different test from the one I would have leapt
for had I still had access to Genstat. Out of interest, do you have a statistical software package? If so, which one, if you don't
mind me asking? (Is there a free tool for this with Wolfram? A Wolfram website I looked at some time ago did not seem to do this.)
p = 0.03404
(3.404 %)
My calculation seemed to suggest about 0.9%, but as you could tell I did NOT trust my answer at all; 3.404% seems roughly right.
That agrees with my rough intuition better than my attempted calculation. Was that a probability from linear regression?
Or chi-squared? Or log-linear? Or a multinomial simulation?
By conventional criteria, this difference is considered to be statistically significant.
The p value answers this question: If the theory that generated the expected values were correct, what is the probability of observing such a large discrepancy (or larger) between observed and expected values? A small P value is evidence that the data are not sampled from the distribution you expected.
I would reject the theory that months and illness are not correlated, since there is only about a 3% chance that the above result happened by chance.
If your test wanted 1%, then we cannot rule out that the above data arose by chance.
Yes, I seem to have "concluded" that we could, but realised that there was something wrong with my attempt.
I thought that using a graphics calculator to fudge a rough solution by trying to "simplify" things was not really going
to give a very accurate answer. I was going to suggest a log-linear contingency-table style solution to this, but needed
Genstat to remind me as to whether this was appropriate. I still have the textbook that accompanied the course
I studied in 2006 on this topic (for which I got a grade 2 pass, in a system where grade 1 is best and grade 4 is
just an ordinary pass, so I was above average but not exactly amazing at Linear Statistical Modelling).
I think I need a bit of a refresher course on this if I ever do get to use it for something serious, but as yet I have never
had a job in which I would need to do this; still, you never know what might happen.
I have decided that I accept your answer as correct, but I do not know how you reached the answer nor what formal test
you did. You may even have done a simulation of the "exact" situation to see how many times something as significant
or more significant would happen, in which case you could consider that an unbeatable answer.
Thanks.
Last edited by SteveB (2013-07-08 03:29:15)
This is a non-parametric test called the chi-squared goodness-of-fit test.
January: 114 people born in this month suffer from monthitis
February: 77
March: 78
April: 77
May: 74
June: 82
July: 72
August: 77
September: 71
October: 89
November: 93
December: 96
The chi-squared goodness-of-fit test just computes whether the above data could plausibly have come from the hypothesis that each month should have the same expected number. It is unlikely that this is true.
This vid will describe the technique:
http://www.youtube.com/watch?v=DLzztj39V4w
http://www.stat.yale.edu/Courses/1997-98/101/chigf.htm
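For reference, a short plain-Java sketch of the same goodness-of-fit calculation (illustrative only; the data and the equal expected counts of 1000/12 are as above). It reproduces x^2 = 20.936 on 11 degrees of freedom; the p value of about 0.034 would then be read from a chi-squared table or a statistics package.

```java
// Chi-squared goodness-of-fit statistic for the monthitis data against equal
// expected counts (1000/12 per month), on 12 - 1 = 11 degrees of freedom.
public class ChiSquaredGoodnessOfFit {
    public static void main(String[] args) {
        int[] observed = {114, 77, 78, 77, 74, 82, 72, 77, 71, 89, 93, 96}; // Jan..Dec
        int total = 0;
        for (int o : observed) total += o;              // 1000
        double expected = total / 12.0;                 // 83.33...
        double chiSq = 0.0;
        for (int o : observed) chiSq += (o - expected) * (o - expected) / expected;
        System.out.println("chi-squared = " + chiSq + " on " + (observed.length - 1) + " df");
    }
}
```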
There was ONLY SUPPOSED to be EXACTLY ONE illness concerned which ALL the people were known to have.
The calculation did precisely that.
(Is there a free tool for this with Wolfram? A Wolfram website I looked at some time ago did not seem to do this.)
Thank you, I am now convinced that you have answered the question 1 part correctly.
I quite often find that I am okay once I have established which test is the right one to use
out of the very large number of hypothesis tests that have been invented.
Of course, the design of the pseudo-study was one I deliberately set up to be unusual,
in that it would only be valid to think of the months in a categorical way if there
were an artificial construct that made one unknown month very different (perhaps) from all
of the others (or at least gave each month its own independent character).
If, as in the "common cold" case, you wanted to test whether cold times of the year had
some link to the frequency of catching the cold virus, you would not of course expect
a sudden jump from February to March, for instance. A natural occurrence rather than
a sudden artificial leap would be happening, so the study would be best designed
in a way that gives more accurate information for testing the correlation between temperature and
the number of cases of the cold illness.
I suppose if a factory were giving off a pollutant in January only and in no other month,
and the chemical made monthitis more likely to occur at birth, then maybe there
might be a very rare use for this. On the other hand, why would a newborn infant be suddenly
able or unable to "catch" the "condition" ONLY exactly at birth and not after birth?
In practice it would not happen like that, so even in this case the study would be daft.
To be frank, I cannot at this time think of a serious reason why this would be a statistically desirable
hypothesis-test study design.
On the other hand, given the question that I asked, I would give your answer full marks.
Nice one, and a good refresher lesson for me in chi-squared tests.
Last edited by SteveB (2013-07-08 06:58:39)
I think what I really meant is this: perhaps initially, to help understand things, work out the distribution formula(s)
of the maximum of the 12 monthly counts when 1000 individuals are put randomly into the 12 categories.
Then use this to work out the 99th percentile of that maximum for an arbitrary instance of 1000 individuals being randomly (uniformly)
put into 12 categories, or even according to the actual proportions of the months, taking into account the exact number
of days. (The first part is not strictly needed, but it is probably a good idea to try it first if one were trying to calculate this exactly.)
Maybe it is better to say: "Work out the 99th percentile, having found a formula for the distribution of the maximum of a
set of counts obtained by putting 1000 people into 12 categories according to a uniform distribution, with
either equal proportions for all 12 or, more difficult still, 12 non-equal proportions."
The answer is a single real number, which should probably be rounded upwards to the next whole number.
(A few rather vague guesses might be 113, 129, 123, etc., but without more evidence I do not know. Chi-squared does not
give the correct answer to this, because it treats all 12 categories as potentially biased, whereas
I am considering as the null hypothesis that all 12 are uniform, against the alternative that 11 are uniform and 1 is biased.
As I have explained, this is so ridiculous in real-life contexts that I so far cannot think why a well-designed serious
study should really do this; an MIT lecture used something similar as a purposely bad example of poor study design,
since it "draws a dartboard around the result after the data was collected". In my words, it either uses an unjustified, unsound SP, or
it leaves the statistician with the job of working out a suitably increased level of significance, so that the hypothesis
COULD still be rejected, but only if some very remarkable result were obtained in which one month were very
unusual indeed. I am trying to formalise my attempt of dividing the significance probability by 12 (or 24), but intuitively it
probably needs a very complicated adjustment; it is more difficult than I thought, and perhaps more like dividing SP (0.01) by 132, say.
The SP fudge factor may of course be an irrational number with factorials, square roots and all sorts all over the place.)
I suppose I have been taught the groundwork to be able to program a computer to do a simulation, but it is rather
awkward: not something I would want to spend too long on, and not an entirely satisfactory answer even then.
If someone finds things like this easy, or has a software package that can solve it in less than
30 minutes' work (e.g. bobbym, perhaps, if his software packages are very powerful indeed),
then it would be interesting to see whether you could answer this, to help build my intuition.
If a serious course taught how to solve this, it would probably want the algebra that proves
and/or illustrates why it works, and a numerical answer alone (e.g. 121) would not get any marks.
My intuition is that this is of Masters degree level, but someone might prove me wrong. (I had intended it to be only A-level or BSc.)
I wonder whether the loaded-dice formulas are of any use with this... (Here we go again...)
The problem is rather like trying to prove a matter concerning a 12-sided die by rolling it 1000 times.
I have tried looking at the biased-coin Wikipedia page and I cannot really adapt it for this, because
it is fine for two states (heads, tails) but not for {1,2,3,4,5,6,7,8,9,10,11,12}: 12 states.
It is at least one level of difficulty above a problem where the wiki site did something
that I did not understand and anonimnystefy did not really give a very full account of, but sounded
as if he knew and understood it, in the Exercises - Statistics thread some time ago. I can apply the formula,
but do not fully understand the derivation, and it would have to be made even more complex
to give a full solution to this problem.
Last edited by SteveB (2013-07-09 07:19:37)
Hi;
The problem is rather like trying to prove a matter concerning a 12 sided die by rolling it 1000 times.
What is needed is a full understanding of the term "matter."
What property are you trying to prove? Please state it as simply as you can and try to think of an example. Then I can work on it.
Here is my simple version of the problem to illustrate the distribution change caused by taking
the maximum of a uniform set:
Consider taking a set of n values from a set containing m integers.
Let us take a pair {1,2} from {1,2,3,4,5}.
You can always define a maximum of the set {1,2} in this case it is 2.
Therefore it is reasonable to ask: What is the median and/or mean maximum of the pair taken from {1,2,3,4,5}?
So the pairs are:
{1,2} {1,3} {1,4} {1,5} [Maximums:2,3,4,5]
{2,3} {2,4} {2,5} [Maximums:3,4,5]
{3,4} {3,5} [Maximums:4,5]
{4,5} [Maximum: 5]
So the mean is: 4
The list of 10 items is {2,3,3,4,4,4,5,5,5,5} so the median is 4 (slight variations are possible).
To the nearest integer you could call the 99th percentile 5, but conventions could make it
slightly lower. I have decided that 4.97, based upon 2 + (5-2)*0.99, is the most sensible way
of calculating a non-integer version of the 99th percentile. Unless I have misunderstood the
Wikipedia entry, the formulas do not really work for this case: either you get a boring 5,
or a number that exceeds 5, which is obviously wrong, or at least not a good idea. I notice that the normal
model of N(4,1), that is with mean = 4 and variance = 1 so standard deviation = 1, gives,
if you take the value Z = 2.33, a silly value of 6.33 for the 99th percentile.
Obviously the normal model assumes an unbounded real-valued domain, so things like that
are bound to happen; 4.97 is my best attempt at this version of the problem.
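A short brute-force sketch of this toy version (plain Java; the class name is illustrative): it enumerates every n-element subset of {1, ..., m}, records the maximum of each, and prints the mean and a nearest-rank 99th percentile, so the {1,2,3,4,5} example above, and the n = 3, m = 8 case that comes up later, can be checked directly.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Enumerates all n-element subsets of {1,...,m} via bitmasks, takes the maximum
// of each subset, and reports the mean and a nearest-rank 99th percentile.
public class MaxOfSubsets {
    public static void main(String[] args) {
        int m = 5, n = 2;                   // try m = 8, n = 3 for the later example
        List<Integer> maxima = new ArrayList<>();
        for (int mask = 0; mask < (1 << m); mask++) {
            if (Integer.bitCount(mask) != n) continue;   // keep only n-element subsets
            int max = 0;
            for (int i = 0; i < m; i++)
                if ((mask & (1 << i)) != 0) max = Math.max(max, i + 1);
            maxima.add(max);
        }
        Collections.sort(maxima);
        double mean = maxima.stream().mapToInt(Integer::intValue).average().orElse(0);
        int idx = (int) Math.ceil(0.99 * maxima.size()) - 1;   // nearest-rank index
        System.out.println("subsets: " + maxima.size() + ", mean maximum: " + mean);
        System.out.println("99th percentile (nearest rank): " + maxima.get(idx));
    }
}
```

For m = 5, n = 2 this prints a mean of 4.0 and a nearest-rank 99th percentile of 5, matching the list above; the interpolated 4.97 is a separate convention.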
---------------------------------------------------------------------------------------------------------
Regarding the more difficult problem: 12 months, 1000 people sampled, the maximum monthly count taken, and then
the 99th percentile estimated over repeats.
Update:
I have written, in Java using the Math.random() function (subject to any issues regarding how
correctly the generator produces a uniform distribution from 0 to 1),
a simulation of 1000 iterations of 1000 sampled days of the year converted into a month.
It assumes that there are 28 days in February, and 30 or 31 days in the other months, in the
usual way you would expect them in an ordinary western calendar (UK, USA, etc.). So a non-leap year.
I have then selection-sorted the list of 1000 runs, looked at about the last 20 items in
the list, and made a judgement based on things like the 989th, 990th and 991st cases (for the 99th percentile),
then the 995th and 996th cases (for the 99.5th percentile, the more robust two-tailed test).
My conclusion after 5 runs of this 1000-of-1000 simulation:
(1) The 99th percentile I was trying to describe is 113 (once 112, but 113 four times).
(2) The 99.5th percentile I was also looking for is 114 (twice 113, but 114 three times).
It looks like my original normal approximation in post #1 was correct in terms of the logic
of what I was thinking about. My test data that followed was a good borderline case.
The Chi-Squared answer is a good answer to a slightly different question.
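For anyone who wants to reproduce this, here is a cut-down sketch of that kind of simulation (plain Java, using java.util.Random rather than Math.random(), real month lengths in a non-leap year; the class name and run count are illustrative). Each run allocates 1000 people to months by a uniformly chosen day of the year and records the largest monthly count, and percentiles are read off the sorted list of maximums.

```java
import java.util.Arrays;
import java.util.Random;

// Repeatedly allocates 1000 people to months by random day of a non-leap year,
// records the largest monthly count per run, and prints some percentiles.
public class MaxMonthSimulation {
    public static void main(String[] args) {
        int[] daysInMonth = {31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31};
        int runs = 100000, people = 1000, daysInYear = 365;
        Random rng = new Random();
        int[] maxPerRun = new int[runs];
        for (int r = 0; r < runs; r++) {
            int[] counts = new int[12];
            for (int i = 0; i < people; i++) {
                int day = rng.nextInt(daysInYear);      // uniform day of the year, 0..364
                int month = 0;
                while (day >= daysInMonth[month]) {     // convert day number to month
                    day -= daysInMonth[month];
                    month++;
                }
                counts[month]++;
            }
            int max = 0;
            for (int c : counts) max = Math.max(max, c);
            maxPerRun[r] = max;
        }
        Arrays.sort(maxPerRun);
        System.out.println("median            : " + maxPerRun[(int) (0.500 * runs)]);
        System.out.println("99th percentile   : " + maxPerRun[(int) (0.990 * runs)]);
        System.out.println("99.5th percentile : " + maxPerRun[(int) (0.995 * runs)]);
    }
}
```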
Suppose there were 12 colours {red, yellow, blue, green, orange, purple, brown, black, white, grey, pink, indigo},
1000 people were surveyed and asked to give their favourite, and a test was done to see whether the result
could have come from all 12 colours being equally likely (p = 1/12);
then the chi-squared test would be a very good test to do.
Provided that it is acceptable to think of the months as categories, ignoring the fact that time
is a major feature and that months merge into each other gradually in terms of the physical (e.g. climate)
effects that are thought to influence the thing being studied, the
chi-squared test would have done, but really it does not model the situation well. For instance, all
of the monthly deviations are taken into account, not just the 'maximum' month.
However, for the chance of an unnamed highest frequency exceeding a threshold T to be set
at 1%, the question "what is the borderline threshold T?" would need my version of the problem.
Using this sort of analysis with data involving months is usually a study-design weakness, but
perhaps there are exceptional cases where it is okay (I cannot think of any).
Last edited by SteveB (2013-07-16 09:27:57)
Hi;
I do not know whether this will help or not but the first question appears to be
Consider taking a set of n values from a set containing m integers.
Let us take a pair {1,2} from {1,2,3,4,5}.
You can always define a maximum of the set {1,2} in this case it is 2.
Therefore it is reasonable to ask: What is the median and/or mean maximum of the pair taken from {1,2,3,4,5}?
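If the n values are drawn without replacement from {1, ..., m}, then the maximum equals k with probability C(k-1, n-1) / C(m, n), and summing k times this probability over k = n, ..., m gives

mean of the maximum = n(m + 1) / (n + 1)

For the pair example above (n = 2, m = 5) this gives 2 x 6 / 3 = 4, agreeing with the list of maximums.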
I have checked your formula for the mean in the case: n = 3 and m = 8
I worked out the mean of 6.75 using my rather tedious method, and your formula also reassuringly gave 6.75 for this case.
I will therefore assume that the formula is correct and works for the mean.
Now what about median (50th percentile) and other percentiles like 99th percentile (and 99.5, 95, 97.5) ?
The question for that part is harder than I had bargained for, because the set we are
taking the numbers from is not continuous, and the formulas are correspondingly discrete, so we cannot just integrate
a formula and use an area-under-the-curve argument to define a percentile, such as, for the median:
median = (the point at which the area up to that point, from the lowest value of the range, equals 0.5).
With n = 3 and m = 8:
The boring answer is 7, on the basis of (56 + 1)/2 = 28.5, then taking the average of the 28th and 29th of the 56 data items.
This is okay for an integer version of the median problem, but some conventions might want a curve fitted, then some
transformed normal or another continuous approximation used.
If we use the integer version, then from about the 64th percentile onwards the answer is just 8.
I was hoping for a continuous analogue to give some answer in between 7 and 8 for the 80th, 90th, 95th percentile, etc.
In other words, something like this for the 90th percentile: (56 + 1) * 0.9 = "51.3th item" (informally).
Then some clever weighted average is done and ends up with something like 7.73,
e.g. (27 * 8) + (10 * 7) = 286,
then 286 / 37 = 7.72972... = 7.73 to 2 d.p.,
based on the cut-off between 7 and 8 at around the 63rd to 64th percentile, and the fact that (90 - 63) = 27 and (100 - 90) = 10;
then invert the balance with 27 * 8 and 10 * 7 and divide by the total weight of 37.
However, that was just one thought I had about that particular case, and it might give some strange results in others.
It either needs formalising, or maybe it just doesn't give the right statistical answers.
It is difficult enough trying to do that for a simple case. The months case is a lot more difficult because there are
two phases of random generation: first populate 12 values using 1000 samples and take the maximum, then
repeat a large number of times (e.g. 100, 1000, etc.) and work out a 99th percentile, or perhaps the 99.5th percentile.
It would be a struggle to derive a general formula for a percentile even for the case n = 3 and m = 8,
and that was with a uniform sequence {1,2,3,4,5,6,7,8}, selecting 3 and taking the maximum.
The set with the months is not a uniform set of 12 numbers; the counts are distributed multinomially from 1000 samples.
Last edited by SteveB (2013-08-03 03:40:12)
With the case of taking 3 items from the 8 numbers from one to 8, that is n = 3 and m = 8, and with the example of trying
to calculate the 90th percentile, I have worked out this illustration of using a continuous approximation and the
area under the curve:
A good approximation to the frequencies is f(x) = (x - 1)(x - 2)/2, which is exact at the integer points, since the frequency of a maximum of x is C(x-1, 2).
By the area-under-a-curve definition, the 90th percentile p satisfies: (area under f from 3 to p) = 0.9 * (area under f from 3 to 8).
Using the area-under-the-curve formula and solving the resulting cubic polynomial, I am getting p = 7.7803 (to 4 d.p.).
Looks like a sensible answer, and close to my rough estimate in the previous post.
By a similar process using the above approximation I am getting 6.6912 (to 4 d.p.) for the median.
Last edited by SteveB (2013-08-03 19:43:28)
You could argue that the curve between the points is not really justified, and that perhaps a straight line between them
is better, so you could start with the frequency chart and then draw a triangles-and-rectangles chart from it by joining
the points (X, Frequency) for X = 3 to X = 8.
X Frequency
3 1
4 3
5 6
6 10
7 15
8 21
The rectangle areas are: 1 + 3 + 6 + 10 + 15 = 35
The triangle areas are: 0.5 * (2 + 3 + 4 + 5 + 6) = 10
For the 90th percentile:
45 * 0.9 = 40.5
The first four rectangles are: 1 + 3 + 6 + 10 = 20
The first four triangles are: 0.5 * (2 + 3 + 4 + 5) = 7
Total of first four = 27
So 13.5 more needed to make up the 90th percentile.
This is out of 18 remaining.
Using a diagram I concluded that 15x + 3x^2 = 13.5.
The valid solution is x = 0.778719262151...
Hence, by adding 7 to this, my percentile estimate was 7.7787 (to 4 d.p.), assuming I haven't made any mistakes with that.
By a similar argument for the 50th percentile (the median) I am getting 6.683281573,
or 6.6833 to 4 d.p.
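That triangles-and-rectangles construction can be mechanised. Here is a plain-Java sketch (class and method names are illustrative) that treats the frequency chart as a frequency polygon, accumulates trapezium areas segment by segment, and solves the quadratic inside the segment containing the target area; it reproduces roughly 7.7787 for the 90th percentile and 6.6833 for the median.

```java
// Percentile of a piecewise-linear frequency polygon built on points (x, frequency):
// accumulate trapezium areas, then solve a quadratic inside the relevant segment.
public class FrequencyPolygonPercentile {
    static double percentile(double[] x, double[] f, double q) {
        double total = 0.0;
        for (int i = 0; i + 1 < x.length; i++)
            total += (f[i] + f[i + 1]) / 2.0 * (x[i + 1] - x[i]);
        double target = q * total, cum = 0.0;
        for (int i = 0; i + 1 < x.length; i++) {
            double w = x[i + 1] - x[i];
            double segment = (f[i] + f[i + 1]) / 2.0 * w;
            if (cum + segment >= target) {
                // area from x[i] to x[i] + t is f[i]*t + 0.5*slope*t^2; solve for t
                double slope = (f[i + 1] - f[i]) / w;
                double a = slope / 2.0, b = f[i], c = -(target - cum);
                double t = (a == 0) ? -c / b
                                    : (-b + Math.sqrt(b * b - 4 * a * c)) / (2 * a);
                return x[i] + t;
            }
            cum += segment;
        }
        return x[x.length - 1];
    }

    public static void main(String[] args) {
        double[] x = {3, 4, 5, 6, 7, 8};
        double[] f = {1, 3, 6, 10, 15, 21};   // frequencies of the maximum for n = 3, m = 8
        System.out.println("median          ~ " + percentile(x, f, 0.50));  // ~6.6833
        System.out.println("90th percentile ~ " + percentile(x, f, 0.90));  // ~7.7787
    }
}
```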
The discrete integer median is 7 and the discrete integer 90th percentile is 8.
The non integer versions are really approximations based upon a continuous curve that either passes through the
points or gives a good approximation of the distribution. In this case I have made a curve through the points for the
first pair of estimates, and a series of straight lines for the second.
The original solution that I gave to the 12-months maximum problem used an approximation based on the normal
distribution model, without any transformation apart from using an appropriate standard deviation and mean.
It assumed that the distribution was symmetrical, whereas in reality it has a right skew, caused by the greater room for
maximums above the mean compared with the squashed distribution below it. I could try transforming the
normal curve, and some statistics software you can get nowadays makes this easy to do. I should think that the 99.5th
percentile is either 114 or 115, but it is difficult to be more exact without better software or writing some code myself
to do a better simulation. For the time being I am calling it 114.
(That was the 99.5th percentile estimate of maximums taken from 1000 samples put into 12 month categories randomly.)
Last edited by SteveB (2013-08-07 20:42:21)
Here are 10,000 repeats of 1000 samples, each repeat recording the count in the highest month, with a frequency given for each value
of that maximum. My first proper run of this, after debugging, had a result of 114 for the 99.5th percentile,
but it was very close to 115 being the 99.5th percentile:
84 frequency 0
85 frequency 0
86 frequency 0
87 frequency 0
88 frequency 5
89 frequency 33
90 frequency 92
91 frequency 173
92 frequency 308
93 frequency 491
94 frequency 589
95 frequency 761
96 frequency 793
97 frequency 875
98 frequency 872
99 frequency 842
100 frequency 725
101 frequency 691
102 frequency 590
103 frequency 463
104 frequency 344
105 frequency 315
106 frequency 268
107 frequency 202
108 frequency 141
109 frequency 118
110 frequency 84
111 frequency 79
112 frequency 48
113 frequency 25
114 frequency 24
115 frequency 11
116 frequency 13
117 frequency 12
118 frequency 3
119 frequency 4
120 frequency 1
121 frequency 1
122 frequency 2
123 frequency 1
126 frequency 1
Median: 99
99.5th percentile: 114
Notice that 49 items are 115 or higher, so with only one run of the code, 115 could still be argued to be a good answer.
Technically the 99.5th percentile is 114, but it is so close that more data is needed.
EDIT/UPDATE: With an adjustment to the program to make it do 100,000 runs, I have decided that 115 is the 99.5th percentile,
although it is not exactly clear-cut even then, with 400 runs of 116 or higher and 553 runs of 115 or higher.
Assuming that run of 100,000 is typical, though, it is reasonable to call the 99.5th percentile 115.
So for a strong two-tailed result (at the 1% significance level), at least 115 is needed in the highest month.
In the one tailed version at the 1% significance level, at least 113 is needed for a strong result according to the 100,000 repeats run.
(That is to say the 99th percentile was 113.)
For a two tailed (97.5th percentile) moderate evidence (5% level significance) result at least 110 was needed.
For a one tailed (95th percentile) moderate evidence (5% level significance) result at least 108 was needed.
The 90th percentile was 105, the median (50th percentile) was 98.
Last edited by SteveB (2013-08-08 08:34:57)
For the somewhat easier question (2) my answer is:
December has 31 days, January has 31 days, February has 28 days (assuming the study is done on a non leap year).
31 + 31 + 28 = 90 days in winter period as defined in the question
365 - 90 = 275 days in non winter period
Let the sample size be n = 1000
mean in winter period using binomial model = np = (90/365) x 1000 = 246.57....
variance using binomial model = npq = (90/365) x (275/365) x 1000 = 185.775....
standard deviation = 13.62996....
Use the normal approximation with 2.33 standard deviations above the mean for a 1% significance test, one-tailed:
2.33 x 13.62996... + 246.57... = 278.33...
Rounding up to allow a little extra, 279 would be my answer (using the normal approximation) to question 2.
So at least 279 cases of catching a cold in winter out of 1000 cases of catching a cold would give strong evidence
in a one tailed scientific study.
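The same arithmetic as a quick plain-Java check (z = 2.33 hard-coded from tables, no continuity correction; the class name is illustrative):

```java
// Normal approximation to Binomial(1000, 90/365): threshold for a one-tailed
// test at the 1% significance level, no continuity correction.
public class WinterColdThreshold {
    public static void main(String[] args) {
        double n = 1000.0, p = 90.0 / 365.0, q = 1.0 - p;
        double mean = n * p;                 // 246.57...
        double sd = Math.sqrt(n * p * q);    // about 13.63
        double z = 2.33;                     // upper 1% point of N(0,1)
        double threshold = mean + z * sd;    // 278.33...
        System.out.println("mean = " + mean + ", sd = " + sd);
        System.out.println("need at least " + (int) Math.ceil(threshold) + " winter cases out of 1000");
    }
}
```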
In practice the scientist would probably want to log every exact date, combined with the recent weather conditions for each case,
including the maximum and minimum temperatures of the days just before the cold was caught, and then do some sort of regression analysis of
temperature against the number of cold cases per unit of time. Exactly how to do a hypothesis test for that data I am not
sure at the moment. I think something similar did get taught in my course in 2006, but I cannot remember the details at present.
I have lost some of my notes and materials, but do still have the textbook that came with the course.
A scattergraph would probably show an interesting picture, but how to formalize it I am not sure. I think it was made easy
with some software which avoided the need for hand calculations.
Last edited by SteveB (2013-08-10 08:02:46)