PREP Quickstart, Statistics and Distributions
system:sage


<h1 style="font-size: 2em;">Sage Quickstart for Statistics</h1>
<p>This&nbsp;<a href="http://www.sagemath.org" target="_blank">Sage</a>&nbsp;worksheet was developed for the MAA PREP Workshop "Sage: Using Open-Source Mathematics Software with Undergraduates" (funding provided by NSF DUE 0817071).</p>
<p>Although Sage began as a project in algebra and geometry, it has many functions for statistics and finance. Particularly due to the&nbsp;<a href="http://www.r-project.org" target="_blank">R project</a>&nbsp;being a component of Sage, we have very powerful statistical techniques at our disposal. &nbsp; However, some basic things are built right in.</p>

{{{id=1|
mean([1,2,3,5])
///
11/4
}}}

{{{id=3|
std([1,2,2,4,5,6,8]) # The standard deviation
///
sqrt(19/3)
}}}

<p>Once we get beyond such things, it will be slightly more complicated. &nbsp;Even if you are just trying to generate a random sample from a given type of continuous distribution, there are several ways to do this. &nbsp;In the first example, we use the simplest method of generating random elements from a log normal distribution with (normal) mean 2 and $\sigma=3$. &nbsp;Notice that there is really no way around making some kind of loop.</p>

{{{id=6|
my_data=[lognormvariate(2,3) for i in range(10)]
my_data
///
[1.2200073608568685, 0.99262658944848614, 3.4397547697767741, 3.4838891195212311, 19654.894101139962, 26.815549450474581, 164.14210210663543, 774.31598178441175, 57.1938092064524, 11.286595257004107]
}}}

<p>We can check whether the mean of the log of the data is close to 2.</p>

{{{id=19|
mean([log(item) for item in my_data])
///
3.4072843137524691
}}}

<p>In the following example, we let 'dist' be the variable assigned to a continuous Gaussian/normal distribution with standard deviation of 3, and then use the '.get_random_element()' method ten times, adding 2 each time so that the mean is equal to 2. This uses the <a href="http://www.gnu.org/software/gsl/" target="_blank">Gnu scientific library</a>&nbsp;under the hood.</p>

{{{id=8|
dist=RealDistribution('gaussian',3)
my_data=[dist.get_random_element()+2 for _ in range(10)]
my_data
///
[2.91919853699, 0.726840961836, 1.38635997794, 0.713787150357, 5.9164163059, 3.94242954162, 5.58270406261, -3.22142134694, 2.78213058546, 1.19652246063]
}}}

<p>Unfortunately, we don't yet have as easy access to discrete distributions. &nbsp;To find these, we access one of several other pieces of Sage which have statistics built into them. &nbsp;One of these is <a href="http://www.scipy.org" target="_blank">Scipy</a>,&nbsp;which we have to 'import'. &nbsp;</p>
<p>Here, we are calling 'binom_dist' the binomial distribution with 20 trials and 5% expected failure rate. &nbsp;The '.pmf(x)' method gives the probability of $x$ failures, which we then plot in a bar chart for $x$ from 0 to 20. &nbsp;(Note that 'range(21)' means all integers from <em>zero to twenty</em>!)</p>

{{{id=11|
import scipy.stats
binom_dist = scipy.stats.binom(20,.05)
bar_chart([binom_dist.pmf(x) for x in range(21)])
///
<html><font color='black'><img src='cell://sage0.png'></font></html>
}}}

<p>Scipy's statistics can do other things too. &nbsp;Here, we find the median (as the fiftieth percentile) of an earlier data set.</p>

{{{id=12|
scipy.stats.scoreatpercentile(my_data, 50)
///
2.0842452816975765
}}}

<p>The key thing to remember here is to look at the documentation! &nbsp;Particularly for Scipy, not everything in Sage is "wrapped" with an easy command, so you may have to do some experimentation. &nbsp;This is also a place where we would be looking for volunteers to help improve things!</p>
<p>There are several other pieces of Sage that have statistical capabilities, but by far the most important is the&nbsp;<a href="http://www.r-project.org" target="_blank">R project</a>, which is the industry and academic standard for statistical analysis of <em>all</em>&nbsp;kinds. &nbsp;We'll finish our 'Quickstart' with this.</p>
<p>Even here, there are several ways to access R. &nbsp;One of the easiest is to just put 'r()' around things you want to make into statistical objects, and then use R commands via 'r.method()' to pass them on to Sage for further processing. &nbsp;This is particularly nice if you will use the results in something else. &nbsp;However, it isn't always easy to do very complicated commands using this interface.</p>
<p>The following example of the Kruskal-Wallis test comes directly from the examples in 'r.kruskal_test?' in the notebook. &nbsp;</p>

{{{id=14|
x=r([2.9, 3.0, 2.5, 2.6, 3.2]) # normal subjects
y=r([3.8, 2.7, 4.0, 2.4])      # with obstructive airway disease
z=r([2.8, 3.4, 3.7, 2.2, 2.0]) # with asbestosis
a = r([x,y,z]) # make a long R vector of all the data
b = r.factor(5*[1]+4*[2]+5*[3]) # create something for R to tell which subjects are which
a; b # show them
///
 [1] 2.9 3.0 2.5 2.6 3.2 3.8 2.7 4.0 2.4 2.8 3.4 3.7 2.2 2.0
 [1] 1 1 1 1 1 2 2 2 2 3 3 3 3 3
Levels: 1 2 3
}}}

{{{id=18|
r.kruskal_test(a,b) # do the KW test!
///
	Kruskal-Wallis rank sum test

data:  sage36 and sage2 
Kruskal-Wallis chi-squared = 0.7714, df = 2, p-value = 0.68
}}}

<p>Looks like we can't reject the null hypothesis here.</p>
<p>Finally, the best way to use R seriously is to simply ask each individual cell to evaluate completely in R, using a so-called "percent directive". &nbsp;Here is a sample linear regression from John Verzani's <a href="http://cran.r-project.org/doc/contrib/Verzani-SimpleR.pdf" target="_blank">simpleR</a> text. &nbsp; Notice that R also uses the '#' symbol to indicate comments.</p>

{{{id=16|
%r
x = c(18,23,25,35,65,54,34,56,72,19,23,42,18,39,37) # ages of individuals
y = c(202,186,187,180,156,169,174,172,153,199,193,174,198,183,178) # maximum heart rate of each one
png() # turn on plotting
plot(x,y)	# make a plot 
lm(y ~ x) # do the linear regression
abline(lm(y ~ x))	# plot the regression line
dev.off()     # turn off the device so it plots
///
Call:
lm(formula = y ~ x)

Coefficients:
(Intercept)            x  
   210.0485      -0.7977  


null device 
          1 
}}}

<p>To get a whole worksheet to evaluate in R (and be able to ignore the '%'), you could also drop down the 'r' option in the menu close to the top which currently has 'sage' in it.</p>
<p>There are many texts (and even <a href="http://www.maa.org/prep/2010/stat_thinking.html" target="_blank">another PREP workshop</a> this summer) which discuss using R in introductory statistics, so this 'Quickstart' should be just the beginning if you are interested!</p>

{{{id=17|

///
}}}