# Enable inline plotting in notebook
%matplotlib inline
# Populate namespace with numerical python function library and matplotlib plotting library.
import numpy as np
import matplotlib.pyplot as plt
(a) Be encouraged to build up a toolkit of function definitions (subroutines) that you can then use again and again in different contexts. Write a function to generate a random number distributed uniformly between specified limits A and B. Document the inputs and outputs of each subroutine. For example, in Python your function RANU might start with comments as follows:
def ranu(a,b,iseed):
# Return on each call a new random number sampling
# a uniform distribution between limits A and B,
# and a new value of the seed integer ISEED.
# Input:
# a (float) lower limit of boxcar
# b (float) upper limit of boxcar
# iseed (int) seed integer used for the random number generator
# Output:
# ranu (float) uniform random number
.
.
.
return ranu
Note: This should be very easy because you can use the numpy random number generator, e.g. np.random.uniform, or equivalent, to obtain random numbers distributed uniformly between 0 and 1, and np.random.seed to set the random number generator seed.
Use RANU to generate 10 random numbers uniformly distributed between -10 and +10. Print those 10 values. Now use RANU a second time to re-generate the same sequence of random numbers and print them to show that they are the same as the first sequence.
(b) Plot the binned histogram and cumulative (staircase) distribution function for $N$ samples drawn from your uniform random number generator RANU. The staircase should step up at each of the $N$ values, not just at the histgram bin boundaries, so that the finite bin size of your histogram does not degrade the resolution of your cumulative distribution. Make plots for $N = 10; 100; 1000; 10^4$.
(c) Write similar functions rang(avg,sig,iseed) for Gaussian and rane(tau,iseed) for Exponential random numbers. These should transform values from ranu into samples from the Gaussian and Exponential distributions. Make plots as in (b) for $N = 10^4$.
(a) Write a function that computes the the sample mean and sample variance of an input array of data values. Take care to ensure that the sample variance is an unbiased estimator. As always, describe the input and output arguments of the surbroutine in comments at the top.
def meanvar(x):
# compute sample mean and sample variance of data array x
# In: x (float) array of data values
# Out: avg (float) sample mean of data values
# var (float) sample variance of the data
n = len(x)
.
.
.
return avg, var
(b) Write a function to compute the median, and the mean-absolute-deviation (MAD) of an array of data. Note: To compute the median, sort the data values, find the value half-way through the sorted list, and interpolate if needed.
def medmad(x):
# compute median and mean-absolute-deviation of data array x
# In: x (float) array of data values
# Out: xmed (float) median of data values
# xmad (float) mean-absolute deviation (relative to median)
n = len(x)
.
.
.
return xmed, xmad
Sanity-check MEANVAR and MEDMAD using e.g. (1,2,3) and (1,2,3,4) as input.
(c) I'll supply a dataset from part of a zero-exposure CCD frame. Use your routines meanvar(x) and medmad(x) to compute statistics that estimate the CCD bias level (mean and median) and the readout noise (standard deviation and mean-absolute-deviation). Summarise your results in a table, giving also a 1-$\sigma$ error bar for your sample mean, for the standard deviation, and (if possible) for the median.
import numpy as np
datadir = '/Users/acc4/Downloads/'
file = datadir+ 'bias.dat'
data = np.loadtxt(file,skiprows=6)
x,y,value = data[:,0],data[:,1],data[:,2]
.
.
.
(a) Write a subroutine that computes statistics that estimate the mean, variance, and higher-order central moments of a dataset.
def moments(x, m):
# compute first M central moments of the data array x
# In: x (float) array of data values
# m (int) number of moments
# Out: xmom[0] (float) sample mean of data
# xmom[1] (float) sample variance (unbiased for sigma^2)
# xmom[m-1] (float) m-th central moment, normalised to sigma^m)
n = len(x)
.
.
.
return xmom
(b) Use ranu(a,b), rang($\mu,\sigma$) and rane($\tau$) (Problem 1) to generate $N=10^6$ random samples of Uniform, Gaussian, and Exponential distributions. Use moments(x,m) to compute the first 4 central moments in each case. Derive analytic formulae for the first 4 central moments in terms of the parameters of each distribution. (The required integrals are not too difficult. If you get stuck, for partal credit, try http://integrals.wolfram.com.) Prepare a table comparing your numerical and analytic results (test at least 2 sets of input parameters) to demonstrate the accuracy of your subroutines, and comment on reasons for any discrepancies.
(a) The median $X_{\rm med}$ is a "robust" statistic, less sensitive to large outliers than the sample mean $\overline{X}$. For this reason, a "running median filter" is often used to "smooth" or remove "spikes" from a light curve or spectrum. Medians are also often used to combine a series of CCD frames so that the result is not badly affected by cosmic ray hits. However, $X_{\rm med}$ is a noisier statistic than $\overline{X}$, as you will discover in the following Monte-Carlo calculation.
Use your subroutine rang($\mu$,$\sigma$) to draw $N$ samples from a Gaussian distribution with $(\mu,\sigma)=(0,1)$. Calculate the sample mean $\overline{X}$ and median $X_{\rm med}$ for the set of $N$ samples. Repeat the above calculation $M$ times, and characterise the distributions of $X_{\rm med}$ and $\overline{X}$ by calculating their mean values and standard deviations over the $M$ samples. Choose $M$ large enough to keep uncertainties below 1% of the standard deviations. Make a log-log plot showing from your Monte-Carlo simulation how the standard deviations of $X_{\rm med}$ and $\overline{X}$ vary with $N$ for $N=1,2,3,...,100$. (If this calculation takes a very long time, try to improve the efficiency of your code.) Which statistic has the larger variance, $X_{\rm med}$ or $\overline{X}$? Based on your results, decide whether it is better to take medians with an odd or even number of data points. Explain your reasoning.
(b) For a sample of $N$ Gaussian random variables (each with mean 0 and standard deviation $\sigma$), derive an analytic formula in terms of $N$ and $\sigma$ for $\sigma^2(\overline{X})$, the variance of the sample mean. Derive a similar formula for $\sigma^2(X_{\rm med})$, in the limit of large $N$. Which is larger, $\sigma^2(\overline{X})$ or $\sigma^2(X_{\rm med})$? Compare your formulae with the large-$N$ results from your Monte-Carlo simulation.