AS5001 (SUPA-AAA) Advanced (Astronomical) Data Analysis

Problem Set 1, due 9 am Mon 3 Oct 2022

Problem 1. Histogram and Cumulative Distributions of Random Variables [25]

(a) Be encouraged to build up a toolkit of function definitions (subroutines) that you can then use again and again in different contexts. Write a function to generate a random number distributed uniformly between specified limits A and B. Document the inputs and outputs of each subroutine. For example, in Python your function RANU might start with comments as follows:

Note: This should be very easy because you can use the numpy random number generator, e.g. np.random.uniform, or equivalent, to obtain random numbers distributed uniformly between 0 and 1, and np.random.seed to set the random number generator seed.

Use RANU to generate 10 random numbers uniformly distributed between -10 and +10. Print those 10 values. Now use RANU a second time to re-generate the same sequence of random numbers and print them to show that they are the same as the first sequence.

(b) Plot the binned histogram and cumulative (staircase) distribution function for $N$ samples drawn from your uniform random number generator RANU. The staircase should step up at each of the $N$ values, not just at the histgram bin boundaries, so that the finite bin size of your histogram does not degrade the resolution of your cumulative distribution. Make plots for $N = 10; 100; 1000; 10^4$.

(c) Write similar functions rang(avg,sig,iseed) for Gaussian and rane(tau,iseed) for Exponential random numbers. These should transform values from ranu into samples from the Gaussian and Exponential distributions. Make plots as in (b) for $N = 10^4$.

Problem 2. Mean and Variance, Median and MAD [25]

(a) Write a function that computes the the sample mean and sample variance of an input array of data values. Take care to ensure that the sample variance is an unbiased estimator. As always, describe the input and output arguments of the surbroutine in comments at the top.

(b) Write a function to compute the median, and the mean-absolute-deviation (MAD) of an array of data. Note: To compute the median, sort the data values, find the value half-way through the sorted list, and interpolate if needed.

Sanity-check MEANVAR and MEDMAD using e.g. (1,2,3) and (1,2,3,4) as input.

(c) I'll supply a dataset from part of a zero-exposure CCD frame. Use your routines meanvar(x) and medmad(x) to compute statistics that estimate the CCD bias level (mean and median) and the readout noise (standard deviation and mean-absolute-deviation). Summarise your results in a table, giving also a 1-$\sigma$ error bar for your sample mean, for the standard deviation, and (if possible) for the median.

Problem 3. Central Moments of a Distribution [25]

(a) Write a subroutine that computes statistics that estimate the mean, variance, and higher-order central moments of a dataset.

(b) Use ranu(a,b), rang($\mu,\sigma$) and rane($\tau$) (Problem 1) to generate $N=10^6$ random samples of Uniform, Gaussian, and Exponential distributions. Use moments(x,m) to compute the first 4 central moments in each case. Derive analytic formulae for the first 4 central moments in terms of the parameters of each distribution. (The required integrals are not too difficult. If you get stuck, for partal credit, try http://integrals.wolfram.com.) Prepare a table comparing your numerical and analytic results (test at least 2 sets of input parameters) to demonstrate the accuracy of your subroutines, and comment on reasons for any discrepancies.

Problem 4. Monte-Carlo Simulation to compare Sample Means and Medians [25]

(a) The median $X_{\rm med}$ is a "robust" statistic, less sensitive to large outliers than the sample mean $\overline{X}$. For this reason, a "running median filter" is often used to "smooth" or remove "spikes" from a light curve or spectrum. Medians are also often used to combine a series of CCD frames so that the result is not badly affected by cosmic ray hits. However, $X_{\rm med}$ is a noisier statistic than $\overline{X}$, as you will discover in the following Monte-Carlo calculation.

Use your subroutine rang($\mu$,$\sigma$) to draw $N$ samples from a Gaussian distribution with $(\mu,\sigma)=(0,1)$. Calculate the sample mean $\overline{X}$ and median $X_{\rm med}$ for the set of $N$ samples. Repeat the above calculation $M$ times, and characterise the distributions of $X_{\rm med}$ and $\overline{X}$ by calculating their mean values and standard deviations over the $M$ samples. Choose $M$ large enough to keep uncertainties below 1% of the standard deviations. Make a log-log plot showing from your Monte-Carlo simulation how the standard deviations of $X_{\rm med}$ and $\overline{X}$ vary with $N$ for $N=1,2,3,...,100$. (If this calculation takes a very long time, try to improve the efficiency of your code.) Which statistic has the larger variance, $X_{\rm med}$ or $\overline{X}$? Based on your results, decide whether it is better to take medians with an odd or even number of data points. Explain your reasoning.

(b) For a sample of $N$ Gaussian random variables (each with mean 0 and standard deviation $\sigma$), derive an analytic formula in terms of $N$ and $\sigma$ for $\sigma^2(\overline{X})$, the variance of the sample mean. Derive a similar formula for $\sigma^2(X_{\rm med})$, in the limit of large $N$. Which is larger, $\sigma^2(\overline{X})$ or $\sigma^2(X_{\rm med})$? Compare your formulae with the large-$N$ results from your Monte-Carlo simulation.