A Greedy Algorithm for Representative Sampling: repsample in Stata
Quantitative empirical analyses of a population of interest usually aim to estimate the causal effect of one or more independent variables on a dependent variable. However, only in rare instances is the whole population available for analysis. Researchers tend to estimate causal effects on a selected sample and generalize their conclusions to the whole population. The validity of this approach rests on the assumption that the sample is representative of the population on certain key characteristics. A study using a non-representative sample lacks external validity because it fails to minimize population choice bias. When the sample is large and non-response bias is not an issue, a random selection process is adequate to ensure external validity. If that is not the case, however, researchers could follow a more deterministic approach to ensure representativeness on the selected characteristics, provided these are known, or can be estimated, in the parent population. Although such approaches exist for matched sampling designs, research on representative sampling and the similarity between the sample and the parent population seems to be lacking. In this article we propose a greedy algorithm for obtaining a representative sample and quantifying representativeness in Stata.