- Analytic perspective
- Open Access
Optimisation of the T-square sampling method to estimate population sizes
© Bostoen et al; licensee BioMed Central Ltd. 2007
- Received: 29 September 2006
- Accepted: 01 June 2007
- Published: 01 June 2007
Population size and density estimates are needed to plan resource requirements and health-related interventions. Sampling frames are not always available, necessitating surveys that use non-standard household sampling methods. These surveys are time-consuming and difficult to validate, and their implementation could be optimised. Here, we discuss an example of an optimisation procedure for rapid population estimation using T-Square sampling, which has been used recently to estimate population sizes in emergencies. A two-stage process is proposed to optimise the T-Square method: the first stage optimises the sample size and the second stage optimises the pathway connecting the sampling points. The proposed procedure yields an optimal solution if the distribution of households is described by a spatially homogeneous Poisson process and can be sub-optimal otherwise. This research provides a first step in exploring how optimisation techniques could be applied to survey designs, thereby providing more timely and accurate information for planning interventions.
- Estimate Population Size
- Homogeneous Poisson Process
- Complete Spatial Randomness
- Optimal Sample Size
- Travelling Salesperson Problem
There is a constant need to estimate population size and density for the purposes of planning resource requirements or assessing health needs. For reasons relating to timeliness, cost or practicality, data are often obtained through surveys that aim to collect representative samples. Public health specialists traditionally rely on detailed sample frames to survey populations. There are, however, many situations (such as those relating to displaced populations in emergencies) in which detailed sample frames are either unavailable or unfeasible. Only a small number of sampling methods are suitable for such situations.
Ecological methods, which often do not require a detailed sample frame, can offer practical solutions to household sampling problems and are currently being explored. These methods include sequential sampling techniques to estimate prevalence or program coverage [1, 2], capture-recapture techniques [3, 4], adaptive sampling, T-Square sampling and Catana's wandering quarter method to estimate population size and density.
One of the problems in validating and verifying sampling methods used in situations devoid of sampling frames is the difficulty of analysing the properties of the sampling methods. Traditional optimisation of sampling methods is done using computationally intensive re-sampling techniques such as Monte Carlo (MC) or Latin Hypercube Sampling (LHS) simulations, while experimenting with different permutations of the parameters of the sampling method on simulated or real population data. Further, from a theoretical perspective, there are infinitely many scenarios (covering a wide distribution of household and individual data) for which the sampling method requires validation and verification.
T-Square sampling is a distance-based sampling method whose statistical properties have been thoroughly investigated [9–14]. It has been used in ecology to estimate sizes, densities and deviations from random spatial distributions of mainly plant populations, and more recently it has been used to estimate the size of displaced human populations in emergency situations [6, 16, 17].
Estimating human populations in emergencies by using distance-based methods, such as the T-Square, relies on collecting data on distances between households (shelters) rather than on households per se. Advantages of distance sampling methods include:
Human population density can be estimated even when not every household per unit area is detected;
The same population density estimate can be calculated from data independently collected by multiple observers;
A relatively small number of distances need to be measured.
Two of the substantive issues to be addressed in this paper are whether:
The assumptions on which the T-Square method is originally based for estimating plant population sizes are equally valid for estimating human population sizes;
The T-Square method can be optimised.
T-Square sampling and other distance-based methods
Choosing the appropriate distance-based method for use in human populations requires careful practical and theoretical considerations. Distances within which a surveyor can determine accurately the closest household from a random point or the closest household from a previously selected household are limited. In practice, it could be difficult to identify precisely the location of a household that occupies a large area. Furthermore some sampling methods are more sensitive than others to errors in the measurement of angles and distances. In the T-Square method the sample observations are pre-determined, unlike the wandering quarter method. The wandering quarter method could therefore be more difficult to plan in advance compared to the T-Square method if health data are to be collected from each household.
In addition to T-Square sampling and Catana's wandering quarter method, there are other distance-based methods such as the line-transect and point-transect distance methods [18, 19]. It could be argued that although these methods are well established for estimating abundance of biological populations (plants or animals), extrapolating their use to household surveys would require evaluation. We note, however, that distance-based methods do not replace classical sampling methods where sample frames are available.
Optimisation of the T-Square sampling method
The elements of optimising any household sampling method are the objective function (performance measure) to be optimised (maximised or minimised), the parameters of the method which can be tuned to optimise the objective function, and the constraints that are imposed on the values of these parameters. In the context of optimising the T-Square method this is translated as follows.
The choice of the objective function to be optimised is not arbitrary and should be carefully considered. In real-life applications, a set of empirically-derived objective functions would be proposed and tailored to particular situations. Appendix II derives a simple objective function based on practical considerations. We present several examples of objective functions in the following paragraphs.
The simplest objective functions to be optimised (minimised in this case) are the standard error of the estimate of the average area per household (E) or the "cost" of the sampling (C), defined in a generic sense, as a measure of the "quantity of resources" required for sampling (for example, human resources). We can define an objective function which combines both those functions: T = E + αC where α is a trade-off scalar, or parameter, which has a dual purpose: to scale E and C numerically to the same unit and to weight the relative significance of each of them in terms of the overall performance measure.
An obvious parameter to tune is the number of sampling points (m). Both terms (E and C) in the above combined objective function depend on m. We would expect E(m) to decrease monotonically with respect to m and C(m) to increase monotonically with m thus providing a trade-off in the choice of m to be optimised.
A key assumption in the optimisation analysis is that the distribution of the households can be described adequately by a two-dimensional spatially homogeneous Poisson process (Appendix I). In using the T-Square method, there is a potential bias in the estimate of the household density (mean number of households per unit area) if the Poisson assumption does not hold. The standard error term E(m) is proportional to 1/√m provided the sampling points are well spaced. The constant of proportionality, however, will depend on the underlying distribution and therefore would influence the optimal solution. Unlike the expression for E(m), the expression for C(m) is derived from practical considerations. The constraints on m are usually in the form of simple bounds on the sample size, for example greater than zero but less than 60.
The minimisation was carried out in Mathematica using a standard non-linear programming optimisation algorithm. The optimal sample size (to the nearest integer) is m* = 58.
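The first-stage trade-off can be illustrated with a minimal Python sketch. It assumes E(m) = k/√m (the usual square-root scaling of a standard error) and a hypothetical linear cost C(m) = unit_cost × m; the constants k, alpha and unit_cost are illustrative placeholders, not the values used in the paper, so the sketch is not expected to reproduce m* = 58.

```python
import math


def total_objective(m, k=1.0, alpha=0.01, unit_cost=1.0):
    """Combined objective T(m) = E(m) + alpha * C(m).

    Assumptions (not taken from the paper): E(m) = k / sqrt(m) and
    C(m) = unit_cost * m, with illustrative constants.
    """
    standard_error = k / math.sqrt(m)
    cost = unit_cost * m
    return standard_error + alpha * cost


def optimal_sample_size(m_max=59, **params):
    """Exhaustive search over the integers 1..m_max (0 < m < 60)."""
    candidates = range(1, m_max + 1)
    return min(candidates, key=lambda m: total_objective(m, **params))


if __name__ == "__main__":
    print(optimal_sample_size())            # interior optimum
    print(optimal_sample_size(alpha=1e-6))  # cost negligible: upper bound binds
```

Because m is a bounded integer, an exhaustive search is sufficient here; a non-linear programming routine becomes useful only when the objective has several continuous parameters. With these placeholder constants the stationary point of T is at m = 50^(2/3) ≈ 13.6, and when the cost weight is made very small the upper bound on m becomes the binding constraint.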
The two previous simulations were concerned with optimising sample size. Once the optimal sample size is determined, one can envisage a second optimisation stage whose aim is to select the optimal pathway for data collection. This could be required in practice for operational reasons and is not necessarily reflected in the cost function of the first stage optimisation problem. The optimal pathway is defined as the shortest pathway connecting all the sampling points. It is assumed here that one observer would be carrying out the survey.
The optimisation is concerned with computing the shortest pathway that connects all the sampling points. This is a classical problem in combinatorial optimization known as the "Travelling Salesperson Problem". The problem is to determine the least-distance route taken by a salesperson to visit a fixed number of cities in which each city is visited once only and in which the trip starts and ends at the same point. The Travelling Salesperson Problem (TSP) is not easy to solve (computational difficulty increases with the number of cities) and there is extensive literature on fast and efficient numerical algorithms used to solve both the classical version and more complex variations of the TSP [22, 23].
Here, we solved the TSP in Mathematica [20, 24] using simulated annealing. Simulated annealing is a stochastic approach to finding the global solution of an optimization problem that may have multiple local solutions. In this approach, an optimal solution is sought iteratively: at each step a point in the neighbourhood of the current solution is selected at random, and the search in subsequent steps is directed towards improving the value of the objective function without becoming trapped in a local solution. It has been found that simulated annealing has several advantages over other optimization methods for solving the TSP. (Additional information and an illustration of simulated annealing are available at http://www.cs.sandia.gov/opt/survey/sa.html.)
Because of the strict condition of complete randomness demanded by the T-Square sampling method, it is unlikely that this method would always be applicable. Catana's method could prove a valid alternative in the sense that it does not require complete spatial randomness; however, no results have been published for its use in human populations. As in the case of the T-Square method, Catana's method also has some restrictions in practice, as discussed previously.
The purpose of this paper was to illustrate the principle of optimising a household sampling method in situations where sampling frames are unavailable. We chose the T-Square method as the exemplar because it holds promise for estimating population sizes in such situations. The optimisation of the T-Square method was demonstrated using a simple illustrative example depicting scenarios that are faithful to the basic assumption of the method, namely that the distribution of the households can be described by a two-dimensional homogeneous Poisson process. If this assumption does not hold, then the proposed optimisation procedure would likely be sub-optimal. Further work should investigate optimising the T-Square method in scenarios that are more realistic and in situations in which the distribution of the households is not described by a spatially homogeneous Poisson process.
The rigorous optimisation approach, which was demonstrated here on the T-Square method, can be applied to any other sampling method. Traditionally, sampling methods were validated using computer simulations and were not formally optimised. The scope of the traditional computing-intensive approaches is somewhat limited, and a mathematical approach to validation and optimisation is warranted.
Optimisation of sampling methods provides important information for surveys in contexts where sampling frames are not available. These techniques may be contained within computer software used by field survey teams without requiring technical knowledge of the algorithm. That is, a user interface that allows survey teams to enter their objective function and generate an optimal survey strategy can hide the formulae, making the techniques easier for non-technical survey teams to use. Instead of being asked to define the objective function, survey teams could be led through a set of heuristics which provide the number of points to be sampled. For example, in the case of the T-Square method, if the distribution of dwellings is uniform (e.g. as in a street-structured refugee camp) then sample m1 points; if the distribution of dwellings is clumped (e.g. as in a village-structured refugee camp) then sample m2 points. Another way to envision this step would be to pose a similar set of heuristic questions whose answers are then translated into an objective function behind the user interface. The second stage of optimisation, the travelling salesperson problem, could likewise be contained within computer software and adapted for use in the field. These heuristics could be tailored to the key issues at hand in other sampling methods.
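The second-stage search can be sketched in Python as follows. This is a generic simulated annealing routine with segment-reversal (2-opt style) moves and a geometric cooling schedule; it is not the Mathematica package used by the authors, and the function names, cooling parameters and step count are illustrative assumptions.

```python
import math
import random


def tour_length(points, order):
    """Length of the closed tour visiting points in the given order."""
    n = len(order)
    return sum(
        math.dist(points[order[i]], points[order[(i + 1) % n]])
        for i in range(n)
    )


def anneal_tour(points, n_steps=20000, t_start=1.0, t_end=1e-3, seed=0):
    """Approximate a shortest closed tour by simulated annealing.

    Assumed design choices: segment-reversal neighbourhood, geometric
    cooling from t_start to t_end, Metropolis acceptance rule.
    """
    rng = random.Random(seed)
    n = len(points)
    order = list(range(n))
    current = tour_length(points, order)
    best_order, best = order[:], current
    for step in range(n_steps):
        temperature = t_start * (t_end / t_start) ** (step / n_steps)
        # Neighbour: reverse the segment between two random positions.
        i, j = sorted(rng.sample(range(n), 2))
        candidate = order[:i] + order[i:j + 1][::-1] + order[j + 1:]
        candidate_length = tour_length(points, candidate)
        delta = candidate_length - current
        # Always accept improvements; accept worse tours with a
        # probability that shrinks as the temperature falls.
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            order, current = candidate, candidate_length
            if current < best:
                best_order, best = order[:], current
    return best_order, best


if __name__ == "__main__":
    rng = random.Random(42)
    sampling_points = [(rng.random(), rng.random()) for _ in range(20)]
    route, length = anneal_tour(sampling_points)
    print(route, round(length, 3))
```

Accepting occasional uphill moves is what lets the search leave a locally shortest tour. The routine returns the best tour seen, which is not guaranteed to be the global optimum.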
The T-Square sampling method can be described simply with reference to figure 3. We assume that individuals live in households that are not enumerated (i.e. there is no sampling frame). In emergencies, impromptu shelters grouped haphazardly represent households. Points H1, H2 and H3 represent the locations of three of the households. The region of interest (Ω) could contain n households (H1...Hn). Point S1 represents an arbitrarily chosen point in Ω. It is one of m sampling points (S1...Sm), which are generated randomly and used as anchors for the estimation method.
Recall the description of figure 3. C is the straight line joining S1 to the nearest household (H1). Q is the line perpendicular to C at household H1. Q partitions the Ω plane into two semi-planes R and L indicated by the arrows. Household H2 is the nearest to H1 on the R semi-plane. The distance between S1 and H1, and the distance between H1 and H2 are denoted by x and y, respectively.
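The geometric construction above can be sketched in Python. The sketch assumes that households are given as (x, y) coordinate pairs and that the semi-plane searched for H2 is the one on the far side of Q from the sampling point, as in the standard T-Square construction; the function name and the return convention are hypothetical.

```python
import math


def t_square_distances(sample_point, households):
    """Return (x, y) for one sampling point.

    x: distance from the sampling point S to its nearest household H1.
    y: distance from H1 to its nearest household H2 in the half-plane
       beyond the line Q (perpendicular to S-H1 at H1), i.e. on the side
       of Q away from S. This choice of half-plane is an assumption.
       y is None if no household lies in that half-plane.
    """
    h1 = min(households, key=lambda h: math.dist(sample_point, h))
    x = math.dist(sample_point, h1)
    # Direction from S to H1; a household h lies beyond Q when the
    # projection of (h - H1) onto this direction is positive.
    ux, uy = h1[0] - sample_point[0], h1[1] - sample_point[1]
    beyond = [
        h for h in households
        if h != h1 and (h[0] - h1[0]) * ux + (h[1] - h1[1]) * uy > 0
    ]
    if not beyond:
        return x, None
    h2 = min(beyond, key=lambda h: math.dist(h1, h))
    return x, math.dist(h1, h2)


if __name__ == "__main__":
    shelters = [(1.0, 0.0), (0.5, 1.0), (3.0, 0.0)]
    print(t_square_distances((0.0, 0.0), shelters))
```

In this example the household at (0.5, 1.0) is closer to H1 than the one at (3.0, 0.0), but it lies on the same side of Q as the sampling point and is therefore excluded from the search for H2.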
In Equation (I.1), NA and NB are respectively the number of households in regions A and B, and λ is the density (number of households per unit area) of the underpinning Poisson process and the parameter to be estimated.
Of course, the principal assumption of the T-Square method is very restrictive in the context of human population estimates. There are several statistical tests available to test for complete randomness of spatial point patterns [9, 12–14, 28–31]. The relaxation of this assumption has implications for the robustness of the method (see below) used to estimate λ.
It follows from Equation (I.2) that the random variable ω defined by ω = 2πλx² is chi-square (χ²) distributed with 2 degrees of freedom.
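The distributional claim can be checked numerically with a small Monte Carlo sketch. As a simplifying assumption it places a fixed number of households uniformly in the unit square (a binomial rather than a Poisson process, which is a close approximation at this density) and measures the distance x from the centre of the square to the nearest household; a χ² variable with 2 degrees of freedom has mean 2 and median 2 ln 2.

```python
import math
import random


def simulate_omega(density=200, n_reps=2000, seed=1):
    """Simulate omega = 2 * pi * lambda * x**2 for a central sampling point.

    Assumption: 'density' households are placed uniformly at random in the
    unit square, so lambda = density. The sampling point is the centre,
    which keeps edge effects negligible at this density.
    """
    rng = random.Random(seed)
    centre = (0.5, 0.5)
    omegas = []
    for _ in range(n_reps):
        x = min(
            math.dist(centre, (rng.random(), rng.random()))
            for _ in range(density)
        )
        omegas.append(2 * math.pi * density * x ** 2)
    return omegas


if __name__ == "__main__":
    omegas = simulate_omega()
    mean = sum(omegas) / len(omegas)
    above_median = sum(w > 2 * math.log(2) for w in omegas) / len(omegas)
    print(round(mean, 2), round(above_median, 2))  # expected near 2 and 0.5
```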
If we selected the households arbitrarily, instead of the sampling points, and measured the distance between each selected household and its nearest neighbour, this distance will have the same pdf as x. However, households cannot be selected arbitrarily without enumeration of these households.
where η is the average area per household.
where κ is the average household population and Γ is the total area of region Ω.
This section describes a simple objective function which has been used in practice to determine sample size requirements in cluster surveys on provision of water, sanitation and hygiene. The cluster surveys used a two-stage sampling approach. In the first stage the primary sampling units (PSUs) were selected with a probability proportional to their size. In the second stage a simple random sample of size b was taken from each PSU, where b is the number of basic sampling units (BSUs) selected within each PSU. b is also known as the 'take'.
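The two-stage design described above can be sketched in Python. The sketch assumes that PSUs are drawn with replacement with probability proportional to the number of BSUs they contain; the source does not state which probability-proportional-to-size procedure was used, and the data structure and function name are hypothetical.

```python
import random


def two_stage_sample(psu_units, n_psus, take, seed=0):
    """Two-stage cluster sample.

    psu_units: dict mapping a PSU name to the list of its BSUs.
    Stage 1: draw n_psus PSUs with probability proportional to size
             (with replacement, an assumed simplification).
    Stage 2: draw a simple random sample of 'take' BSUs (b in the text)
             without replacement from each selected PSU.
    """
    rng = random.Random(seed)
    names = list(psu_units)
    sizes = [len(psu_units[name]) for name in names]
    selected = rng.choices(names, weights=sizes, k=n_psus)
    sample = []
    for name in selected:
        units = psu_units[name]
        # A PSU smaller than the take contributes all of its BSUs.
        sample.append((name, rng.sample(units, min(take, len(units)))))
    return sample


if __name__ == "__main__":
    camps = {
        "A": list(range(50)),
        "B": list(range(100, 130)),
        "C": list(range(200, 205)),
    }
    for name, units in two_stage_sample(camps, n_psus=4, take=10):
        print(name, units)
```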
- Myatt M, Feleke T, Sadler K, Collins S: A field trial of a survey method for estimating the coverage of selective feeding programmes. Bull World Health Organ. 2005, 83 (1): 20-26.
- Brooker S, Kabatereine NB, Myatt M, Russell Sothard J, Fenwick A: Rapid assessment of schistosoma mansoni: the validity, applicability and cost-effectiveness of the Lot Quality Assurance Sampling Method in Uganda. Trop Med Int Health. 2005, 10 (7): 647-658. 10.1111/j.1365-3156.2005.01446.x
- Luan R, Zeng G, Zhang D, Lou L, Yuan P, Liang P, Li Y: A study on methods of estimating the population size of men who have sex with men in Southwest China. European Journal of Epidemiology. 2005, 20: 581-585. 10.1007/s10654-005-4305-4
- Chao A, Tsay PK, Lin SH, Shau WY, Chao DY: The applications of capture-recapture models to epidemiological data. Statist Med. 2001, 20: 3123-3157. 10.1002/sim.996
- Martsolf DS, Courey TJ, Chapman TR, Draucker CB, Mims BL: Adaptive sampling: recruiting a diverse community sample of survivors of sexual violence. J Community Health Nurs. 2006, 23 (3): 169-182. 10.1207/s15327655jchn2303_4
- Grais RF, Coulombier D, Ampuero J, Lucas MES, Barretto AT, Jacquier G, Diaz F, Balandine S, Mahoudeau C, Brown V: Are rapid population estimates accurate? A field trial of two different assessment methods. Disasters. 2006, 30 (3): 364-376. 10.1111/j.0361-3666.2005.00326.x
- Catana AJ: The wandering quarter method of estimating population density. Ecology. 1963, 44: 349-360. 10.2307/1932182
- Bostoen K, Chalabi Z: Optimising household survey sampling without sample frames. International Journal of Epidemiology. 2006, 35 (3): 751-755. 10.1093/ije/dyl019
- Besag J, Gleaves JT: On the detection of spatial pattern in plant communities. Bulletin of the International Statistical Institute. 1973, 45 (1): 153-158.
- Diggle PJ: Robust density estimation using distance methods. Biometrika. 1975, 62 (1): 39-48. 10.1093/biomet/62.1.39
- Diggle PJ: The detection of random heterogeneity in plant populations. Biometrics. 1977, 33: 390-394. 10.2307/2529790
- Diggle PJ: Statistical methods for spatial point patterns in ecology. In Spatial and temporal analysis in ecology. Edited by: Cormack RM, Ord JK. Fairland, Maryland, International Co-operative Publishing House; 1979.
- Diggle PJ: Statistical analysis of spatial point processes. Second edition. London, Arnold; 2003.
- Diggle PJ, Besag J, Gleaves JT: Statistical analysis of spatial point patterns by means of distance methods. Biometrics. 1976, 32: 659-667. 10.2307/2529754
- Young LJ, Young H: Statistical ecology: a population perspective. Boston, Kluwer Academic Publishers; 1998.
- Brown V, Jacquier G, Coulombier D, Balandine S, Belanger F, Legros D: Rapid assessment of population size by area sampling in disaster situations. Disasters. 2001, 25 (2): 164-171. 10.1111/1467-7717.00168
- Noji EK: Estimating population size in emergencies. Bulletin of the World Health Organization. 2005, 83 (3): 164.
- Buckland ST, Anderson DR, Burnham KP, Laake JL: Distance sampling: estimating abundance of biological populations. London, Chapman and Hall; 1993.
- Buckland ST, Anderson DR, Burnham KP, Laake JL, Borchers DL, Thomas L: Advanced distance sampling. Estimating abundance of biological populations. Oxford, Oxford University Press; 2004.
- Wolfram S: Mathematica, Fifth Edition. Champaign IL, Cambridge University Press; 2003.
- Lawler EL, Lenstra JK, Rinnooy Kan AHG, Shmoys DB: The traveling salesman problem. A guided tour of combinatorial optimization. Chichester, John Wiley & Sons; 1985.
- Moon C, Kim J, Choi G, Seo Y: An efficient genetic algorithm for the traveling salesman problem with precedence constraints. European Journal of Operational Research. 2002, 140: 606-617. 10.1016/S0377-2217(01)00227-2
- Snyder LV, Daskin MS: A random-key genetic algorithm for the generalized traveling salesman problem. European Journal of Operational Research. 2006, 174: 38-53. 10.1016/j.ejor.2004.09.057
- Kripfganz J, Perlt H: Operations Research 3.1. A Mathematica application package. Leipzig, SoftAS Gmbh; 2005.
- Pham DT, Karaboga D: Intelligent optimization techniques. Genetic algorithms, Tabu search, simulated annealing and neural networks. London, Springer-Verlag; 2000.
- Nemhauser GL, Wolsey LA: Integer and combinatorial optimization. New York, John Wiley & Sons; 1999.
- Simulated Annealing. http://www.cs.sandia.gov/opt/survey/sa.html
- Byth K, Ripley BD: On sampling spatial patterns by distance methods. Biometrics. 1980, 36: 279-284. 10.2307/2529979
- Cormack RM: The invariance of Cox and Lewis's statistic for the analysis of spatial patterns. Biometrika. 1977, 64 (1): 143-144. 10.2307/2335785
- Hines WGS, O'Hara Hines RJ: The Eberhardt statistic and the detection of nonrandomness of spatial point distributions. Biometrika. 1979, 66 (1): 73-79. 10.1093/biomet/66.1.73
- Holgate P: Tests of randomness based on distance methods. Biometrika. 1965, 52 (3-4): 345-353. 10.1093/biomet/52.3-4.345
- Bennett S, Radalowicz A, Vella A, Tomkins A: A computer simulation of household sampling schemes for health surveys in developing countries. International Journal of Epidemiology. 1994, 23 (6): 1282-1291. 10.1093/ije/23.6.1282
- Kish L: Survey sampling. New York, John Wiley & Sons; 1965.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.