Two fundamental issues arise when administrative data are used for statistical purposes, and both concern how well the administrative source matches the statistical purpose. These might be termed coverage and content.
Handling quality issues such as coverage and content
For both coverage and content issues, a key question is whether a statistic derived from administrative data is available alongside other official statistics or is effectively the only source of information on an issue. In the former case users can triangulate the two or more sources, which leads to a more nuanced and better understanding of the relative merits and information content of each source.
In the latter case, where there is no alternative, the producers of statistics derived from administrative data arguably have an even greater responsibility to explore the differences between the data and the statistical target. We understand reported crime statistics in the context of the crime survey, but imagine if there were no crime survey: how could users place reliance on the reported crime figures alone?
The concept of the target population for which a statistical estimate is produced is fundamental to the way that official statisticians think about the quality of any estimate. If the data do not reflect the target population, then bias is always a concern.
‘Coverage’ also includes the general issue that the statistical target population and the administrative purpose may have different foci. This results in the usual concerns of under-coverage and duplication for the statistical purpose, even if the administrative database is absolutely perfect for its own purpose.
In practice the administrative data are never perfect, and this adds to the coverage issue. Whether the coverage problem matters will depend on the size of the problem and on whether the missing or duplicated individuals have the potential to seriously bias the statistics produced. The same administrative data may be suitable for one statistical purpose but less so for another.
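The dependence on both the size of the gap and the distinctiveness of the missing group can be sketched numerically. The following is a minimal illustration, assuming the statistic of interest is a population mean and considering under-coverage only; the function name and all figures are hypothetical and are not drawn from any real source.

```python
def undercoverage_bias(missing_share, mean_covered, mean_missing):
    """Bias of the covered-population mean as an estimate of the full mean.

    With w = missing_share:
        full mean = (1 - w) * mean_covered + w * mean_missing
        bias      = mean_covered - full mean
                  = w * (mean_covered - mean_missing)
    """
    if not 0.0 <= missing_share <= 1.0:
        raise ValueError("missing_share must be between 0 and 1")
    return missing_share * (mean_covered - mean_missing)


# Hypothetical figures: a small gap among similar people is nearly harmless,
# while a large gap among very different people is not.
small_gap_similar = undercoverage_bias(0.02, mean_covered=10.0, mean_missing=9.5)
large_gap_different = undercoverage_bias(0.20, mean_covered=10.0, mean_missing=6.0)
```

The bias vanishes if either factor is zero, which is why the same administrative source can be adequate for one variable and inadequate for another.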
For example, it is entirely legitimate to produce statistics about a publicly funded public service (e.g. the NHS), if only to monitor the performance of the service and to ensure that it is performing properly. The activities within the NHS, the various medical interventions provided and their outcomes are all of legitimate concern.
However, if the target population of concern is the overall health provision to the nation, then the absence of data from private medical providers might be a significant defect for some types of intervention such as hip and knee replacements, varicose vein surgery or some forms of open heart surgery.
Moreover, even if the statistical target is publicly funded health provision, excluding private suppliers working to NHS service provision contracts (perhaps because the data are treated as commercial in confidence) means that the overall provision funded from public funds is not reflected in the statistics.
Thus, any consideration of whether the administrative data are fit for purpose depends on a clear understanding of the target population and how it relates to the data. It is vital that the target population is conveyed clearly to the public when statistics are produced and that any shortfalls in the derived statistics are properly explained.
If sub-national statistics are to be produced, then the quality of geographic identifiers on the administrative data comes into question. If these are rarely, or only belatedly, updated as people move, then the administrative record will be associated with the wrong location. This results in under-coverage at the true current location and over-coverage at the false historic location. It is the level at which the target population is defined that determines the quality of the statistics derived from the administrative data.
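The effect of out-of-date geographic identifiers can be sketched as follows. This is a minimal illustration, assuming that each person has one recorded area and one true current area; the function name, the area labels and the records are all hypothetical.

```python
from collections import Counter


def area_coverage_errors(records):
    """Net coverage error per area when recorded locations lag behind moves.

    records: iterable of (recorded_area, true_area) pairs, one per person.
    Returns {area: recorded_count - true_count}. A positive value is
    over-coverage of that area and a negative value is under-coverage.
    """
    records = list(records)
    recorded = Counter(rec for rec, _ in records)
    actual = Counter(true for _, true in records)
    areas = set(recorded) | set(actual)
    return {area: recorded[area] - actual[area] for area in sorted(areas)}


# Hypothetical register: two people have moved from area A to area B
# but their records still show A.
people = [("A", "A"), ("A", "A"), ("A", "B"), ("B", "B"), ("B", "B"), ("A", "B")]
errors = area_coverage_errors(people)
national_error = sum(errors.values())
```

In this example area A is over-covered by two people and area B is under-covered by two, yet the errors cancel at national level. The same register is therefore sound for a national count and unsound for an area-level one.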
Even if the population units are conceptually well covered, there is the question of whether or not the data items fully reflect the total population of relevant items. Perverse incentives can lead the officials responsible for recording events to fail in their duty. There are many examples of this in police recorded crime, hospital waiting lists and school exclusions. However, under-coverage of data items can also arise from under-reporting. For crime, even if the police recorded every report brought to their attention, the recorded crime data would still be seriously deficient for some types of crime (internet fraud, or youth-on-youth crime such as stealing mobile phones) because the public do not bother to report these to the police.
A third way in which coverage can be seriously affected is if the set of units treated as reflecting a certain target population is distorted by perfectly legitimate policy decisions. This situation arose in the late 1980s and early 1990s, when the only monthly measure of unemployment was derived from the claimant count. Unemployment as a measure of public expenditure on unemployment-related benefits is one thing, but as a measure of labour market capacity and related macro-economic indicators it is another. Many people, particularly women, were effectively excluded from the counts because of their ineligibility for unemployment benefit.
If the nature or extent of the administrative data changes over time, then all of the issues above will affect the consistency of time series.
On content, the key concern is whether or not the variable on the administrative data reflects the concept required for the statistical purpose (which is not necessarily the administrative one). The claimant count example above also illustrates the problem of changes in content. Every time Government changed the eligibility criteria for unemployment benefit, it changed the claimant count statistics (and the time series). Heroic as the government statisticians’ efforts were to convey the impact of each change, it was the headline figure that dominated public debate, and this resulted in a serious loss of confidence in the whole of the statistical system.
Another aspect of the content variables of administrative data is whether they are to be used to create direct estimates of some statistic or whether they will be used in conjunction with survey data (which measure the desired concept) to greatly increase the quality of the survey estimates by exploiting the size of the administrative database (e.g. ratio and regression estimation, post-stratification, calibration estimation etc.). The second approach is perfectly acceptable, whilst direct use raises questions about what concept the statistics derived from the administrative data are measuring and to what extent this differs from the statistical intention, either overall or for subgroups.
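The second, survey-assisted use can be sketched with the simplest of the techniques listed, the ratio estimator. This is a minimal illustration, assuming a simple random sample in which each sampled unit has both the survey-measured concept and the administrative variable; the function name and the figures are hypothetical.

```python
def ratio_estimate_total(survey_pairs, admin_total):
    """Ratio estimate of the population total of the survey concept y.

    survey_pairs: (y, x) for each sampled unit, where y is the concept
                  measured by the survey and x is the administrative
                  variable held for the same unit.
    admin_total:  total of x over the whole administrative database.

    The estimate is admin_total * (sum of y) / (sum of x), so the
    administrative data supply the scale and the survey supplies the
    relationship between the administrative variable and the concept.
    """
    pairs = list(survey_pairs)
    sum_y = sum(y for y, _ in pairs)
    sum_x = sum(x for _, x in pairs)
    if sum_x == 0:
        raise ValueError("administrative variable sums to zero in the sample")
    return admin_total * sum_y / sum_x


# Hypothetical sample in which the survey concept runs about 10% above
# the administrative variable.
sample = [(12.0, 10.0), (23.0, 20.0), (31.0, 30.0)]
estimate = ratio_estimate_total(sample, admin_total=6000.0)
```

Using the administrative total directly (6000 here) would measure the administrative concept; the ratio adjustment moves the estimate towards the concept the survey measures while retaining the precision gained from the full administrative database.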