Inter-rater variability of ultrasound scan measurements: balanced incomplete block design

Introduction: To assess inter-rater variability of ultrasound scan measurements for determining period of gestation by three raters, applying a balanced incomplete block design.
Methodology: Twelve pregnant women who attended the field antenatal clinics underwent ultrasound scan measurement of fetal bi-parietal diameter (BPD), femur length (FL), abdominal circumference (AC) and head circumference (HC) between 15 and 24 weeks of gestation. Each pregnant woman was scanned by two of the three raters using the same machine, each rater being blind to the measurements made by the other. A balanced incomplete block design was generated and data were analyzed using ANOVA.
Results: There was no statistically significant variation among raters in measuring BPD (F = 0.68; p = 0.53), AC (F = 1.99; p = 0.19) and HC (F = 0.06; p = 0.94). There was statistically significant variation among raters in measuring FL (F = 7.4; p = 0.01).
Conclusion: Statistically significant inter-rater differences were observed only for measurements of FL. However, although the inter-rater differences in mean abdominal and head circumferences were not statistically significant, their variance can be clinically significant.


Introduction
One of the uses of ultrasound scan measurements is to estimate the period of gestation and, based on that, the expected date of delivery. The usual measurements made for this purpose are bi-parietal diameter, femur length, abdominal circumference and head circumference. These measurements are then converted into period of gestation by applying the suitable regression model for each measurement. According to the literature, more accurate measurements are possible when the ultrasound measurements are done between the 15th and 24th weeks of gestation.1 However, variations in measurements that occur when they are carried out by several raters may adversely affect the management of pregnancy and its complications.

1. Senior Lecturer, Department of Public Health, Faculty of Medicine, University of Kelaniya, Ragama, Sri Lanka

There are several methods of assessing inter-rater reliability. A Latin square design was applied to assess observer variability in anthropometry by 16 field workers using eight children.5 Another study applied a nested Latin square design to determine the inter- and intra-rater reliability of three physiotherapists who independently rated pain on a visual analogue scale in 33 subjects on three days in a randomized order.3 A balanced incomplete block design was used to assess inter-rater reliability of the Vancouver Sedative Recovery Scale by 16 raters using 16 children.4 This design has an efficiency index of 0.89 relative to a completely crossed design (in which each of the 16 raters would rate each of the 16 children).4 A balanced incomplete block design is indicated for comparing the raters' mean levels of rating and for checking whether each mean is estimated with the same precision.2 The advantage of this method is that not every rater needs to rate every subject.2 The objective of this study was to assess inter-rater variability of ultrasound scan measurements for determining period of gestation by three raters, applying a balanced incomplete block design.

Methods
Twelve pregnant women who attended the field antenatal clinics were invited to participate. Each pregnant woman was asked to come to the Colombo North Teaching Hospital, Ragama, for ultrasound scan measurements on two consecutive days during the 15th to 24th weeks of gestation. Each pregnant woman was scanned by two of the three raters, who were consultant obstetricians. Bi-parietal diameter, femur length, abdominal circumference and head circumference were measured. The second rater was blind to the measurements made by the first rater. All measurements were done using the same ultrasound scan machine.
A balanced incomplete block design was generated 6 (Table 1) with the following features. The three raters (I, II, III) were paired as I and II, II and III, and I and III. Each block (participant) was rated by only one pair, and each pair together rated four blocks (for example, raters I and II together rated four blocks, namely 1, 4, 7 and 10). Thus the three pairs covered all 12 blocks with no overlap between pairs. Each rater assessed eight blocks and therefore appeared eight times in the design. Statistical analysis was conducted by ANOVA using the General Linear Model procedure in Minitab 14.
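The design described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the text specifies the blocks for one pair only (raters I and II on blocks 1, 4, 7 and 10), so the blocks given here to the other two pairs are hypothetical, and all function and variable names are illustrative rather than taken from the study. The efficiency factor uses the standard formula for a balanced incomplete block design, E = λv/(rk); the text does not say whether the efficiency index quoted from the literature is defined in the same way.

```python
from collections import Counter
from itertools import combinations

RATERS = ("I", "II", "III")
N_BLOCKS = 12    # participants (blocks)
BLOCK_SIZE = 2   # raters per participant


def build_design(raters=RATERS, n_blocks=N_BLOCKS):
    """Map each block number (1..n_blocks) to the rater pair that scans it.

    Assumption: pairs are cycled in the order (I, II), (I, III), (II, III).
    This gives (I, II) blocks 1, 4, 7 and 10 as in the text; the blocks
    allotted to the other two pairs are hypothetical.
    """
    pairs = list(combinations(raters, BLOCK_SIZE))
    return {b: pairs[(b - 1) % len(pairs)] for b in range(1, n_blocks + 1)}


def design_parameters(design, raters=RATERS):
    """Return (v, b, k, r, lam); raise ValueError if the design is unbalanced."""
    v, b = len(raters), len(design)
    block_sizes = {len(block) for block in design.values()}
    # r: how many blocks each rater appears in.
    reps = Counter(rater for block in design.values() for rater in block)
    # lambda: how many blocks each pair of raters shares.
    together = Counter(
        frozenset(p) for block in design.values() for p in combinations(block, 2)
    )
    rep_counts = {reps[rater] for rater in raters}
    pair_counts = {together[frozenset(p)] for p in combinations(raters, 2)}
    if len(block_sizes) != 1 or len(rep_counts) != 1 or len(pair_counts) != 1:
        raise ValueError("design is not balanced")
    return v, b, block_sizes.pop(), rep_counts.pop(), pair_counts.pop()


def efficiency_factor(v, k, r, lam):
    """Standard efficiency factor of a balanced incomplete block design."""
    return lam * v / (r * k)


design = build_design()
v, b, k, r, lam = design_parameters(design)
print(f"v={v} raters, b={b} blocks, k={k} per block, r={r}, lambda={lam}")
print(f"scans needed: {b * k} (fully crossed design: {b * v})")
print(f"efficiency factor: {efficiency_factor(v, k, r, lam):.2f}")
```

For this design the check gives v = 3, b = 12, k = 2, r = 8 and λ = 4, that is, 24 scans instead of the 36 a fully crossed design would need, with an efficiency factor of 0.75.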

[Table: Subjects, Rater I, Rater II, Rater III]

Results
[...] 4). A statistically significant variation was not observed among the three raters for any of the above three measurements: bi-parietal diameter (F = 0.68; p = 0.53), abdominal circumference (F = 1.99; p = 0.19) and head circumference (F = 0.06; p = 0.94).
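As a consistency check, the reported p values can be recomputed from the F statistics. The sketch below assumes the usual intra-block ANOVA for this design: 24 observations give 23 total degrees of freedom, of which 11 go to participants (blocks) and 2 to raters, leaving 10 for error. These degrees of freedom are not stated in the text and are an assumption. With 2 numerator degrees of freedom, the upper tail of the F distribution has the closed form P(F > f) = (1 + 2f/d)^(-d/2), where d is the error degrees of freedom, so no statistics library is needed. The femur length values are those given in the abstract; the function name is illustrative.

```python
N_OBS = 24      # 12 participants x 2 raters each
N_BLOCKS = 12   # participants
N_RATERS = 3

# Assumed error degrees of freedom: total - blocks - raters = 23 - 11 - 2.
DF_ERROR = (N_OBS - 1) - (N_BLOCKS - 1) - (N_RATERS - 1)


def f_upper_tail_df1_2(f_stat, df_error):
    """P(F > f_stat) for an F distribution with (2, df_error) degrees of freedom."""
    if f_stat < 0 or df_error <= 0:
        raise ValueError("f_stat must be >= 0 and df_error must be > 0")
    return (1.0 + 2.0 * f_stat / df_error) ** (-df_error / 2.0)


# (F statistic, reported p value) as given in the paper.
REPORTED = {
    "BPD": (0.68, 0.53),
    "FL": (7.4, 0.01),
    "AC": (1.99, 0.19),
    "HC": (0.06, 0.94),
}

for name, (f_stat, p_reported) in REPORTED.items():
    p = f_upper_tail_df1_2(f_stat, DF_ERROR)
    print(f"{name}: F = {f_stat}, recomputed p = {p:.3f}, reported p = {p_reported}")
```

Under these assumptions, the recomputed values round to the reported p values for all four measurements.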

Discussion
The study showed that bi-parietal diameter, abdominal circumference and head circumference were more reliable measures for predicting period of gestation than femur length (FL). The difference between the lowest and the highest mean bi-parietal diameter of two raters was 1.2 mm, which is also not clinically significant when converted to period of gestation. Even though there were no statistically significant differences in mean abdominal and head circumference measurements among the three raters, the differences between the lowest and the highest mean abdominal and head circumferences were 22 mm (144 - 122 mm) and 25 mm (175.6 - 150.6 mm) respectively. For each measurement, these differences correspond to a difference of two weeks in the period of gestation, which may have a greater clinical significance.
Further, our study found a statistically significant variation among the three raters in measuring femur length. The difference between the lowest and the highest rater mean for femur length was 7.5 mm, which corresponds to a difference of approximately two weeks in period of gestation.1 One study found that the correlation coefficient of gestational age versus fetal femur length is statistically greater than that of gestational age versus fetal bi-parietal diameter.7 This study suggested that the fetal femur length was a more precise index of gestational age than the bi-parietal diameter.7,8 Another study reported that even for mothers between 19 and 32 completed weeks of gestation there were no statistically significant differences in femur length versus gestational age between the various racial categories.9
The incomplete block design enabled three raters to assess 12 pregnant mothers and had two major advantages. It avoided the need for a large group of participants in a reliability study, which saves cost. It also minimized the ethical problems and the inconvenience caused by using a larger number of participants, each of whom would need to be scanned three times within the same week. For a Latin square design at least five mothers should be scanned by five raters. Each participant would therefore have to come for a scan five times, which would have been inconvenient for both participants and raters.

Conclusion
Statistically significant inter-rater differences were observed only for measurements of FL. However, although the inter-rater differences in mean abdominal and head circumferences were not statistically significant, their variance can be clinically significant.