The degree of agreement, or the coefficient of determination $R^2$, is calculated as the ratio of the explained variance to the total variance of $Y$:

$$R^2 = \frac{\sum_i (\hat{y}_i - \bar{y})^2}{\sum_i (y_i - \bar{y})^2} = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$$

Evaluating this coefficient is a necessary part of assessing a method comparison because it shows the fit of the model: the closer the coefficient gets to 1, the better the regression line fits the actual data points.
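As a minimal illustration with hypothetical paired results (x from the comparison method, y from the test method):

```python
import numpy as np
from scipy import stats

# Hypothetical paired results: x = comparison method, y = test method
x = np.array([2.1, 3.4, 5.0, 7.2, 9.8, 12.1])
y = np.array([2.3, 3.5, 5.2, 7.0, 10.1, 12.4])

res = stats.linregress(x, y)           # ordinary least-squares fit of y on x
print(f"R^2 = {res.rvalue ** 2:.4f}")  # coefficient of determination
```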
However, it must be noted that significant bias may exist even when the coefficient is very close to 1. For laboratory medicine purposes, we should aim for an $R^2$ as close to 1 as possible. To check the means we run a paired t-test, and to check the variances we run an F-test. The paired t-statistic can be calculated as

$$t = \frac{\bar{d}}{S_d/\sqrt{n}},$$

where $\bar{d}$ is the mean of the paired differences, $S_d$ their standard deviation, and $n$ the number of pairs. To determine the significance of the result (the p-value), the t-statistic is looked up in a t table at the corresponding degrees of freedom; the degrees of freedom in a paired t-test equal $n - 1$.
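A minimal sketch of the paired t-test on the same hypothetical data:

```python
import numpy as np
from scipy import stats

# Hypothetical paired results: x = comparison method, y = test method
x = np.array([2.1, 3.4, 5.0, 7.2, 9.8, 12.1])
y = np.array([2.3, 3.5, 5.2, 7.0, 10.1, 12.4])

t_stat, p_value = stats.ttest_rel(y, x)  # paired t-test, df = n - 1
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
```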
A t-test with a significant p-value signifies the presence of a significant bias between the means of the two methods. The next step would then be to determine whether the systematic error represents a constant bias or a proportional bias.
This can be done by examining the regression curve or equation: the presence of a non-zero intercept signifies a constant bias, while a slope other than 1 signifies proportional error. Correcting a constant bias is simple, requiring only that the constant be added to the measurement results. Correcting a proportional bias, however, requires a recovery experiment, as described in Section 3. The F-test compares the expected variance of the values to the observed variance; while the t-test compares the centroid of the data points (the mean), the F-test deals with their dispersion (the variance).
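One common way to operationalize this check is through confidence intervals for the fitted slope and intercept; a sketch, again with hypothetical paired results:

```python
import numpy as np
from scipy import stats

x = np.array([2.1, 3.4, 5.0, 7.2, 9.8, 12.1])
y = np.array([2.3, 3.5, 5.2, 7.0, 10.1, 12.4])

res = stats.linregress(x, y)
t_crit = stats.t.ppf(0.975, len(x) - 2)  # 95% two-sided critical value

slope_ci = (res.slope - t_crit * res.stderr,
            res.slope + t_crit * res.stderr)
icept_ci = (res.intercept - t_crit * res.intercept_stderr,
            res.intercept + t_crit * res.intercept_stderr)
print(f"slope CI: {slope_ci}")      # CI excluding 1 -> proportional bias
print(f"intercept CI: {icept_ci}")  # CI excluding 0 -> constant bias
```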
The t-test is more sensitive to differences in the middle of the data range, while the F-test is more sensitive to differences at the extremes of the data range. A significant F-test signifies random error in the measurement, in other words imprecision. The F-statistic is calculated with the following equation (the larger of the two variances is always the numerator and the smaller one the denominator):

$$F = \frac{S_1^2}{S_2^2}$$
The degrees of freedom of the F-test are $(n_1 - 1,\ n_2 - 1)$, and the significance threshold can be looked up in an F table at the corresponding degrees of freedom. It is important to perform the F-test prior to the t-test: one of the basic assumptions of the t-test is that the standard deviations of the data are similar between the two groups, i.e., that the variances are approximately equal.
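scipy has no ready-made variance-ratio F-test, so a minimal sketch (with the same hypothetical arrays) is:

```python
import numpy as np
from scipy import stats

x = np.array([2.1, 3.4, 5.0, 7.2, 9.8, 12.1])
y = np.array([2.3, 3.5, 5.2, 7.0, 10.1, 12.4])

# Variance-ratio F-test; the larger variance goes in the numerator
var_x, var_y = np.var(x, ddof=1), np.var(y, ddof=1)
F = max(var_x, var_y) / min(var_x, var_y)
df1 = df2 = len(x) - 1
p_value = min(1.0, 2 * stats.f.sf(F, df1, df2))  # two-sided p-value
print(f"F = {F:.3f}, p = {p_value:.4f}")
```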
In the presence of significant imprecision, the determination of a significant bias should be done using the Cochran variant of the t-test. In the Cochran variant, the standard deviations cannot be pooled between the two groups:

$$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{S_1^2/n_1 + S_2^2/n_2}}$$
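In code, the unpooled statistic above is what scipy computes when equal variances are not assumed; note that scipy applies Welch's degrees-of-freedom approximation rather than the Cochran critical-value adjustment, so this is a close stand-in rather than the exact Cochran procedure:

```python
import numpy as np
from scipy import stats

x = np.array([2.1, 3.4, 5.0, 7.2, 9.8, 12.1])
y = np.array([2.3, 3.5, 5.2, 7.0, 10.1, 12.4])

# Unpooled (unequal-variance) t-test; Welch's df approximation stands in
# for the Cochran critical-value adjustment
t_stat, p_value = stats.ttest_ind(x, y, equal_var=False)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
```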
Accuracy profiling has moved away from treating bias and imprecision as separate entities. In fact, most guidelines, whether based on total error principles or on measurement uncertainty principles, combine bias and imprecision in their acceptability criteria.
To calculate bias and imprecision, we need to run a reproducibility study. Reproducibility of a quantitative method is obtained by repeated measurements of a sample within a series, and by then conducting multiple such series. The overall bias is the difference between the mean value of the analyte obtained from the repeated measurements and the reference value:

$$\text{bias} = \bar{x} - x_{\text{ref}}$$

Bias and imprecision are used to form the tolerance interval: the interval within which, with a determined degree of confidence, a specified proportion of results for a sample will fall.
The tolerance interval can be expressed as

$$\text{TI} = \text{bias} \pm k \cdot s,$$

where $k$ is a coverage factor and $s$ an estimate of the method's imprecision. For laboratory medicine, the tolerance interval of an analyte needs to be smaller than the acceptability limits. The important quantity from intermediate precision needed in the calculation of the tolerance interval is the standard deviation of reproducibility, $S_R$, which combines the within-series and between-series variance components:

$$S_R = \sqrt{S_r^2 + S_B^2}$$

An advantage of calculating the intermediate precision is that we can use it in combination with the within-series repeatability to determine the uncertainty of bias.
$S_r^2$ is the within-series repeatability and can be calculated using the following equation (for $p$ series of $n$ replicates, with $\bar{x}_i$ the mean of series $i$):

$$S_r^2 = \frac{1}{p(n-1)} \sum_{i=1}^{p} \sum_{j=1}^{n} (x_{ij} - \bar{x}_i)^2$$

The uncertainty of bias is essentially one component of intermediate precision. The between-series reproducibility is used in the calculation of the Mee factor $K_s$, which is the other component of intermediate precision. Since the calculation of the Mee factor is complicated, it is broken down into a series of steps: first calculate the $H$ ratio, then calculate $G^2$, and finally multiply $C$ by the t-score associated with the degrees of freedom (dof).
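As a minimal sketch with hypothetical numbers (the Mee factor equations are not reproduced above, so only the bias and variance components are computed):

```python
import numpy as np

# Hypothetical reproducibility study: p = 4 series, n = 3 replicates each
runs = np.array([[5.1, 5.3, 5.2],
                 [5.4, 5.5, 5.3],
                 [5.0, 5.2, 5.1],
                 [5.3, 5.2, 5.4]])
p, n = runs.shape
reference = 5.0  # assumed reference value

bias = runs.mean() - reference
# Pooled within-series variance S_r^2
s_r2 = ((runs - runs.mean(axis=1, keepdims=True)) ** 2).sum() / (p * (n - 1))
# Between-series variance from the one-way ANOVA decomposition
ms_between = n * ((runs.mean(axis=1) - runs.mean()) ** 2).sum() / (p - 1)
s_b2 = max(0.0, (ms_between - s_r2) / n)
s_R = (s_r2 + s_b2) ** 0.5  # standard deviation of reproducibility

print(f"bias = {bias:.3f}, S_r = {s_r2 ** 0.5:.3f}, S_R = {s_R:.3f}")
```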
By calculating the Mee factor and the standard deviation of reproducibility, we can now obtain the intermediate precision and rewrite the tolerance interval as [12]

$$\text{TI} = \text{bias} \pm K_s \cdot S_R$$

The problem with simple linear regression is that it is based on a set of assumptions; one of the problematic assumptions is that the standard deviation of the random error is constant throughout the measurement range. This assumption is often wrong, as the standard error of measurement is typically much larger near the extremes of the range, i.e., near the limit of detection and at the highest range of linearity.
The solution in laboratory medicine can be to run linearity experiments and limit the measurement range based on the linearity results. Even so, the effect of random variation on the regression line remains. To rectify this, one solution is to employ a weighting procedure.
The simplest weighting procedure is to use the standard deviation of variation for each data point of the method comparison study. This requires that the method comparison study be repeated multiple times, allowing the standard deviation of measurement, $S_i$, to be calculated for each point. The weighting coefficient is then the inverse of this standard deviation:

$$w_i = \frac{1}{S_i}$$
This weight can then be incorporated into the equations of the method comparison. For example, the $r$ coefficient can be recalculated as the weighted correlation

$$r_w = \frac{\sum_i w_i (x_i - \bar{x}_w)(y_i - \bar{y}_w)}{\sqrt{\sum_i w_i (x_i - \bar{x}_w)^2 \sum_i w_i (y_i - \bar{y}_w)^2}},$$

where $\bar{x}_w$ and $\bar{y}_w$ are the weighted means. Weighting can considerably decrease the bias percentage compared to non-weighted regression, especially at the extremes of measurement. Weighting by the inverse of the standard deviation tends to normalize the relative bias at the extremes of measurement, while weighting by the inverse of the variance tends to favor bias correction at the lower end of the measurement range (less bias at lower concentrations).
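A sketch of a weighted fit with numpy, using hypothetical points and replicate-derived standard deviations; note that numpy's polyfit squares its weight argument, so the square root of the desired $w_i$ is passed:

```python
import numpy as np

# Hypothetical method-comparison points and their replicate-derived SDs
x = np.array([2.1, 3.4, 5.0, 7.2, 9.8, 12.1])
y = np.array([2.3, 3.5, 5.2, 7.0, 10.1, 12.4])
s_i = np.array([0.05, 0.06, 0.10, 0.15, 0.30, 0.45])

# polyfit minimizes sum((w * residual)**2), so pass sqrt of the desired w_i
w = np.sqrt(1.0 / s_i)        # inverse-SD weighting (w_i = 1/S_i)
# w = np.sqrt(1.0 / s_i**2)   # inverse-variance weighting (w_i = 1/S_i^2)

slope, intercept = np.polyfit(x, y, deg=1, w=w)
print(f"weighted fit: y = {slope:.3f} x + {intercept:.3f}")
```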
To estimate the proportional bias, a recovery experiment is needed. Recovery experiments are performed by calculating the amount of recovery after adding a known amount of the analyte to the sample: the sample is divided into two equal aliquots and both are measured. To one aliquot (aliquot 1), a known amount of the target analyte is added; to the other (aliquot 2), an equal amount of diluent is added, and the measurement is repeated. The recovery percentage can then be calculated as

$$\text{Recovery (\%)} = \frac{C_{\text{aliquot 1}} - C_{\text{aliquot 2}}}{C_{\text{added}}} \times 100$$

The recovery, or bias percentage, is often used in laboratory medicine to state the proportional bias.
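A minimal sketch of this calculation, with hypothetical concentrations:

```python
def recovery_percent(spiked: float, unspiked: float, added: float) -> float:
    """Percent recovery of a known spike: (spiked - unspiked) / added * 100."""
    return (spiked - unspiked) / added * 100.0

# Hypothetical spike: 2.0 mmol/L of analyte added to aliquot 1
print(recovery_percent(spiked=7.8, unspiked=5.9, added=2.0))  # -> 95.0
```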
Most regulatory agencies have set critical values for the recovery percentage of different analytes. The advantage of using the recovery percentage is that it normalizes the bias, allowing for easier understanding of its scale [2]. Up to this point we have discussed bias detection methods that use a reference material or comparator to assess the presence of bias. While this has been the accepted standard for many laboratory regulatory agencies, there are arguments against this approach to bias detection: first, method comparison studies assume that the reference material (control sample) values are true and do not suffer from imprecision.
The measurement uncertainty is considered to be minimal in these samples. Yet, unless these samples closely match the biologic sample matrix, a degree of measurement uncertainty will exist in them, which leads to inaccurate estimates of the bias and imprecision of laboratory instruments and techniques.
On the other hand, running repeated control samples with each run, and the need to revalidate the instrument and techniques after each change in parameters, require a considerable investment of time, labor, and cost.
Alternatively, the systematic error can be determined using patient samples. This can be done either by tracking the results of known normal patients (the Average of Normals, AON) or by tracking moving patient averages. Using patient samples has the advantage of including the inherent biologic uncertainty in the calculation of bias. In this approach, the comparator for quality control would be the average value of the analyte in normal individuals, which requires knowing the population average and standard deviation for that analyte. If we measure the analyte in a normal individual, we expect the result to approximate the population average.
Deviations of the normal results from the expected reference value can signal the presence of a systematic error. In AON, the mean value of the normal samples is compared to a mean reference value. The mean reference value should be established by the laboratory based on the population it serves; this is best done as part of the initial validation of an assay, when a large sample of normal individuals is tested to establish the reference ranges.
With each analytical run, a sample of normal results should be used to calculate the Average of Normals for that run. In the AON method, as the size of the normal sample increases, the probability of detecting bias also increases. To help with these calculations, one can use the Cembrowski nomogram [14] or, alternatively, the methods used in [15]. It is also possible to perform the AON comparison as a two-sample independent t-test, as sketched below.
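A minimal sketch of that t-test variant, with hypothetical reference and run data:

```python
import numpy as np
from scipy import stats

# Hypothetical reference distribution established at validation
reference_normals = np.random.default_rng(0).normal(5.0, 0.4, size=120)
# Normal-patient results from the current analytical run
run_normals = np.array([5.3, 5.5, 5.1, 5.6, 5.4, 5.2, 5.7, 5.3])

# Two-sample t-test: does this run's average of normals differ from reference?
t_stat, p_value = stats.ttest_ind(run_normals, reference_normals, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")  # small p suggests systematic shift
```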
Unlike the AON method, moving patient averages include all the results of an assay in the evaluation of bias. The principle behind moving patient averages is that the samples tested in a laboratory follow a repeating pattern: the overall biologic and clinical spectrum of patients and individuals tested is assumed to be constant across analytical runs. We therefore expect the average result of an assay over two overlapping subsets of patients to be constant.
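A minimal sketch, assuming a hypothetical stream of consecutive patient results and a simple fixed window:

```python
import numpy as np

def moving_average(results: np.ndarray, window: int = 20) -> np.ndarray:
    """Moving average over overlapping windows of consecutive patient results."""
    kernel = np.ones(window) / window
    return np.convolve(results, kernel, mode="valid")

# Hypothetical stream of patient results; a drift in the moving average
# beyond preset control limits would flag systematic error
rng = np.random.default_rng(1)
stream = rng.normal(5.0, 0.8, size=200)
print(moving_average(stream)[:5])
```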
Systematic error can often be avoided by calibrating equipment, but if left uncorrected, it can lead to measurements far from the true value. Random error, by contrast, scatters measurements around the true value: if you take multiple measurements, the values cluster around it. Thus, random error primarily affects precision. Typically, random error affects the last significant digit of a measurement.
The main reasons for random error are limitations of instruments, environmental factors, and slight variations in procedure. Because random error always occurs and cannot be predicted, it is important to take multiple data points and average them to get a sense of the amount of variation and to estimate the true value. Systematic error, on the other hand, is predictable and either constant or proportional to the measurement.
Systematic errors primarily influence a measurement's accuracy. Typical causes of systematic error include observational error, imperfect instrument calibration, and environmental interference. Once its cause is identified, systematic error may be reduced to an extent.
Systematic error can be minimized by routinely calibrating equipment, using controls in experiments, warming up instruments prior to taking readings, and comparing values against standards.
While random errors can be minimized by increasing sample size and averaging data, it is harder to compensate for systematic error. The best way to avoid systematic error is to be familiar with the limitations of instruments and experienced with their correct use.
The accuracy of the t-test generally grows as the size of the systematic error increases, but this trend is not as steady as in Simulation 1: the curve's minimum is not always associated with the lowest level of systematic noise.
At the same time, the accuracy of the K-S test is extremely low in almost all of the considered situations. The sensitivity patterns shown in Figures 7SM, 8SM and 9SM demonstrate that the sensitivity of the three statistical tests grows as the percentage of hits contained in the data increases.
As a final step in our study, we applied the three discussed systematic error detection tests to real HTS data. We examined the impact that the presented methodology would have on the hit selection process in the McMaster Data Mining and Docking Competition Test assay [18]. As in Simulations 1 and 2 carried out with artificial data, we performed two types of analysis: first we studied the raw HTS measurements, and then we calculated and analyzed the hit distribution surfaces of Test assay.
We carried out the t-test on every plate of Test assay, scanning all rows and columns of each plate for the presence of systematic error. In each case, we counted the number of rows and columns in which the test reported the presence of systematic error, as well as the number of plates in which at least one row or column contained systematic error. The collected results are presented in Table 2. They suggest that the number of positives for the row and column effects is almost exactly what we would expect by chance.
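As a sketch of such a scan, assuming a hypothetical 8 × 12 plate and using Welch's t-test to compare each row and column against the rest of the plate (the exact statistic used in the study may differ):

```python
import numpy as np
from scipy import stats

def scan_plate(plate: np.ndarray, alpha: float = 0.01):
    """Flag rows/columns whose mean differs from the rest of the plate."""
    flagged = []
    for axis, label in ((0, "row"), (1, "col")):
        for i in range(plate.shape[axis]):
            line = np.take(plate, i, axis=axis)
            rest = np.delete(plate, i, axis=axis).ravel()
            _, p = stats.ttest_ind(line, rest, equal_var=False)
            if p < alpha:
                flagged.append((label, i, p))
    return flagged

plate = np.random.default_rng(2).normal(0.0, 1.0, size=(8, 12))
plate[3, :] += 1.5  # inject a hypothetical row effect
print(scan_plate(plate))
```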
This means that there is no statistical evidence of bias for columns and rows in McMaster Test assay. For comparative purposes, we corrected the raw McMaster data using the B-score method in all plates where systematic error was detected by the t-test.
Unlike the artificially generated data used in the simulation study, McMaster Test assay contained replicated plates: every compound of the assay was screened twice [18]. We adjusted our hit selection procedure to search for average hits; thus, we first calculated the average of the two compound measurements and then used it in the hit selection process. If systematic error was detected only in the first plate and was therefore corrected using the B-score method, the residuals produced by B-score were not comparable with the raw values of the second, uncorrected plate.
In order to make the measurements in both plates comparable, we normalized both plates by means of the Z-score method prior to calculating the average compound activity. The obtained results are reported in Tables 3 and 4, respectively.
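A minimal sketch of this normalization on hypothetical replicate plates:

```python
import numpy as np

rng = np.random.default_rng(3)
plate_rep1 = rng.normal(100.0, 12.0, size=(8, 12))  # hypothetical replicate 1
plate_rep2 = rng.normal(95.0, 10.0, size=(8, 12))   # hypothetical replicate 2

def z_score(plate: np.ndarray) -> np.ndarray:
    """Plate-wise Z-score: centre on the plate mean, scale by the plate SD."""
    return (plate - plate.mean()) / plate.std(ddof=1)

# Replicates on a common scale can be averaged before hit selection
avg_activity = (z_score(plate_rep1) + z_score(plate_rep2)) / 2
print(avg_activity.shape)
```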
A comparison between the original set of hits and the newly selected hits is also made in these tables. In fact, these tables report how many of the original hits remained hits, how many of them were removed and how many new hits were selected.
While the hit distribution surface is useful for detecting the presence of overall bias, it does not capture the variability of the bias on a plate-by-plate basis.
Finally, we also applied the Well correction method to remove systematic error from McMaster Test assay. Table 5 reports the comparative results of the two hit selections.
Figure 8 presents a summary of our experiments conducted with McMaster Test assay. The pairwise intersections between the three obtained sets of hits are presented. The dashed grey area in the middle represents the intersections between the three hit sets and thus defines the consensus hits for McMaster Test assay.
The results provided by the B-score method show that this data correction procedure tends to overestimate, at least when compared to Z-score and Well correction, the number of hit compounds. On the other hand, the results provided by the Well correction method suggest that about one third of the original hits could be, in fact, false positives, and that a similar percentage of true hits could be missed (false negatives) if the systematic error present in the raw McMaster data is not identified and removed adequately.
Intersections between the original set of hits (96 hits in total) and the sets of hits obtained after the application of the B-score (carried out only on the plates where systematic error was detected) and Well correction methods, computed for McMaster Test assay.
In this article we discussed and tested three methods for detecting the presence of systematic error in experimental HTS assays.
We conducted a comprehensive simulation study with artificially generated HTS data, constructed to model a variety of real-life situations. The variants of each dataset, comprising different hit percentages and various types and levels of systematic error, were examined.
The experimental results show that the methods' performances depend on the assay parameters: plate size, hit percentage, and the type and variance of systematic error. The t-test demonstrated high robustness when applied to a variety of artificial datasets. We can thus recommend the t-test as a method of choice in experimental HTS. On the contrary, the Discrete Fourier Transform followed by the K-S test, advocated in some works [20, 21], yielded very disappointing results.
Moreover, the latter technique required a lot of computational power but provided the worst overall performance among the three competing statistical procedures.
The main reason for such a disappointing performance of the K-S test is that it was applied, as recommended in [20], to data already transformed by the Discrete Fourier method.
Figure 19SM presents an example of data from one of the simulated well plates before and after the application of the Discrete Fourier Transform. The raw data followed a normal distribution and contained random error only, i.e., no systematic error. In general, its performance was lower than that of the t-test and was very sensitive to the type of systematic error as well as to its variance. In addition to the experiments with simulated data, we applied the three discussed systematic error detection tests to real HTS data.
Our goal was to evaluate the impact of systematic error on the hit selection process in experimental HTS. The obtained results (see the Tables and Figure 8) confirm that if raw HTS data are not treated properly to eliminate the effect of systematic error, many false positive and false negative hits may result. Special attention should be paid to controlling the results of aggressive data normalization procedures, such as B-score, which can easily do more damage by introducing biases into raw HTS data and therefore lead to the selection of many false positive hits, even when the data do not contain any kind of systematic error.
Our general conclusion is that a successful assessment of the presence of systematic error in experimental HTS assays is achievable when the appropriate statistical methodology is used. Such testing should help improve the quality of selected hits by discarding many potential false positives and suggesting new, potentially real, active compounds.
The t-test should be used in conjunction with data correction techniques such as Well correction [5, 6], when the row or column systematic error detected by the test repeats across all plates of the assay, or B-score [3] or the trimmed-mean polish score [7], when the systematic error varies across plates.
Thus, we recommend adding an extra preliminary systematic error detection and correction step to all HTS processing software and using consensus hits in order to improve the overall accuracy of HTS analysis.
Heyse S: Comprehensive analysis of high-throughput screening data. Proc of SPIE.
Dove A: Screening for content - the evolution of high throughput. Nat Biotechnol.
Kaul A: The impact of sophisticated data analysis on the drug discovery process. Business Briefing: Future Drug Discovery.
Cohen J: A coefficient of agreement for nominal scales. Educ Psychol Meas.
Fleiss JL: Statistical methods for rates and proportions. New York.
We thank Professor Jean-Jacques Daudin and two anonymous referees for their helpful comments. All authors read and approved the final manuscript.
Correspondence to Vladimir Makarenkov. PD and VM designed the study. PD implemented the statistical tests and carried out the simulations. VM and RN supervised the project and coordinated the methodological development. Additional file 1 (DOC, 5 MB) includes the Supplementary Materials for the article.
Dragiev P, et al.: Systematic error detection in experimental high-throughput screening. BMC Bioinformatics 12:25.
Received: 30 June; accepted: 19 January.