This post is part 3 of a 3-part series that provides an overview of the development and validation of VoiceSignals’ People Intelligence Platform©.
The overall development and validation steps are shown in Table 1, along with the part of the series in which each step is discussed. Links to parts 1 and 2 are provided at the end of this post.
The procedures and techniques used to develop and validate the AI are in accordance with best practices and standards for both AI engineering and psychometric assessment.
The development of AI algorithms proceeded with high-quality data (described in parts 1 and 2). The first step involved extracting over 500 acoustic features (e.g., pitch, amplitude) from the speech files and storing them as predictive variables. These acoustic variables were then combined in a dataset with the raters' aggregated ratings of personality, mindset, and emotion.
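The platform's actual feature set and extraction pipeline are proprietary, but the general idea of turning a speech signal into numeric predictor variables can be sketched with two of the example features mentioned above, pitch and amplitude. The function below is a minimal, numpy-only illustration (the function and feature names are hypothetical, not the platform's API), using RMS energy for amplitude and a simple autocorrelation peak for pitch:

```python
import numpy as np

def extract_basic_features(signal: np.ndarray, sample_rate: int) -> dict:
    """Illustrative acoustic features: RMS amplitude and an
    autocorrelation-based fundamental-frequency (pitch) estimate."""
    # RMS amplitude: overall loudness of the frame.
    rms = float(np.sqrt(np.mean(signal ** 2)))

    # Autocorrelation pitch estimate: find the lag of the strongest
    # self-similarity in the waveform and convert it to a frequency.
    corr = np.correlate(signal, signal, mode="full")[len(signal) - 1:]
    min_lag = sample_rate // 500   # ignore pitches above 500 Hz
    max_lag = sample_rate // 50    # ignore pitches below 50 Hz
    best_lag = min_lag + int(np.argmax(corr[min_lag:max_lag]))
    pitch_hz = sample_rate / best_lag

    return {"rms_amplitude": rms, "pitch_hz": pitch_hz}

# Demo on a synthetic 220 Hz tone standing in for one speech frame.
sr = 16_000
frame = 0.5 * np.sin(2 * np.pi * 220 * np.arange(2048) / sr)
features = extract_basic_features(frame, sr)
```

In a real pipeline, hundreds of such features would be computed per recording and stored as one row of the predictor dataset, alongside the raters' aggregated scores as targets.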
Multiple algorithms were then developed to predict a speaker's psychological characteristics from the acoustic features present in their speech. Each algorithm was developed and tested using a thorough two-step validation process. First, each algorithm was created using a k-fold cross-validation³ strategy. This procedure involves iteratively training and testing the algorithm on complementary subsets of a large dataset. The second validation step tested the algorithm's performance on a new set of data on which the algorithm had not previously been trained. This type of validation is often referred to as a holdout validation strategy. Combining the k-fold and holdout validation approaches is considered best practice for testing the performance of an algorithm.
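The two-step scheme can be sketched end to end on synthetic data. This is a generic illustration (a plain least-squares model on made-up features, not the platform's algorithms): step 1 runs 5-fold cross-validation on the training portion, and step 2 scores the final model once on a holdout set that was never used in training.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in: 500 speakers x 20 acoustic features; a continuous
# trait score driven by the first 3 features plus noise.
X = rng.normal(size=(500, 20))
y = X[:, :3] @ np.array([1.5, -2.0, 1.0]) + rng.normal(scale=0.5, size=500)

# Reserve a holdout set the model never sees during development.
split = 400
X_train, y_train = X[:split], y[:split]
X_hold, y_hold = X[split:], y[split:]

def fit_ols(X, y):
    """Ordinary least squares (illustrative model, not the platform's)."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

def r2_score(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1 - ss_res / ss_tot

# Step 1: k-fold cross-validation within the training data.
k = 5
folds = np.array_split(rng.permutation(split), k)
cv_scores = []
for i in range(k):
    val_idx = folds[i]
    trn_idx = np.concatenate([folds[j] for j in range(k) if j != i])
    beta = fit_ols(X_train[trn_idx], y_train[trn_idx])
    cv_scores.append(r2_score(y_train[val_idx], X_train[val_idx] @ beta))

# Step 2: final evaluation on the unseen holdout set.
beta = fit_ols(X_train, y_train)
holdout_r2 = r2_score(y_hold, X_hold @ beta)
```

Cross-validation guards against overfitting during model selection; the holdout score then estimates how the chosen model will behave on genuinely new speakers.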
The metrics used to assess the performance of each algorithm were identified before model testing. They included the Positive Predictive Value⁴ (PPV) measure of accuracy for emotions, because emotions were coded as binary variables in the dataset, and the Coefficient of Determination⁵ (R²) for both the personality traits and the mindset characteristics, because these were coded on continuous scales. Minimum thresholds were selected for both performance metrics.
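Both metrics are standard and easy to compute directly. As a quick sketch: PPV is the fraction of positive predictions that are correct, and R² is the proportion of variance in the target that the predictions explain.

```python
import numpy as np

def positive_predictive_value(y_true, y_pred):
    """PPV (precision): of all positive predictions, the fraction correct."""
    true_pos = np.sum((y_pred == 1) & (y_true == 1))
    return true_pos / np.sum(y_pred == 1)

def coefficient_of_determination(y_true, y_pred):
    """R^2: proportion of target variance explained by the predictions."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1 - ss_res / ss_tot

# Binary emotion labels: 4 positive predictions, 3 of them correct.
y_true_bin = np.array([1, 0, 1, 1, 0, 1])
y_pred_bin = np.array([1, 1, 1, 0, 0, 1])
ppv = positive_predictive_value(y_true_bin, y_pred_bin)   # 3/4 = 0.75

# Continuous trait scores: predictions close to the true values.
y_true_cont = np.array([2.0, 3.5, 1.0, 4.0])
y_pred_cont = np.array([2.1, 3.3, 1.2, 3.8])
r2 = coefficient_of_determination(y_true_cont, y_pred_cont)
```

Fixing the metrics and their minimum thresholds before testing, as described above, prevents metric shopping after the results are in.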
Results from algorithm performance in the holdout dataset are presented below. Multiple algorithms were tested for each psychological characteristic, and penalized regression-based algorithms outperformed the alternatives. In AI algorithm development, penalization refers to a mathematical function that identifies a subset of crucial predictor variables from a much larger set of candidates. Applying penalization produces algorithms that have better predictive accuracy when exposed to new data and that are more transparent and explainable. The Least Absolute Shrinkage and Selection Operator (LASSO) penalization method was used for all regression algorithms described below.
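The mechanism by which LASSO selects a subset of predictors is soft-thresholding: coefficients whose contribution falls below the penalty level are set exactly to zero. The sketch below implements LASSO via iterative soft-thresholding (ISTA) on synthetic data where only 2 of 10 features matter; it is a generic textbook illustration, not the platform's implementation.

```python
import numpy as np

def soft_threshold(z, t):
    """LASSO's proximal step: shrink toward zero; magnitudes below t
    become exactly zero, which is what removes predictors."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_ista(X, y, alpha=0.1, n_iter=500):
    """LASSO via iterative soft-thresholding (ISTA).
    Minimizes (1/2n)||y - Xb||^2 + alpha * ||b||_1."""
    n, p = X.shape
    step = n / np.linalg.norm(X, 2) ** 2   # safe step: 1 / Lipschitz const
    beta = np.zeros(p)
    for _ in range(n_iter):
        grad = -X.T @ (y - X @ beta) / n
        beta = soft_threshold(beta - step * grad, step * alpha)
    return beta

# Only the first two of ten features actually drive the target.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 10))
true_beta = np.array([3.0, -2.0, 0, 0, 0, 0, 0, 0, 0, 0])
y = X @ true_beta + rng.normal(scale=0.1, size=200)
beta = lasso_ista(X, y, alpha=0.1)
# Penalization zeroes out the irrelevant features, leaving a sparse,
# interpretable model.
```

The surviving nonzero coefficients are what make a penalized model more transparent: each retained acoustic predictor can be inspected and explained.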
LASSO logistic regression algorithms were used to predict each of the eight basic emotions shown in Table 5, given the binary coding of emotion labels in the dataset. All emotion algorithms passed the minimum PPV threshold. Each algorithm was statistically significant, and the average accuracy (PPV) across all algorithms was very high.
All personality trait algorithms passed the minimum R² threshold and showed excellent prediction of the personality traits. All were statistically significant.
All mindset algorithms passed the minimum R² threshold and were statistically significant.
As mentioned above, the AI algorithms displayed very strong predictive accuracy during development and when exposed to new data.
Typically, the development and testing of AI algorithms would be considered complete at this stage, and the algorithms would be ready for use in real-life settings. However, best practices and standards in psychological assessment outline additional steps that should be taken to demonstrate the performance and efficacy of any algorithm designed to predict the psychological characteristics of a person or group. These steps are discussed in the following section.
The definition of the term "test" used in the Standards for Educational and Psychological Testing includes algorithms designed to predict any psychological characteristic. Therefore, the algorithms described in Section 7 fall within the scope of these standards. Essentially, the standards outline a series of investigations that should be conducted before any test is considered valid for a particular use case.
This section presents the validity evidence of the personality trait, emotional state, and mindset scores provided by VoiceSignals' People Intelligence Platform©. The procedures and analyses described below are aligned with the Standards for Educational and Psychological Testing.
The investigation of the validity of any psychological assessment typically involves a series of qualitative and/or quantitative analytical methods. Two frequently used methods recommended by the Standards for Educational and Psychological Testing are known as content validity and construct validity analyses. These two analytical methods were used to investigate the validity of the scores output in the People Intelligence Platform©. These analyses and the results are described below.
Content validity involves a subjective judgment made by a group of Subject Matter Experts (SMEs) regarding the degree to which an assessment adequately measures the target psychological characteristics, based on the content (e.g., definitions, questions, rating scales) of the measures included in the evaluation.
The investigation of the content validity of the People Intelligence Platform© involved a review of the degree to which the measures of personality traits, emotional states, and mindset adequately captured all aspects of personality, emotion, and mindset without including any unintended information (e.g., motivations or interests, which are not considered part of personality, emotion, or mindset). As mentioned in part 1, the measures of personality traits and emotional states were developed from previously validated and widely used frameworks in the psychological science literature. Using these frameworks ensured that all relevant aspects of personality (i.e., the 30 facets) and emotion (i.e., the eight basic emotions) were included in the respective measures and that no unrelated information was accidentally included. These measures were developed and reviewed by separate groups of SMEs in psychometrics and psychological assessment and were found to have very high content validity.
Construct validity involves an objective evaluation of how accurately an assessment measures the target psychological characteristics. This consists of comparing the scores produced by an assessment with scores from other measures of both similar and dissimilar psychological characteristics. For example, the scores from a new measure of agreeableness might be compared with existing, previously validated measures of acceptance and intelligence. To the degree that scores from the new measure (i.e., agreeableness) are highly related to scores on the similar characteristic (i.e., acceptance) and unrelated to scores on the dissimilar characteristic (i.e., intelligence), the new measure would have high construct validity.
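The agreeableness example above reduces to two correlations: one expected to be high (convergent) and one expected to be near zero (discriminant). The sketch below simulates that pattern on made-up scores; the variable names and data are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 300  # simulated respondents

# A shared latent trait drives both the new agreeableness measure and
# the established acceptance measure; intelligence is unrelated.
latent = rng.normal(size=n)
agreeableness_new = latent + rng.normal(scale=0.5, size=n)
acceptance_existing = latent + rng.normal(scale=0.5, size=n)
intelligence_existing = rng.normal(size=n)

convergent_r = np.corrcoef(agreeableness_new, acceptance_existing)[0, 1]
discriminant_r = np.corrcoef(agreeableness_new, intelligence_existing)[0, 1]
# A high convergent correlation together with a near-zero discriminant
# correlation is the pattern that supports construct validity.
```

In practice, this comparison is run across the full matrix of measured characteristics, and the observed correlation pattern is checked against the pattern expected from the literature.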
The investigation of the construct validity of the People Intelligence Platform© involved a series of analyses comparing the expected with the actual relationships between the measured psychological characteristics. Expected relationships were established from a review of the psychological science literature. These expected relationships were then compared with the actual relationships revealed through a series of correlation analyses. The results showed a strong correspondence between the expected and the actual relationships among the personality traits, emotional states, and mindset. This replication of the relationships reported in the scientific literature is evidence of the construct validity of the AI algorithms embedded in the People Intelligence Platform©.
A second investigation into the construct validity of the People Intelligence Platform© was conducted by comparing the scores produced by each AI algorithm with human SME evaluations of a speaker's psychological characteristics using a new sample of audio files.
Human SMEs listened to each speaker and rated how closely the algorithm's score on each psychological characteristic matched their own evaluation. Results showed that the AI ratings agreed very closely with the human SME ratings.
This series of posts provided an overview of the development and validation methodology behind VoiceSignals' People Intelligence Platform©. Experts in AI engineering and psychological assessment designed and developed the platform and validated it according to best practices and standards in AI and psychological science. The validation evidence presented in this series shows that the algorithms embedded in the People Intelligence Platform© are valid predictors of important psychological characteristics and can be applied in real-world settings. The People Intelligence Platform© has been designed to continue improving its predictions as it learns from more data, and its validity evidence is therefore expected to strengthen over time.
Barrick, M. R., & Mount, M. K. (1991). The Big Five personality dimensions and job performance: A meta-analysis. Personnel Psychology, 44(1), 1-26.
Karlan, D., Mullainathan, S., & Robles, O. (2012). Measuring personality traits and predicting loan default with experiments and surveys. In Banking the World: Empirical Foundations of Financial Inclusion (pp. 393-410).
Kassarjian, H. H. (1971). Personality and consumer behavior: A review. Journal of Marketing Research, 8(4), 409-418.
American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (2014). Standards for Educational and Psychological Testing. Washington, DC: American Educational Research Association.
Costa, P. T., Jr., & McCrae, R. R. (1995). Domains and facets: Hierarchical personality assessment using the Revised NEO Personality Inventory. Journal of Personality Assessment, 64(1), 21-50.
Weidman, A. C., Steckler, C. M., & Tracy, J. L. (2017). The jingle and jangle of emotion assessment: Imprecise measurement, casual scale usage, and conceptual fuzziness in emotion research. Emotion, 17(2), 267.
Plutchik, R. (2001). The nature of emotions: Human emotions have deep evolutionary roots, a fact that may explain their complexity and provide tools for clinical practice. American Scientist, 89(4), 344-350.
Koo, T. K., & Li, M. Y. (2016). A guideline of selecting and reporting intraclass correlation coefficients for reliability research. Journal of Chiropractic Medicine, 15(2), 155-163.
Khalil, R. A., Jones, E., Babar, M. I., Jan, T., Zafar, M. H., & Alhussain, T. (2019). Speech emotion recognition using deep learning techniques: A review. IEEE Access, 7, 117327-117345.
Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences (2nd ed.). New York, NY: Academic Press.