cross-validation
in journals of the Animal Science group
-
The accuracy of breeding values for body size latent trait in pigs under different prediction models
The present study was performed to quantify a latent variable for body size (BS) from five linear body measurements: body length (BL), body height (BH), chest width (CW), chest girth (CG), and tube girth (TG). The study population consisted of N = 5573 Yorkshire pigs, of which 592 were genotyped using a PorcineSNP80 BeadChip. The body size latent variable was determined using Confirmatory Factor Analysis (CFA). The accuracy of breeding values was then obtained using pedigree-based best linear unbiased prediction (PBLUP), genomic best linear unbiased prediction (GBLUP), and single-step genomic best linear unbiased prediction (ssGBLUP) models. The overall fit indices for BS, namely standardized root mean square residual (SRMR), root mean square error of approximation (RMSEA), Tucker-Lewis Index (TLI), and comparative fit index (CFI), were 0.03, 0.09, 0.93, and 0.96, respectively, which implies that the model considered for the BS construct was adequate. Model performance was measured in a 5-fold cross-validation with 10 repeats to obtain a more stable estimate. The accuracy of the models was compared via the correlation between predicted breeding values (PBV) and estimated breeding values (EBV), which was 0.37, 0.30, and 0.28 for PBLUP, ssGBLUP, and GBLUP, respectively. Furthermore, goodness of fit was measured by the mean square error (MSE) and Pearson's correlation r(y, ŷ) between observed and predicted phenotypes. The lowest MSE and the highest Pearson's correlation were obtained under PBLUP, while the highest MSE and the lowest Pearson's correlation were obtained under GBLUP. The results showed that GBLUP generally provided lower prediction accuracy than PBLUP and ssGBLUP, and that ssGBLUP in turn provided lower prediction accuracy than traditional PBLUP. The performance of ssGBLUP and GBLUP was lower than expected, mainly due to the small number of genotyped animals.
Keywords: body dimension, cross-validation, genetic evaluation, latent variable, pig
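The evaluation scheme in this abstract (5-fold cross-validation with 10 repeats, scored by Pearson's correlation and MSE) can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the toy data, the helper names, and the single-covariate least-squares model standing in for PBLUP/GBLUP/ssGBLUP are all assumptions.

```python
import random
from statistics import mean

def repeated_kfold(n, k=5, repeats=10, seed=1):
    """Yield (train_idx, test_idx) for every fold of every repeat."""
    rng = random.Random(seed)
    for _ in range(repeats):
        idx = list(range(n))
        rng.shuffle(idx)                      # new random partition per repeat
        for f in range(k):
            test = idx[f::k]                  # every k-th shuffled record forms one fold
            held_out = set(test)
            train = [i for i in idx if i not in held_out]
            yield train, test

def pearson(a, b):
    """Pearson correlation between two equal-length sequences."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5

def mse(obs, pred):
    """Mean square error between observed and predicted values."""
    return mean((o - p) ** 2 for o, p in zip(obs, pred))

def fit_line(x, y):
    """Placeholder model (assumption): least squares with one covariate."""
    mx, my = mean(x), mean(y)
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return my - slope * mx, slope

# Hypothetical toy data: one covariate x and a noisy phenotype y.
rng = random.Random(42)
x = [rng.gauss(0, 1) for _ in range(200)]
y = [0.5 * xi + rng.gauss(0, 1) for xi in x]

corrs, errors = [], []
for train, test in repeated_kfold(len(y)):
    b0, b1 = fit_line([x[i] for i in train], [y[i] for i in train])
    pred = [b0 + b1 * x[i] for i in test]
    obs = [y[i] for i in test]
    corrs.append(pearson(obs, pred))
    errors.append(mse(obs, pred))

print(f"folds evaluated: {len(corrs)}")
print(f"mean r(y, y_hat) = {mean(corrs):.3f}, mean MSE = {mean(errors):.3f}")
```

Averaging over the 50 folds (5 folds times 10 repeats) smooths out the dependence of the score on any single random partition, which is the usual reason for repeating the split.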
-
To assess the accuracy of genomic evaluation of milk production traits in Iranian Holstein cattle in the presence of genotype by environment interaction, 344170, 135000, and 156840 first-lactation test-day records of milk, fat, and protein yield from 34417, 13500, and 15684 cows, respectively, were used together with SNP markers of 1935 genotyped sires. The production data, collected from 2013 to 2018 (1392 to 1397 in the Iranian calendar), were retrieved from the database of the Animal Breeding Center and Productions Improvement of Iran. To account for genotype by environment interaction, the mean temperature-humidity index (THI) over the three days before each test day was used as a continuous environmental covariate. THI values came from the 35 meteorological stations closest to 139 Holstein herds with test-day records in 13 provinces. (Co)variance components were estimated with a single-trait random regression model using second-order orthogonal Legendre polynomials for days in milk and THI in AIREMLF90 software. The results showed that changes in THI across lactation led to changes in additive genetic variance. The heritability of milk production traits over lactation followed the same trend as the additive genetic variance. Cross-validation comparing models with and without THI showed that including genomic information in the predictive model increased prediction accuracy, and that including THI reduced bias. Because the performance of bulls' daughters changes across days in milk and across THI values, genotype by environment interaction should be considered when selecting bulls for different conditions.
Keywords: cross-validation, dairy cattle, genomic evaluation, random regression model, temperature-humidity index
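The two covariates named in this abstract, the mean THI over the three days before a test day and second-order Legendre polynomials of days in milk and THI, could be computed roughly as below. This is a sketch under stated assumptions: the function names, the lactation range of 5 to 305 days, and the sqrt((2k+1)/2) normalization (a common convention in random regression models) are not taken from the abstract.

```python
def mean_thi_before(daily_thi, test_day, window=3):
    """Mean THI over the `window` days preceding `test_day`.

    `daily_thi` is a list of daily THI values and `test_day` an index into it;
    the test day itself is excluded.
    """
    if test_day < window:
        raise ValueError("not enough days before the test day")
    days = daily_thi[test_day - window:test_day]
    return sum(days) / len(days)

def legendre_covariates(value, lower, upper):
    """Normalized Legendre covariates of order 0 to 2 for one record."""
    x = -1.0 + 2.0 * (value - lower) / (upper - lower)   # rescale to [-1, 1]
    polys = (1.0, x, 0.5 * (3.0 * x * x - 1.0))          # P0, P1, P2
    # Assumed normalization: multiply P_k by sqrt((2k + 1) / 2)
    return [((2 * k + 1) / 2) ** 0.5 * p for k, p in enumerate(polys)]

# Hypothetical record: test day at index 4 of a short THI series, 155 days in milk.
thi = mean_thi_before([60.0, 62.0, 64.0, 66.0, 70.0], test_day=4)
dim_cov = legendre_covariates(155, lower=5, upper=305)
print(thi)       # 64.0
print(dim_cov)
```

In a random regression model, each animal's additive genetic effect is a linear combination of such covariates, which is what allows the genetic variance to change along the days-in-milk and THI gradients.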
-
Initial setting and optimization of the input parameters of machine learning methods is an essential step toward maximum genomic prediction accuracy. In this study, genomic populations were simulated on 29 chromosomes for different levels of heritability (0.05 and 0.2), linkage disequilibrium (low and high), and number of quantitative trait loci (200 and 600). To create different proportions of the binary threshold phenotype, individuals of the reference population were assigned code 1 (unfavorable phenotype) when their residual was below ē − 1 SDe (first approach) or when they belonged to 50 percent of the population (second approach), and all other animals were assigned code 0 (favorable phenotype). To optimize the model input parameters, different levels of the number of sampled SNPs (mtry = 100, 1000, and 2000), the number of bootstrap samples (ntree = 500, 1000, and 2000), and the minimum terminal node size (node size = 1 and 5) were considered for Random Forest, and different levels of the number of trees (ntree = 100, 1000, and 2000), tree depth (tc = 1, 5, and 10), and learning rate (lc = 0.1 and 0.05) for Boosting. The lowest out-of-bag error was observed with mtry = 2000, ntree = 1000, and node size = 1, and the lowest validation error for Boosting with ntree, tc, and lc of 1000, 10, and 0.05, respectively. The genomic prediction accuracy of Random Forest and Boosting increased as the proportion of unfavorable phenotypes decreased (first approach). Overall, Boosting outperformed Random Forest in all scenarios, which can be attributed to its accounting for interactions among markers, its self-correction, and its strong ability to reduce model error.
Keywords: validation, threshold traits, linkage disequilibrium, heritability, machine learning
Introduction
The development of genotyping technologies has facilitated genetic progress in breeding programs through genomic selection (GS). GS has improved the accuracy of genomic evaluations and has spread quickly in livestock breeding. For several decades, selection in dairy cattle populations focused mostly on continuous traits, especially milk yield. From an animal breeding perspective, attention to this category of traits reduces the genomic merit of novel functional traits because of the negative correlation between the two groups. Continued progress and growing economic returns in modern breeding programs require a better understanding and the direct inclusion of novel functional traits. Many prominent traits in livestock, including disease resistance and calving difficulty, show a binary distribution of phenotypes and are often termed threshold traits. These traits matter in animal breeding because of animal welfare and consumer demand for healthy, high-quality products. The threshold nature of most functional traits, their control by multiple genes, and their departure from Mendelian inheritance and from the normal distribution make accurate prediction of GEBV with statistical methods challenging. Machine learning, as a non-parametric methodology, is commonly used to address the challenges of genomic selection for threshold traits.
Random Forest (RF) and Boosting are powerful machine learning methods used to recognize gene-gene, protein-protein, and gene-environment interactions, to detect disease-associated genes, to model relationships among combinations of markers, to select genes associated with a target trait, to identify regulatory factors in protein and DNA sequences, to classify samples in microarray gene expression data, and to improve the accuracy of genomic prediction. The objective of the current study was to investigate the effect of the threshold phenotype rate in the training set and of different genomic architectures on the performance of RF and Boosting. In this regard, predetermining and tuning the input parameters of each method is a basic step toward maximum genomic accuracy.
Materials and Methods
A population of 2090 animals genotyped for 10,000 markers was simulated using QMSim software. In the first phase, a historical population of 1045 females and 1045 males was simulated over 1,000 generations. In the second phase, a bottleneck was applied to produce a realistic level of LD: the population size decreased over 100 generations to 209 individuals. In the third phase, the population size increased over 100 generations to 2090 individuals (2030 females and 60 males). All 2090 individuals of the last historical generation served as founders, and the recent population was expanded from them by simulating an additional 10 generations under random mating. During these generations, the replacement ratio was set at 0.2 for females and 0.50 for males, and candidates were selected based on EBV and age. Each mating produced one offspring with equal probability of being male or female. Individuals of generations 6 to 9 were used as the training set, while the whole of generation 10 was used as the validation set. Genomic populations were simulated to reflect variation in heritability (0.05 and 0.20), linkage disequilibrium (low and high), and number of QTL (200 and 600) on 29 chromosomes. Four scenarios were simulated: I (10K SNP, h2 = 0.20, LD = low, 600 QTL), II (10K SNP, h2 = 0.20, LD = low, 200 QTL), III (10K SNP, h2 = 0.05, LD = low, 200 QTL), and IV (10K SNP, h2 = 0.05, LD = high, 200 QTL). To create different rates of the discrete phenotype, animals in the training set were coded 1 (inappropriate phenotype) when their phenotype residual was less than the average of residuals (ē) in the first approach or less than ē − 1 SDe in the second approach, and all other individuals were coded 0 (appropriate phenotype).
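The two coding rules for the binary phenotype can be sketched as a single thresholding function. This is only an illustration: the function name, the use of the sample standard deviation, and the simulated residuals are assumptions rather than details from the study.

```python
import random
from statistics import mean, stdev

def code_threshold(residuals, n_sd=0.0):
    """Code 1 (inappropriate) if a residual is below mean - n_sd * SD, else 0."""
    cutoff = mean(residuals) - n_sd * stdev(residuals)
    return [1 if e < cutoff else 0 for e in residuals]

# Small deterministic check.
e = [-2.0, -1.0, 0.0, 1.0, 2.0]
print(code_threshold(e))             # cutoff at the mean: [1, 1, 0, 0, 0]
print(code_threshold(e, n_sd=1.0))   # cutoff at mean - 1 SD: [1, 0, 0, 0, 0]

# Hypothetical normally distributed residuals to show the resulting rates.
rng = random.Random(7)
resid = [rng.gauss(0.0, 1.0) for _ in range(2000)]
rate_mean = mean(code_threshold(resid))
rate_1sd = mean(code_threshold(resid, n_sd=1.0))
print(f"rate with cutoff at mean: {rate_mean:.2f}")
print(f"rate with cutoff at mean - 1 SD: {rate_1sd:.2f}")
```

For normally distributed residuals, a cutoff at the mean labels about half of the animals as code 1, while a cutoff one standard deviation below the mean labels roughly 16 percent, so the two rules produce clearly different class balances in the training set.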
To tune the input parameters of the models, different levels of mtry (100, 1000, and 2000), ntree (500, 1000, and 2000), and nodesize (1 and 5) were considered for RF, and different levels of ntree (500, 1000, and 2000), tc (1, 5, and 10), and lc (0.1 and 0.05) for Boosting.
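The tuning grids listed above can be enumerated exhaustively. The sketch below only builds the candidate settings and picks the one with the lowest error from a user-supplied function; `expand`, `best_setting`, and the error function are hypothetical names, and no model is actually fitted here.

```python
from itertools import product

# Parameter levels as listed in the text.
rf_grid = {"mtry": [100, 1000, 2000], "ntree": [500, 1000, 2000], "nodesize": [1, 5]}
boost_grid = {"ntree": [500, 1000, 2000], "tc": [1, 5, 10], "lc": [0.1, 0.05]}

def expand(grid):
    """Return every combination of parameter levels as a list of dicts."""
    names = list(grid)
    return [dict(zip(names, values))
            for values in product(*(grid[name] for name in names))]

def best_setting(grid, error_fn):
    """Return the setting with the lowest error.

    `error_fn` is supplied by the caller, e.g. a function that fits the model
    and returns its out-of-bag or cross-validation error (not implemented here).
    """
    return min(expand(grid), key=error_fn)

rf_settings = expand(rf_grid)
boost_settings = expand(boost_grid)
print(len(rf_settings), len(boost_settings))   # 18 18
print(rf_settings[0])
```

Each grid has 3 x 3 x 2 = 18 combinations, so a full search fits 18 models per method and per scenario.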
Results and Discussion
The lowest out-of-bag (OOB) error for RF was obtained with mtry = 2000, ntree = 1000, and nodesize = 1, while the lowest cross-validation (CV) error for Boosting was observed with ntree = 1000, tc = 10, and lc = 0.05. Across all scenarios, RF showed a wider range of genomic accuracy (0.287 to 0.57) than Boosting (0.4 to 0.58). The accuracy of genomic prediction decreased in both RF and Boosting as the proportion of inappropriate phenotypes increased. In the first approach (cutoff at ē), more individuals lie close to the population average than in the second approach (cutoff at ē − 1 SDe), which leads to more classification (coding) errors and therefore lower genomic prediction accuracy. RF and Boosting performed well when high-heritability traits were controlled by a large number of QTLs. An increase in the number of QTLs generally led to a major improvement in RF accuracy, while only a negligible positive effect was found for Boosting.
Conclusion
The composition of the training set and the genomic architecture of the population were two basic factors affecting the accuracy of genomic prediction with machine learning methods. Boosting accounts for interactions among predictor variables (SNPs), is self-correcting, and has a strong ability to reduce training error, which resulted in more accurate prediction than RF under all scenarios.
Keywords: Cross validation, Heritability, Linkage disequilibrium, Machine learning, Threshold traits
-
In this study, a Bayesian approach via Gibbs sampling was used to predict the unknown parameters of five equivalent Genomic Best Linear Unbiased Prediction (G-BLUP) models. Each model used a differently scaled G matrix: one built with allele frequencies of the founder population (Gfoun), one with allele frequencies of the reference population (Gref), one with allele frequencies equal to 0.5 (G05), a normalized matrix with average diagonal elements equal to 1 (Gnorm), and a G matrix weighted with the A matrix (Gwei). A random mating population and a selected population were used to compare results for a trait with heritability of 0.25 on a genome of three chromosomes with 105 QTLs and 3000 single nucleotide polymorphisms. The results showed that the elements of the G matrices had higher variance than those of the A matrix. Average diagonal and off-diagonal elements, except for Gnorm and Gwei, were higher than the corresponding elements in A. The Gnorm-BLUP and G05-BLUP methods led to inflated estimates of genetic variance in contrast to the other three methods, and this inflation was lower in the selected population. Average accuracy over the five G-BLUP models was 0.084 higher in the random population than in the selected population (0.736 vs. 0.652), and average bias was 0.014 lower (0.026 vs. 0.04). The bias of prediction of true breeding values in the selected population was almost zero with Gwei but greater than 0.06 with Gref. The greatest accuracy and the smallest bias can be obtained by using allele frequencies of the reference population rescaled with the A matrix.
Keywords: Allele frequency, Bayesian approach, cross validation, genomic prediction, predictive ability
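How the choice of allele frequencies changes the scale of G can be illustrated with a small sketch. The abstract does not give the formula used, so the construction below, G = ZZ' / (2 Σ p(1 − p)) with Z the genotype matrix centered by 2p, is an assumption, as are the function names and the toy genotypes. The A-weighted matrix (Gwei) is omitted because its weighting is not specified.

```python
def g_matrix(genotypes, freqs):
    """Genomic relationship matrix from 0/1/2 genotypes and allele frequencies."""
    z = [[g - 2.0 * p for g, p in zip(row, freqs)] for row in genotypes]
    scale = 2.0 * sum(p * (1.0 - p) for p in freqs)
    n = len(z)
    return [[sum(a * b for a, b in zip(z[i], z[j])) / scale for j in range(n)]
            for i in range(n)]

def observed_freqs(genotypes):
    """Allele frequencies computed from the genotyped animals themselves."""
    n = len(genotypes)
    return [sum(col) / (2.0 * n) for col in zip(*genotypes)]

def normalize_diag(g):
    """Rescale a matrix so that its mean diagonal element equals 1."""
    mean_diag = sum(g[i][i] for i in range(len(g))) / len(g)
    return [[v / mean_diag for v in row] for row in g]

# Hypothetical genotypes: 4 animals (rows) by 5 SNPs (columns).
m = [
    [0, 1, 2, 1, 0],
    [1, 1, 2, 0, 0],
    [2, 0, 1, 1, 1],
    [1, 2, 0, 2, 1],
]
g05 = g_matrix(m, [0.5] * 5)            # all frequencies fixed at 0.5
gref = g_matrix(m, observed_freqs(m))   # frequencies from the genotyped sample
gnorm = normalize_diag(gref)            # mean diagonal forced to 1

for name, g in (("G05", g05), ("Gref", gref), ("Gnorm", gnorm)):
    diag = [g[i][i] for i in range(len(g))]
    print(name, [round(d, 3) for d in diag])
```

The same genotypes give different diagonal and off-diagonal values under each scaling, which is why the estimated genetic variance, and with it the bias of predicted breeding values, depends on the frequencies chosen.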