Bridging the auditability gap in consumer medical AI: design and public deployment of a Georgian-language, evidence-based symptom triage system (SheniEkimi) with 150 guideline-anchored scenarios and 29 validated clinical decision rules

DOI: https://doi.org/10.66636/gmj.v1.i2.a113

Abstract

Background  The rapid adoption of consumer-facing generative artificial intelligence (AI) chatbots for self-directed medical advice has exposed users to systematic risks including fabricated citations (“hallucinations,” reported at 28–47% across medical queries), output instability between sessions, and an absence of auditable clinical reasoning. Georgian-speaking populations face compounded risk because commercial AI interfaces offer poor-quality localisation and because no evidence-based, Georgian-language symptom triage tool previously existed. We report the development and public deployment of SheniEkimi, a non-commercial, evidence-anchored alternative.

Objectives  To design, implement, and publicly deploy a Georgian-primary, bilingual (Georgian/English), auditable symptom triage system anchored exclusively in named international clinical guidelines and validated clinical decision rules (CDRs), using a transparent methodology that independent teams can reproduce.

Methods  We constructed a 16-body-system framework adapted from WHO IMAI/IMCI and ICD-11, and selected 150 clinical scenarios meeting four inclusion criteria: assessable without in-hospital diagnostics; backed by an evidence-based triage threshold from an international guideline or externally validated CDR; mappable to ICD-11; and common in primary-care contexts. Sources were appraised using an AGREE II–informed rubric. Twenty-nine CDRs meeting four quality criteria (peer-reviewed derivation; external validation; explicit thresholds; primary-care applicability) were implemented with full multi-question weighted scoring. The tool was built as a single-file WordPress plugin with an Apple Health-inspired interface, a Georgian-primary bilingual architecture, runtime language toggling, and full offline capability. No identifiable personal data is collected or transmitted.
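As an illustration of what "multi-question weighted scoring" with explicit thresholds can look like, the following is a minimal Python sketch. It is not the plugin's code: the `Question` and `Rule` types, the question keys, the weights, and the threshold values are all hypothetical placeholders with no clinical meaning. Only the three output tiers (emergency, same-day primary care, home self-care) are taken from the abstract.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Question:
    key: str      # hypothetical identifier for a yes/no question
    weight: int   # points added to the score when the answer is "yes"


@dataclass(frozen=True)
class Rule:
    name: str
    questions: tuple[Question, ...]
    # (minimum score, tier) pairs; the highest minimum that is met wins
    thresholds: tuple[tuple[int, str], ...]

    def score(self, answers: dict[str, bool]) -> int:
        """Sum the weights of every question answered 'yes'."""
        return sum(q.weight for q in self.questions if answers.get(q.key, False))

    def triage(self, answers: dict[str, bool]) -> str:
        """Map the total score to a triage tier via the explicit thresholds."""
        total = self.score(answers)
        for minimum, tier in sorted(self.thresholds, reverse=True):
            if total >= minimum:
                return tier
        raise ValueError(f"{self.name}: no threshold covers score {total}")


# Placeholder rule: weights and cut-offs are invented for illustration only.
EXAMPLE_RULE = Rule(
    name="example-rule",
    questions=(Question("a", 1), Question("b", 2), Question("c", 3)),
    thresholds=((0, "self_care"), (2, "same_day"), (4, "emergency")),
)

if __name__ == "__main__":
    print(EXAMPLE_RULE.triage({"a": True, "c": True}))
```

Keeping the weights and cut-offs as plain data, separate from the scoring logic, is one way to make each rule auditable against its published source.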

Results  Of 150 scenarios, 69 (46.0%) map to emergency output (112 dispatch), 75 (50.0%) to same-day primary care, and 6 (4.0%) to home self-care with explicit safety-netting. Cardiovascular (n=18), respiratory (n=13), neurological (n=13), gastrointestinal (n=13), and paediatric (n=12) domains together account for 69 of the 150 scenarios (46.0%), reflecting disease-burden weighting and density of validated CDRs. Sources span 39 international guidelines from WHO, NICE, AHA/ACC, ESC, BTS/SIGN, IDSA, ADA, AAP, AGS, RCOG, EAU, GOLD, and specialty consortia, plus 29 externally validated CDRs. Deterministic validation confirms that every CDR’s scoring chain produces valid triage output across its full range. Full source code, methodology, and scenario registry are published under CC BY 4.0.
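One way to read "deterministic validation across the full range" is an exhaustive check: enumerate every possible combination of answers for a rule and confirm that each resulting score maps to a recognised tier. The Python sketch below assumes yes/no questions and uses invented weights and thresholds. The function names and tier labels are hypothetical, and the abstract does not state that the authors' validation works this way.

```python
from itertools import product

# Tier labels mirror the three outputs named in the abstract; spelling is assumed.
VALID_TIERS = {"emergency", "same_day", "self_care"}


def triage(score, thresholds):
    """Return the tier of the highest threshold met, or None if none is met."""
    for minimum, tier in sorted(thresholds, reverse=True):
        if score >= minimum:
            return tier
    return None


def validate_rule(weights, thresholds):
    """Try every yes/no answer combination and collect any invalid outputs."""
    failures = []
    for answers in product((False, True), repeat=len(weights)):
        score = sum(w for w, yes in zip(weights, answers) if yes)
        tier = triage(score, thresholds)
        if tier not in VALID_TIERS:
            failures.append((answers, score, tier))
    return failures


if __name__ == "__main__":
    # Placeholder weights and cut-offs, for illustration only.
    complete = [(0, "self_care"), (2, "same_day"), (4, "emergency")]
    print(validate_rule([1, 2, 3], complete))       # no failures expected
    # A threshold table with no entry for low scores leaves a gap.
    incomplete = [(2, "same_day"), (4, "emergency")]
    print(len(validate_rule([1, 2, 3], incomplete)))
```

Because the answer space of a rule with n yes/no questions has only 2^n members, this kind of check stays cheap for the short questionnaires typical of clinical decision rules.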

Conclusions  SheniEkimi is the first published Georgian-language symptom triage system anchored in peer-reviewed international evidence, and demonstrates that a volunteer-led public-health organisation can produce a transparent, auditable alternative to opaque proprietary symptom-checkers and to generative-AI medical-advice interfaces. The methodology is reproducible; the implementation is open; the operating model precludes commercial drift. The framework is offered as a template for other regional or minority-language digital health initiatives facing the same gap.

Keywords  clinical decision support; symptom triage; evidence-based medicine; digital health; artificial intelligence; AI hallucination; mHealth; Georgia; primary care; AGREE II; ICD-11; health equity.

Published

05/09/2026

How to Cite

Apshinashvili, I., & Pkhakadze, G. (2026). Bridging the auditability gap in consumer medical AI: design and public deployment of a Georgian-language, evidence-based symptom triage system (SheniEkimi) with 150 guideline-anchored scenarios and 29 validated clinical decision rules. Georgian Medical Journal, 1(2), 1–12. https://doi.org/10.66636/gmj.v1.i2.a113