ISWS 2025 - Human-centric AI Evaluation

Created by Irene Celino

1

Human-centric AI Evaluation

Irene Celino - Cefriel - irene.celino@cefriel.com

International Semantic Web Research Summer School - Bertinoro (Italy) - June 10th 2025


2

Multiple Choice

Ice-breaker quiz: Which of the following Knowledge Graphs is well-known and widely used?

  1. DBpasta
  2. GeoMemes
  3. Wikidata
  4. OpenPHACToids

3

Why Knowledge Graphs still matter in the age of AI

“Scientific-technical” reasons

  • (Other) AI for KG: use of AI to automate or augment KG construction

  • KG for (other) AI: use of KG to train AI and to ground AI answers/predictions

  • KG and (other) AI are complementary – KG ground AI, AI helps scale KG


“Business” reasons

  • KG (and other AI) have a role in any knowledge-intensive task

  • KG (and other AI) can support knowledge workers


4

Multiple Choice

Quiz: Who is a Knowledge Worker?

  1. A person selling knowledge
  2. A person changed by the information they process
  3. A person applying knowledge to manual work
  4. A job title for a know-it-all

5

Definitions of knowledge worker

The knowledge worker puts to work what he has learned in systematic education, that is, concepts, ideas and theories, rather than the man who puts to work manual skill or muscle

Peter Drucker, Management: Tasks, Responsibilities and Practices, 1974


The defining characteristic of knowledge workers is that they are themselves changed by the information they process. (To some extent, this is true of any human being. What distinguishes knowledge workers is that this is their primary motivation and the job they are paid to do)

Allison Kidd, The Marks are on the Knowledge Worker, CHI 1994


Knowledge workers are individuals whose primary job involves working with information, developing knowledge, and making decisions that drive productivity and innovation. […] Knowledge workers are the most valuable assets of the modern economy, contributing to the growth and competitiveness of organizations across various industries

Peter Drucker, Management challenges for the 21st century, 1999


6

Multiple Choice

Quiz: A recent work by the Max-Planck Institute used KGs and LLMs to generate…

  1. A sentient AI that now wants tenure
  2. Paper extraction pipelines
  3. Better grant proposals
  4. Novel research ideas

7

Are the generated research ideas actually interesting and relevant?

  • Humans (110 research group leaders) expressed interest level on generated ideas

  • The manual annotations were then used to predict which generated ideas would be relevant

  • Results: (1) some AI-generated ideas were genuinely compelling, (2) human feedback is crucial for aligning AI outputs with human expectations

X. Gu and M. Krenn: Interesting Scientific Idea Generation using Knowledge Graphs and LLMs: Evaluations with 100 Research Group Leaders, 2024


8

Why Evaluation Matters: From Performance to Human-Centeredness

  • Traditional metrics like accuracy and F1 score are not enough. We need to evaluate how AI affects humans: trust, understanding, and decision-making

  • Where can we find suitable metrics to evaluate AI from a human-centered perspective?

    • XAI literature: e.g. human decision accuracy, fairness, trust, understanding

    • Social sciences!!

      • Subjective evaluation: e.g. perceived quality of results, perceived usefulness, etc.

J. Ma et al.: "OpenHEXAI: An Open-Source Framework for Human-Centered Evaluation of Explainable Machine Learning", 2024

T. Miller, P. Howe, L. Sonenberg: "Explainable AI: Beware of inmates running the asylum or: How I learnt to stop worrying and love the social and behavioural sciences", 2017

T. Miller: "Explanation in artificial intelligence: Insights from the social sciences", 2018


9

Multiple Choice

Quiz: Assessing LLM extraction from text for procedural KG creation, some human evaluators said that…

  1. The LLM results were flawless
  2. The steps were “creatively” ordered
  3. Procedures were sound, but emotionally confusing
  4. They would have done better than the LLM

10

Expected (OR unexpected) results from human evaluation of AI

  • “Fitness for use” (perceived usefulness) is often more important than perceived quality of AI output

  • Pretty high perceived quality of AI output, but still a “prejudice” that humans could do better

  • The “human touch” may still be preferred even when potentially risky

V. Carriero, I. Baroni, M. Scrocca, A. Azzini and I. Celino: Human Evaluation of Procedural Knowledge Graph Extraction from Text with Large Language Models, EKAW, 2024

I. Baroni, G. Re Calegari, D. Scandolari, I. Celino: AI-TAM: a model to investigate user acceptance and collaborative intention in human-in-the-loop AI applications, HCJ, 2022

C. Longoni, A. Bonezzi, C. Morewedge: Resistance To Medical Artificial Intelligence, JCR, 2019

11

Multiple Choice

Quiz: Who do you blame more for a mistake, humans or machines?

  1. It depends on the context
  2. Humans, imperfect by design
  3. Machines, always with bugs
  4. No one, mistakes build character (and datasets)

12

A/B testing on how humans judge machines

Context-dependent preference for “humans” or for “machines”
to execute the same task in the same scenario

Who was perceived to be more at fault for injuring the pedestrian?

  • The human driver

  • The driverless car

C. Hidalgo et al.: How Humans Judge Machines, 2021


Who was perceived to be more at fault for misusing the national flag?

  • The human cleaner

  • The cleaning robot

13

When Do We Believe the “Machine”? e.g. ChatGPT

  • Trust (which is a multidimensional construct) in ChatGPT was influenced by pragmatic factors (usefulness, speed) and hedonic factors (entertainment, novelty)

  • Perceived credibility of ChatGPT answers did NOT correlate with actual correctness

  • Users may trust ChatGPT more for factual or technical tasks, but less for emotional or ethical judgments

M. Huschens et al.: Do You Trust ChatGPT? - Perceived Credibility of Human and AI-Generated Content, 2023

J. Buchanan, W. Hickman: Do people trust humans more than ChatGPT?, JBE, 2024

Y. Jung et al.: Do We Trust ChatGPT as much as Google Search and Wikipedia?, CHI 2024

14

Multiple Choice

Quiz: In which cases does human reliance increase when explanations are added to LLM answers?

  1. Only on correct LLM answers
  2. Only on incorrect LLM answers
  3. On both correct and incorrect LLM answers
  4. Never, explanations just confuse people more

15

AI (LLM) design choices to shape users’ trust

  • Explanations increase reliance - even on incorrect answers! This shows that explanations can be persuasive, regardless of accuracy

  • Sources reduce overreliance on incorrect answers, helping users calibrate their trust

  • Inconsistencies in explanations also reduce reliance, suggesting that users are sensitive to logical coherence

S. Kim et al.: Fostering Appropriate Reliance on Large Language Models: The Role of Explanations, Sources, and Inconsistencies, CHI 2025

another relevant study with similar results:
M. Sadeghi et al.: Explaining the Unexplainable: The Impact of Misleading Explanations on Trust in Unreliable Predictions for Hardly Assessable Tasks, UMAP 2024

16

What’s in an explanation, after all?

  • Explanation as “what the human wants to know”, as opposed to “scientific explanation of the AI model’s internal processing”

  • Human explanations are usually: contrastive (why P and not Q?), selective (not all possible causes, only "relevant" ones), social (dialogue, interaction, iteration)

  • Knowledge Graphs (and Semantic Web technologies at large) can and should have a big role in explanations!

B. Mittelstadt, C. Russell, S. Wachter: Explaining explanations in AI, 2019

S. Chari et al.: Explanation ontology: A model of explanations for user-centered AI, ISWC 2020

F. Lecue: On the role of knowledge graphs in explainable AI, SWJ, 2019

I. Celino: Who is this Explanation for? Human Intelligence and Knowledge Graphs for eXplainable AI, 2020

17

Multiple Choice

Quiz: In human-AI collaboration, what is the relationship between human self-confidence and AI confidence?

  1. Human confidence is higher than AI confidence
  2. Human confidence aligns with AI confidence
  3. Human confidence is lower than AI confidence
  4. Human and AI confidences are not correlated

18

Calibrating confidence, avoiding both over-trust and under-trust

  • Users’ self-confidence tends to align with the AI’s expressed confidence, leading to miscalibrated self-confidence, especially if the AI is overconfident or underconfident --> AI confidence should be carefully calibrated, especially in high-stakes or collaborative contexts

J. Li et al.: As Confidence Aligns: Exploring the Effect of AI Confidence on Human Self-confidence in Human-AI Decision Making, CHI 2025

  • Designers should consider how AI behavior shapes human behavior, not just task outcomes --> greater attention to the cognitive alignment between humans and AI - not just functional alignment

19

Theory of Mind and Social Intelligence

  • Theory of Mind is the ability to attribute mental states - beliefs, intents, desires, emotions, knowledge - to oneself and others and to understand that others have beliefs, desires, and intentions that are different from one's own

D. Premack, G. Woodruff: Does the chimpanzee have a theory of mind?, Behavioral and Brain Sciences, 1978


  • Social intelligence is the ability to reason about others’ beliefs, intentions, and actions

    • Social intelligence is a critical dimension of human intelligence

    • Does AI have social intelligence?

20

Multiple Choice

Quiz: Does AI demonstrate a Theory of Mind (ToM) comparable to humans?

  1. No, AI lacks any social reasoning
  2. Yes, but only at a basic level
  3. Yes, AI shows high-order ToM
  4. Only when prompted with emotional emojis

21

AI Social Intelligence and “Reasoning”

  • Evaluation tasks:

    • Inverse Reasoning (IR): Inferring the beliefs or goals of others based on their actions

    • Inverse Inverse Planning (IIP): A more complex task involving recursive reasoning about others’ reasoning

  • Results:

    • Humans consistently outperformed GPT models across all tasks.

    • GPT models showed only basic (order-0) social reasoning, while humans demonstrated higher-order (≥2) reasoning

    • LLMs often relied on pattern recognition shortcuts rather than genuine social inference

J. Wang et al.: Evaluating and Modeling Social Intelligence: A Comparative Study of Human and AI Capabilities, 2024

another interesting work:
G. Riva et al.: Psychomatics - A Multidisciplinary Framework for Understanding Artificial Minds, 2024


22

Multiple Choice

Quiz: When do Knowledge Workers increase their Critical Thinking?

  1. When the AI makes obvious mistakes
  2. When they trust the AI
  3. When they trust their own judgment
  4. After their third coffee and a motivational quote

23

AI and Critical Thinking

  • GenAI shifts critical thinking toward information verification and response integration

  • Higher confidence in GenAI is associated with less critical thinking

  • Higher self-confidence is associated with more critical thinking

H. Lee et al.: The Impact of Generative AI on Critical Thinking: Self-Reported Reductions in Cognitive Effort and Confidence Effects From a Survey of Knowledge Workers, CHI 2025

24

Multiple Choice

Quiz: In a recent questionnaire, after testing AI tools in their daily job, industry workers declared that...

  1. They have a high trust in AI
  2. They are not confident they will learn to use AI
  3. They find the cognitive load to use AI very low
  4. They are scared their job will be replaced by AI

25

AI, human factors and ethical principles

  • Ethical principles, guidelines and regulations (e.g. the AI Act, the HLEG Ethics Guidelines for Trustworthy AI) call for a careful assessment of human factors in the adoption of AI solutions!

  • Fear of job replacement by AI, low trust in AI, and increased cognitive load are real issues to take into account when designing AI tools

  • The large popularity and low entry barrier of tools like ChatGPT give industry employees the perception that AI is easy to learn

A. Azzini, I. Baroni, I. Celino: Assessing human factors in AI adoption by employees: a composite questionnaire for subjective user evaluation, TCAI@HHAI 2025

26

Human-AI Collaboration!!!

H. Li et al.: Why is AI not a Panacea for Data Workers? An Interview Study on Human-AI Collaboration in Data Storytelling, TVCG 2025


27

Take-home messages

  1. Always ask yourself who will use your AI system
    (and design it with the target users in mind)

  2. Always perform human evaluation of AI systems!
    (even only at qualitative, small-scale level)

  3. Whenever possible, design human-in-the-loop AI systems
    (the future is in human-AI collaboration, getting the best of both)

  Bonus point: Always challenge the way you are using AI!
    (video: How Stanford Teaches AI-Powered Creativity, https://www.youtube.com/watch?v=wv779vmyPVY)

28

Multiple Choice

Final quiz: In preparing this tutorial, Irene did NOT rely on AI for…

  1. Coming up with wrong quiz answers
  2. Picking the images
  3. Polishing her phrasing
  4. Crafting the tutorial’s storyline

29

Irene Celino - irene.celino@cefriel.com
Cefriel - viale Sarca 226, 20126 Milano - Italy

(images from Unsplash)

Thank you for your participation!
