I assess AI-generated responses with a linguist's eye — identifying biases, verifying factual accuracy, and delivering structured feedback that helps language models improve.
About
With a degree in Languages and a Master's in Anglophone Literature, I bring deep linguistic expertise to the evaluation of AI systems. I have worked across multiple platforms assessing model outputs, annotating training data, and ensuring quality at every step of the pipeline.
Laurea (Italian BA) in Languages; Master's thesis in Anglophone Literature. Native Italian, fluent English. Experienced in translation quality assessment and localization QA.
3+ years evaluating LLM responses across platforms including Outlier AI, Appen, Centific, and WeLo. Specialized in bias detection, factual accuracy, and structured feedback.
Bug reporting and web/app testing via TestIO and BetaTesting. Detailed, reproducible reports. Comfortable with developer tools, device testing, and structured QA workflows.
Web development fundamentals (HTML, CSS, JavaScript, SQL) via Mimo. Building technical skills to complement linguistic expertise for senior evaluation roles.
Skills
A combination of linguistic precision and methodological rigour.
Accuracy, relevance, safety, tone, and completeness assessment
Authority, fluency, length, and personal experience biases
Clear, timestamped, actionable feedback for model improvement
Italian localization review, translation quality assessment
Structured, reproducible reports for web and mobile apps
TTS evaluation, audio tagging, and Italian transcription
Projects
Concrete examples of my evaluation methodology and output quality.
Evaluation Framework
A structured scoring rubric for assessing AI evaluator competencies. Built from first principles, it covers five observable dimensions — each with a 1–4 scale and concrete anchor behaviours.
| # | Criterion | What it measures |
|---|---|---|
| 1 | Bias Identification | Names ≥2 biases + countermeasure |
| 2 | Guideline Adherence | Cites criterion + justifies score |
| 3 | Feedback Quality | Timestamps + issue + solution |
| 4 | Score Calibration | Consistency + justified differences |
| 5 | Metacognition | Self-awareness of personal biases |
| Score | Label | Description (Criterion 1 example) |
|---|---|---|
| 4 | Exemplary | Names ≥2 biases, explains impact, provides ≥1 countermeasure with concrete example |
| 3 | Advanced | Names ≥2 biases, explains impact, but countermeasure is inadequate or absent |
| 2 | Beginner | Names only 1 bias or does not explain its impact on the evaluation |
| 1 | Inadequate | Does not name specific biases, vague or absent response |
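To make the rubric concrete, here is a minimal sketch of how it could be encoded for automated tallying. The type names and the `overallScore` helper are illustrative assumptions, not production tooling:

```ts
// Hypothetical encoding of the rubric above; all names are illustrative.

type Score = 1 | 2 | 3 | 4;

interface Criterion {
  id: number;
  name: string;
  measures: string;
}

// The five observable dimensions from the first table.
const CRITERIA: Criterion[] = [
  { id: 1, name: "Bias Identification", measures: "Names ≥2 biases + countermeasure" },
  { id: 2, name: "Guideline Adherence", measures: "Cites criterion + justifies score" },
  { id: 3, name: "Feedback Quality", measures: "Timestamps + issue + solution" },
  { id: 4, name: "Score Calibration", measures: "Consistency + justified differences" },
  { id: 5, name: "Metacognition", measures: "Self-awareness of personal biases" },
];

// Labels for the 1–4 scale, mirroring the second table.
const LABELS: Record<Score, string> = {
  4: "Exemplary",
  3: "Advanced",
  2: "Beginner",
  1: "Inadequate",
};

// Average the per-criterion scores into a single summary figure.
function overallScore(scores: Record<number, Score>): number {
  const values = CRITERIA.map((c) => scores[c.id]);
  return values.reduce((sum, s) => sum + s, 0) / values.length;
}

// Example: strong on bias identification, weaker on calibration.
const sample: Record<number, Score> = { 1: 4, 2: 3, 3: 3, 4: 2, 5: 3 };
console.log(`Overall: ${overallScore(sample)} (${LABELS[3]})`); // Overall: 3 (Advanced)
```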
Feedback Samples
Examples of structured evaluator feedback on AI-generated responses. Each piece of feedback cites specific text with timestamps, identifies the issue, and proposes a concrete improvement — following the methodology outlined in the rubric above.
Issue: The model states that "the population of Rome is approximately 4 million" without qualifying the source or date. This is factually imprecise — the city proper has approximately 2.8 million residents (ISTAT 2023).
Suggested fix: Replace with "approximately 2.8 million (ISTAT 2023)" or acknowledge uncertainty with "estimates vary between 2.8M and 4.3M depending on metropolitan area definition."
Issue: Length bias risk — this section is 340 words but adds no new information beyond what was stated in the first 80 words. An evaluator may incorrectly interpret verbosity as depth.
Suggested fix: Condense to 80–100 words, keeping only the two core arguments. This also improves fluency and clarity scores.
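For illustration, this kind of feedback could also be captured as a structured record. The `FeedbackEntry` fields below are an assumption about one reasonable schema, not an actual platform format:

```ts
// Hypothetical feedback record; field names are illustrative assumptions.
interface FeedbackEntry {
  timestamp: string;    // location in the response, e.g. "00:42" or a paragraph reference
  quote: string;        // the exact text being flagged
  issue: string;        // what is wrong and why (factual error, bias risk, etc.)
  suggestedFix: string; // a concrete, actionable replacement
}

// The first sample above, expressed as a record (the timestamp is invented).
const romeEntry: FeedbackEntry = {
  timestamp: "00:42",
  quote: "the population of Rome is approximately 4 million",
  issue: "Factually imprecise: the city proper has ~2.8 million residents (ISTAT 2023).",
  suggestedFix: 'Replace with "approximately 2.8 million (ISTAT 2023)" or acknowledge the uncertainty.',
};
```

Keeping feedback in a fixed shape like this makes it straightforward to audit against the rubric's Feedback Quality criterion: every entry must carry a timestamp, an issue, and a solution.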
Experience
3+ years across AI training, linguistic evaluation, and quality assurance.
LLM response assessment, prompt evaluation, written feedback for model training
Italian text annotation, translation quality assessment, localization QA
Web and mobile app testing, bug reporting, exploratory and structured test cycles
TTS evaluation, Italian voice recording, audio tagging and transcription
Intercultural mediation, community linguistic support in Sardinia
I'm available for AI evaluation contracts, linguistic QA projects, and data annotation roles — remote, full-time or part-time.