I assess AI-generated responses with a linguist's eye — identifying biases, verifying factual accuracy, and delivering structured feedback that helps language models improve.
About
With a degree in Languages and a Master's in Anglophone Literature, I bring deep linguistic expertise to the evaluation of AI systems. I have worked across multiple platforms assessing model outputs, annotating training data, and ensuring quality at every step of the pipeline.
Laurea (Italian BA) in Languages; Master's thesis in Anglophone Literature. Native Italian, fluent English. Experienced in translation quality assessment and localization QA.
3+ years evaluating LLM responses across platforms including Outlier AI, Appen, Centific, and WeLo. Specialized in bias detection, factual accuracy, and structured feedback.
Bug reporting and web/app testing via TestIO and BetaTesting. Detailed, reproducible reports. Comfortable with developer tools, device testing, and structured QA workflows.
Web development fundamentals (HTML, CSS, JavaScript, SQL) via Mimo. Building technical skills to complement linguistic expertise for senior evaluation roles.
Skills
A combination of linguistic precision and methodological rigour.
Accuracy, relevance, safety, tone, and completeness assessment
Authority, fluency, length, and personal experience biases
Clear, timestamped, actionable feedback for model improvement
Italian localization review, translation quality assessment
Structured, reproducible reports for web and mobile apps
TTS evaluation, audio tagging, and Italian transcription
Projects
Concrete examples of my evaluation methodology and output quality.
Evaluation Framework
A structured scoring rubric for assessing AI evaluator competencies. Built from first principles, it covers five observable dimensions — each with a 1–4 scale and concrete anchor behaviours.
| # | Criterion | What it measures |
|---|---|---|
| 1 | Bias Identification | Names ≥2 biases + countermeasure |
| 2 | Guideline Adherence | Cites criterion + justifies score |
| 3 | Feedback Quality | Timestamps + issue + solution |
| 4 | Score Calibration | Consistency + justified differences |
| 5 | Metacognition | Self-awareness of personal biases |
| Score | Label | Description (Criterion 1 example) |
|---|---|---|
| 4 | Exemplary | Names ≥2 biases, explains impact, provides ≥1 countermeasure with concrete example |
| 3 | Advanced | Names ≥2 biases, explains impact, but countermeasure is inadequate or absent |
| 2 | Beginner | Names only 1 bias or does not explain its impact on the evaluation |
| 1 | Inadequate | Does not name specific biases, vague or absent response |
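To make the rubric concrete, here is a minimal sketch of how it could be encoded for automated tallying. The type names and the `overallScore` helper are illustrative assumptions, not production tooling:

```ts
// Hypothetical encoding of the rubric above; all names are illustrative.

type Score = 1 | 2 | 3 | 4;

interface Criterion {
  id: number;
  name: string;
  measures: string;
}

// The five observable dimensions from the first table.
const CRITERIA: Criterion[] = [
  { id: 1, name: "Bias Identification", measures: "Names ≥2 biases + countermeasure" },
  { id: 2, name: "Guideline Adherence", measures: "Cites criterion + justifies score" },
  { id: 3, name: "Feedback Quality", measures: "Timestamps + issue + solution" },
  { id: 4, name: "Score Calibration", measures: "Consistency + justified differences" },
  { id: 5, name: "Metacognition", measures: "Self-awareness of personal biases" },
];

// Labels for the 1–4 scale, mirroring the second table.
const LABELS: Record<Score, string> = {
  4: "Exemplary",
  3: "Advanced",
  2: "Beginner",
  1: "Inadequate",
};

// Average the per-criterion scores into a single summary figure.
function overallScore(scores: Record<number, Score>): number {
  const values = CRITERIA.map((c) => scores[c.id]);
  return values.reduce((sum, s) => sum + s, 0) / values.length;
}

// Example: strong on bias identification, weaker on calibration.
const sample: Record<number, Score> = { 1: 4, 2: 3, 3: 3, 4: 2, 5: 3 };
console.log(`Overall: ${overallScore(sample)} (${LABELS[3]})`); // Overall: 3 (Advanced)
```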
Feedback Samples
Examples of structured evaluator feedback on AI-generated responses. Each piece of feedback cites specific text with timestamps, identifies the issue, and proposes a concrete improvement — following the methodology outlined in the rubric above.
Issue: The model states that "the population of Rome is approximately 4 million" without qualifying the source or date. This is factually imprecise — the city proper has approximately 2.8 million residents (ISTAT 2023).
Suggested fix: Replace with "approximately 2.8 million (ISTAT 2023)" or acknowledge uncertainty with "estimates vary between 2.8M and 4.3M depending on metropolitan area definition."
Issue: Length bias risk — this section is 340 words but adds no new information beyond what was stated in the first 80 words. An evaluator may incorrectly interpret verbosity as depth.
Suggested fix: Condense to 80–100 words, keeping only the two core arguments. This also improves fluency and clarity scores.
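For illustration, this kind of feedback could also be captured as a structured record. The `FeedbackEntry` fields below are an assumption about one reasonable schema, not an actual platform format:

```ts
// Hypothetical feedback record; field names are illustrative assumptions.
interface FeedbackEntry {
  timestamp: string;    // location in the response, e.g. "00:42" or a paragraph reference
  quote: string;        // the exact text being flagged
  issue: string;        // what is wrong and why (factual error, bias risk, etc.)
  suggestedFix: string; // a concrete, actionable replacement
}

// The first sample above, expressed as a record (the timestamp is invented).
const romeEntry: FeedbackEntry = {
  timestamp: "00:42",
  quote: "the population of Rome is approximately 4 million",
  issue: "Factually imprecise: the city proper has ~2.8 million residents (ISTAT 2023).",
  suggestedFix: 'Replace with "approximately 2.8 million (ISTAT 2023)" or acknowledge the uncertainty.',
};
```

Keeping feedback in a fixed shape like this makes it straightforward to audit against the rubric's Feedback Quality criterion: every entry must carry a timestamp, an issue, and a solution.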
Experience
3+ years across AI training, linguistic evaluation, and quality assurance.
LLM response assessment, prompt evaluation, written feedback for model training
Italian text annotation, translation quality assessment, localization QA
Web and mobile app testing, bug reporting, exploratory and structured test cycles
TTS evaluation, Italian voice recording, audio tagging and transcription
Intercultural mediation, community linguistic support in Sardinia
I'm available for AI evaluation contracts, linguistic QA projects, and data annotation roles — remote, full-time or part-time.