
LLM Evaluator & RLHF Data Specialist at Surge AI, adept at analytical reasoning and data interpretation. Authored high-fidelity failure examples and refined agent behavior to improve tool-use efficiency. Proven track record in performance assessment and project evaluation in complex environments.
Agentic AI Training & Golden Trajectories: Authored and edited multi-step “golden” trajectories and agent behavior for tool-using agents in Surge AI’s Corecraft simulated workplace and other custom-built enterprise environments. Audited tool choice, sequencing, termination behavior, and adherence to complex system/developer instructions; refined suboptimal tool-call plans to improve tool-use efficiency while preserving outcome correctness.
Adversarial / Naturalistic Domain Evaluation (Legal & STEM): Leveraged STEM and legal qualifications to engineer challenge prompts designed to surface failures in robust models (e.g., Claude Sonnet-tier and Gemini Pro-tier models), including complex multimodal prompts that test domain-specific STEM reasoning. Documented jurisdiction-sensitive hallucinations and legal reasoning errors to create high-fidelity failure examples; authored corrected “gold” responses and scoring rationales aligned to strict rubric sets used for downstream training.
Synthetic Data & Sandbox Population: Constructed realistic user personas and scenarios to populate enterprise sandbox environments with simulated datasets (Google Workspace, Slack, Jira, Shopify, WhatsApp-style workflows) for authentic agent testing. Converted large volumes of unstructured natural-language input into strict JSON-formatted datasets to support logical-consistency and tool-use evaluation.
Quality Assurance & Scoring Guidelines: Selected for high-priority Rate & Review workflows due to consistent accuracy. Wrote and maintained scoring guidelines/rubrics for new project types, defining ground-truth standards for agentic behaviors and instruction-following.
Complex Instruction Adherence: Executed tasks governed by evolving guideline documents (including long-form docs updated weekly), maintaining accuracy through frequent project updates and rapid spec changes.