Lucas Bandarkar
UCLA — Ph.D. Student
Machine Learning, Natural Language Processing
Summary
I'm a second-year A.I. Ph.D. student in the Computer Science department at UCLA. I'm advised by Nanyun (Violet) Peng in the PLUS Lab and study multilinguality in LLMs.
Before this, I spent over two years as a research data scientist at Meta/Facebook AI working on large-scale multilingual NLP. There, I generally focused on model evaluation, resource creation & data annotation, and global language strategy for a suite of production models such as machine translation, language identification, and text embeddings. Notably, I led the development of the Belebele dataset (GitHub, HuggingFace), which has over a million downloads. During my undergrad at UC Berkeley, I worked in Marti Hearst's NLP lab under the mentorship of Philippe Laban.
Research Interests
multi-/cross-lingual text representations: model interpretability, language adaptation & cross-lingual transfer, modular & "language-agnostic" representations, tokenization & vocabulary
multilingual evaluation: data annotation & resource creation, embeddings evaluation, translation evaluation
multilingual training data: data quality evaluation & filtering, language identification, data balancing
applications: multilingual embeddings, LLM language adaptation, LMs in low-resource languages, language identification, machine translation
Employment
Research Scientist Intern, Meta AI
Jun 2024 - Sep 2024
LLM cross-lingual transfer
Research Data Scientist, Meta AI
Aug 2021 - Sep 2023
(Data Scientist, Aug 2021 - Nov 2022)
machine translation, language identification, multilingual text embeddings, multilingual optical character recognition, Arabic dialect identification, machine translation for human content review & automated moderation
Data Scientist Intern, Meta AI
May 2020 - Aug 2020
optical character recognition
Education
(in progress) Ph.D. in Computer Science, UCLA
Sep 2023 - current
B.A. in Statistics, Data Science, UC Berkeley
Aug 2017 - May 2021