Lead Data Quality Engineer

1100 Epiq eDiscovery Solutions, Inc.
Washington, US
Remote

Job Description

At Epiq, your work contributes to complex, global legal outcomes. You’ll join a values‑driven community where integrity guides decisions, relentless service sets the bar, and we thrive on big challenges together. We invest in your growth with enterprise‑wide learning and mobility. We celebrate who you are, and we respect life beyond work with flexibility that’s recognized externally. Enabled by modern platforms and AI, you’ll do the most meaningful work of your career and see your impact at scale.

Job Summary: Responsible for overseeing and driving the development and success of a product throughout its lifecycle. This role prepares and manages the data that the Copilot relies on. The Data Quality Engineer’s mission is to ensure the AI always has access to accurate and up-to-date information. They build pipelines to collect and update the knowledge base (documents, FAQs, databases) that the AI uses, and enforce data quality standards so the AI’s answers are based on solid data. This role collaborates closely with the LLM Strategist to provide training/evaluation datasets, and with the Solutions Engineer to integrate these data pipelines into the overall system.

Key Responsibilities:

Data Pipeline Development: Create and maintain ETL (Extract-Transform-Load) processes that gather data from various enterprise sources into the Copilot’s knowledge repository. For example, develop a pipeline to extract policy documents from SharePoint or a document management system, transform or index them (perhaps splitting into chunks, encoding as vectors, or populating a search index), and load them into a format the AI can use for retrieval. Use tools like Azure Data Factory, Logic Apps, or custom Python scripts to schedule regular updates (e.g., sync new or edited documents nightly).

Data Integration & Indexing: Implement the data storage/indexing solutions that the AI will query at runtime.
This could involve setting up an Azure Cognitive Search index or a vector database (for semantic search of text) and feeding it with processed data. Ensure that for each type of data (policies, past Q&As, regulations), the relevant fields (metadata, embeddings, etc.) are properly stored for efficient retrieval. Work with the Solutions Engineer to connect these stores to the AI application (e.g., via APIs or SDKs).

Quality Assurance & Cleansing: Establish data quality checks at each step of the pipeline. Deduplicate records, ensure consistent formatting (e.g., all dates in a standard format, text is cleaned of strange characters), and filter out irrelevant content. If integrating data from multiple sources, resolve conflicts or overlaps (e.g., if two sources have a definition for a term, determine which one is authoritative or how to consolidate them). Use techniques like sampling and validation scripts to verify that the data loaded is correct and complete (for instance, compare record counts or hash sums to make sure nothing was missed).

Data Update & Monitoring: Monitor the freshness of data. Set up alerts or reports for data pipeline failures so they can be fixed before users notice stale info. For example, if a nightly update fails and some new documents weren’t indexed, have a way to catch that (via logs or a monitoring dashboard) and rerun the job. Also, design pipelines to be idempotent and recoverable (e.g., if a run is interrupted, it can pick up or safely restart). Coordinate with content owners in the company for any major data changes (if a new data source is added or an old one decommissioned, adjust pipelines accordingly).

Support Model Training Data Needs: When the LLM Strategist needs curated datasets for fine-tuning or testing the AI, assist in assembling those.
For instance, extract historical customer questions and answers from a database to create a training file, or gather a set of paragraphs labeled as relevant/irrelevant to train a classifier. Ensure any data used for model training is cleaned and formatted to the requirements of the ML process. Keep versioned copies of these datasets as they may be needed for future reference or re-training.

Data Governance & Security: Handle all data in accordance with company policies and regulatory requirements. Ensure sensitive data is protected (e.g., if certain documents are confidential, ensure access controls are in place and that the AI either doesn’t index them or is restricted from exposing them in answers). Work with IT/security on any data handling reviews. Maintain documentation of data sources, data flow diagrams, and data dictionaries so it’s clear where information is coming from and how it’s transformed; this transparency aids compliance checks and team understanding.

Collaboration:

With LLM Strategist: Share insights about data coverage and limitations. For example, inform them if certain topics have very few documents, which might affect the AI’s knowledge. Get requirements for what training data is needed and d

Skills & Requirements

Technical Skills

ETL processes, Azure Data Factory, Logic Apps, Custom Python scripts, Azure Cognitive Search, Vector database, Data quality checks, Data integration, Indexing, Data update, Monitoring, Data handling reviews, Documentation, Collaboration, Teamwork, Communication, Problem-solving, Data quality, AI, LLM

Employment Type

FULL TIME

Level

Lead

Posted

4/19/2026
