Forage AI Careers!
Data Pipeline Engineer – Web Services, Web Crawling, ETL, NLP (spaCy/LLM), AWS
Technology
Full Time
Remote
About Forage AI:
Forage AI is a pioneering AI-powered data extraction and automation company that transforms complex, unstructured web and document data into clean, structured intelligence. Our platform combines web crawling, NLP, LLMs, and agentic AI to deliver highly accurate firmographic and enterprise insights across numerous domains. Trusted by global clients in finance, real estate, and healthcare, Forage AI enables businesses to automate workflows, reduce manual rework, and access high-quality data at scale.
About the Role:
We are seeking a Data Pipeline Engineer to develop, optimize, and maintain production-grade data pipelines focused on web data extraction and ETL workflows. This is a hands-on role requiring strong experience with Python (as the primary programming language), spaCy, LLMs, web crawling, and cloud deployment in containerized environments. You’ll have opportunities to propose, experiment with, and implement GenAI-driven approaches, innovative automations, and new strategies as part of our product and pipeline evolution. Candidates should have 5–8 years of relevant experience in data engineering, software engineering, or related fields.
Key Responsibilities:
- Design, build, and manage scalable pipelines for ingesting, processing, and storing web and API data.
- Develop robust web crawlers and scrapers in Python (Scrapy, lxml, Playwright) for structured and unstructured data.
- Create and monitor ETL workflows for data cleansing, transformation, and loading into PostgreSQL and MongoDB.
- Apply spaCy for NLP tasks and integrate/fine-tune modern LLMs for analytics.
- Drive GenAI-based innovation and automation in core data workflows.
- Develop and deploy secure REST APIs and web services for data access and interoperability.
- Integrate RabbitMQ, Kafka, SQS (for distributed queueing), and Redis (for caching) into data workflows, along with distributed task queue tools such as Celery and TaskIQ.
- Containerize and deploy solutions using Docker on AWS (EC2, ECS, Lambda).
- Collaborate with data teams, maintain pipeline documentation, and enforce data quality standards.
- Maintain and enhance legacy in-house applications as required.
Technical Skills & Requirements:
- Primary programming language is Python; must have experience writing independent Python packages.
- Experience with multithreading and asynchronous programming in Python.
- Advanced Python skills, including web crawling (Scrapy, lxml, Playwright) and strong SQL/data-handling abilities.
- Experience with PostgreSQL (SQL) and MongoDB (NoSQL).
- Proficient with workflow orchestration tools such as Airflow.
- Hands-on experience with RabbitMQ, Kafka, SQS (for queueing/distributed processing), and Redis (for caching).
- Practical experience with spaCy for NLP and integration of at least one LLM platform (OpenAI, Hugging Face, etc.).
- Experience with GenAI/LLMs, prompt engineering, or integrating GenAI features into data products.
- Proficiency with Docker and AWS services (EC2, ECS, Lambda).
- Experienced in developing secure, scalable REST APIs using FastAPI and/or Flask.
- Familiarity with third-party API integration, including authentication, data handling, and rate limiting.
- Proficient in using Git for version control and collaboration.
- Strong analytical, problem-solving, and documentation skills.
- Bachelor’s or Master’s degree in Computer Science or a related field.
What We Offer:
- High ownership and autonomy in shaping technical solutions and system architecture.
- Opportunities to learn modern technologies and propose technical initiatives, including GenAI-based approaches.
- Collaborative, supportive, and growth-oriented engineering culture.
- Exposure to a broad set of business and technical problems.
- Structured onboarding and domain training.
- Work-from-home infrastructure.
Infrastructure Requirements:
Since this is a fully remote, work-from-home position, you will also need the following:
- Business-grade computer (modern i7/i9 processor, 16 GB+ RAM) with no performance bottlenecks.
- Reliable high-speed internet for video calls and remote work.
- Quality headphones and camera for clear audio and video.
- Stable power supply and backup options in case of outages.
Apply Now