Forage AI Careers!

Data Pipeline Engineer – Web Services, Web Crawling, ETL, NLP (spaCy/LLM), AWS

Technology | Full Time | Remote

Experience: 5-8 years of relevant experience in data engineering
Qualification: Bachelor’s or Master’s degree in Computer Science or a related field
Offered Salary: Based on experience
Posted On: 6 October, 2025
Valid Till: 6 November, 2025

About Forage AI:

Forage AI is a pioneering AI-powered data extraction and automation company that transforms complex, unstructured web and document data into clean, structured intelligence. Our platform combines web crawling, NLP, LLMs, and agentic AI to deliver highly accurate firmographic and enterprise insights across numerous domains. Trusted by global clients in finance, real estate, and healthcare, Forage AI enables businesses to automate workflows, reduce manual rework, and access high-quality data at scale.

About the Role:

We are seeking a Data Pipeline Engineer to develop, optimize, and maintain production-grade data pipelines focused on web data extraction and ETL workflows. This is a hands-on role requiring strong experience with Python (as the primary programming language), spaCy, LLMs, web crawling, and cloud deployment in containerized environments. You’ll have opportunities to propose, experiment with, and implement GenAI-driven approaches, innovative automations, and new strategies as part of our product and pipeline evolution. Candidates should have 5-8 years of relevant experience in data engineering, software engineering, or related fields.

Key Responsibilities:

  • Design, build, and manage scalable pipelines for ingesting, processing, and storing web and API data.
  • Develop robust web crawlers and scrapers in Python (Scrapy, lxml, Playwright) for structured and unstructured data (minimal sketches follow this list).
  • Create and monitor ETL workflows for data cleansing, transformation, and loading into PostgreSQL and MongoDB.
  • Apply spaCy for NLP tasks and integrate/fine-tune modern LLMs for analytics.
  • Drive GenAI-based innovation and automation in core data workflows.
  • Develop and deploy secure REST APIs and web services for data access and interoperability.
  • Integrate RabbitMQ, Kafka, and SQS (for distributed queueing) and Redis (for caching) into data workflows, using distributed task tools such as Celery and TaskIQ.
  • Containerize and deploy solutions using Docker on AWS (EC2, ECS, Lambda).
  • Collaborate with data teams, maintain pipeline documentation, and enforce data quality standards.
  • Maintain and enhance legacy in-house applications as required.
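
To make the crawling and NLP work concrete, here is a minimal sketch of the kind of Playwright + lxml extraction step this role covers. The URL, selectors, and field names are illustrative assumptions only, not part of any actual Forage AI pipeline:

    # Minimal crawl-and-extract sketch (hypothetical URL and selectors).
    from lxml import html
    from playwright.sync_api import sync_playwright

    def fetch_rendered_html(url: str) -> str:
        # Render a JavaScript-heavy page and return the final HTML.
        with sync_playwright() as p:
            browser = p.chromium.launch()
            page = browser.new_page()
            page.goto(url, wait_until="networkidle")
            content = page.content()
            browser.close()
            return content

    def extract_companies(page_html: str) -> list[dict]:
        # Parse the rendered HTML and pull out structured records.
        tree = html.fromstring(page_html)
        return [
            {
                "name": card.xpath("string(.//h2)").strip(),
                "website": card.xpath("string(.//a/@href)"),
            }
            for card in tree.xpath("//div[@class='company-card']")
        ]

    if __name__ == "__main__":
        for record in extract_companies(fetch_rendered_html("https://example.com/companies")):
            print(record)

The scraped text could then feed an NLP step. A minimal spaCy example, assuming the en_core_web_sm model is installed:

    # Hypothetical NLP step: tag organization names in scraped text.
    # Assumes the en_core_web_sm model has been downloaded beforehand.
    import spacy

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("Acme Corp announced a partnership with Globex Inc.")
    print([ent.text for ent in doc.ents if ent.label_ == "ORG"])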

Technical Skills & Requirements:

  • Python as the primary programming language, including experience writing independent Python packages.
  • Experience with multithreading and asynchronous programming in Python.
  • Advanced Python skills, including web crawling (Scrapy, lxml, Playwright) and strong SQL/data-handling abilities.
  • Experience with PostgreSQL (SQL) and MongoDB (NoSQL).
  • Proficiency with workflow orchestration tools such as Airflow.
  • Hands-on experience with RabbitMQ, Kafka, and SQS (for queueing/distributed processing) and Redis (for caching).
  • Practical experience with spaCy for NLP and integration of at least one LLM platform (OpenAI, HuggingFace, etc.).
  • Experience with GenAI/LLMs, prompt engineering, or integrating GenAI features into data products.
  • Proficiency with Docker and AWS services (EC2, ECS, Lambda).
  • Experience developing secure, scalable REST APIs using FastAPI and/or Flask (see the sketch after this list).
  • Familiarity with third-party API integration, including authentication, data handling, and rate limiting.
  • Proficiency with Git for version control and collaboration.
  • Strong analytical, problem-solving, and documentation skills.
  • Bachelor’s or Master’s degree in Computer Science or a related field.
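
As an illustration of the secure REST API work mentioned above, here is a minimal FastAPI sketch with API-key authentication. The header name, key source, and endpoint are illustrative assumptions, not an actual Forage AI service:

    # Minimal secure-API sketch: every request must carry a valid X-API-Key.
    import os

    from fastapi import Depends, FastAPI, Header, HTTPException

    app = FastAPI(title="Example data-access API")

    def require_api_key(x_api_key: str = Header(...)) -> None:
        # Compare the client's header against a secret held in the environment.
        expected = os.environ.get("API_KEY", "")
        if not expected or x_api_key != expected:
            raise HTTPException(status_code=401, detail="Invalid API key")

    @app.get("/records/{record_id}", dependencies=[Depends(require_api_key)])
    def get_record(record_id: int) -> dict:
        # Stubbed lookup; a real service would query PostgreSQL or MongoDB here.
        return {"id": record_id, "status": "ok"}

    # Run locally with: uvicorn main:app --reload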

What We Offer:

  • High ownership and autonomy in shaping technical solutions and system architecture.
  • Opportunities to learn modern technologies and propose technical initiatives, including GenAI-based approaches.
  • Collaborative, supportive, and growth-oriented engineering culture.
  • Exposure to a broad set of business and technical problems.
  • Structured onboarding and domain training.
  • Work-from-Home Infrastructure.

Infrastructure Requirements:

Since this is a completely work-from-home position, you will also need the following:

  • A business-grade computer (modern i7/i9-class processor, 16 GB+ RAM) with no performance bottlenecks.
  • Reliable high-speed internet for video calls and remote work.
  • Quality headphones and camera for clear audio and video.
  • A stable power supply, with backup options in case of outages.

Apply Now