Web Crawling Engineer

Requirements

  • Proficiency in Go (Golang), Rust, or Zig for building scalable and efficient web crawlers
  • Deep understanding of TCP, UDP, TLS, and HTTP (HTTP/1.1, HTTP/2, and HTTP/3), and of web communication in general
  • Knowledge of HTML, CSS, and JavaScript for parsing and navigating web content
  • Familiarity with cloud platforms (AWS, GCP), orchestration (Kubernetes, Nomad), and containerization (Docker) for deployment
  • Mastery of queues, stacks, hash maps, and other data structures for efficient data handling
  • Ability to design and optimize algorithms for large-scale web crawling
  • Hands-on experience with networking and web scraping libraries
  • Understanding of how search engines work and best practices for web crawling optimization
  • Experience with SQL and/or NoSQL databases (knowing Aerospike is a bonus) for storing and managing crawled data
  • Familiarity with data warehousing and scalable storage solutions
  • Knowledge of distributed systems (e.g., Hadoop, Spark) for processing large datasets

Nice-to-Haves

  • Experience with web archiving projects and tooling; open-source archiving is a big plus!
  • Experience applying Machine Learning to improve crawling efficiency or accuracy
  • Experience with low-level networking programming and/or userspace TCP/IP stacks

What you will do

  • Develop and maintain web crawlers in Go to extract data from target websites
  • Use headless browsing techniques, such as Chrome DevTools, to automate and optimize data collection processes
  • Collaborate with cross-functional teams to identify, scrape, and integrate data from APIs and web pages to support business objectives
  • Create and implement efficient parsing patterns using tokenizers, regular expressions, XPaths, and CSS selectors to ensure accurate data extraction
  • Design and manage distributed job queues using technologies such as Redis, Aerospike, and Kubernetes to handle large-scale distributed crawling and processing tasks
  • Develop strategies to monitor and ensure data quality, accuracy, and integrity throughout the crawling and indexing process
  • Continuously improve and optimize existing web crawling infrastructure to maximize efficiency and adapt to new challenges

Perks and Benefits

  • Dynamic and collaborative team passionate about AI
  • Opportunity to work with cutting-edge models, products, and solutions
  • Distributed workforce in multiple countries
  • Pioneering company shaping the future of AI
  • Competitive environment driving innovation
  • Comprehensive AI platform for enterprise needs

Mistral AI

France, Spain, Germany, UK, Switzerland, Netherlands, Luxembourg

Experience: Senior
Posted: July 12, 2024
Last seen: 2 hours ago
AWS
Docker
GCP
Golang
JavaScript
Kubernetes
Redis
Rust
Backend

Why we track Mistral AI

Mistral AI is a Paris-based AI company building frontier large language models. Founded by former DeepMind and Meta researchers, they're one of Europe's most important AI companies. If you want to work on cutting-edge AI research and infrastructure in Europe, Mistral is the one to watch.
