In a recent project I had to extract data from a website that did not offer any API access. To make matters worse, the data gets loaded via JS, so trying to automate python-requests would be a huge pain in the butt. If you have come across the same problem, you have probably also stumbled upon Selenium.
While the official docs can be pretty bad at times, especially if you're just starting out and have no idea which modules you need to import, there are lots of Stack Overflow threads that will get you going.
But there was one problem that I could not find anything written about, which made me write this post.
I was planning to let my webcrawler run on a headless Ubuntu server (environmental choices were not in my domain). It didn't seem like a huge problem, since in my tests I was using the --headless option for Firefox. I believed this would make it easy to run on an actual headless machine too. (read more about headless firefox)
As a first measure I installed Xvfb and enabled a virtual display on the server.
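In case you want to try the same route, the Xvfb setup boils down to something like this (assuming a Debian/Ubuntu box; display number `:99` and the screen geometry are arbitrary choices):

```shell
# Install the virtual framebuffer and start a display nothing is attached to
sudo apt-get install -y xvfb
Xvfb :99 -screen 0 1920x1080x24 &

# Point X clients (like the browser) at the virtual display
export DISPLAY=:99
```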
But then I was facing an issue with the installation of Firefox on a headless machine. Firefox was pulling in certain GTK modules, and even once everything got installed, the instantiation could not be completed. Pulling the standalone geckodriver did not work either, resulting in WebDriver errors.
As a last measure I tested Chrome's compatibility. And I made a surprising discovery! Chrome started up fine, without any trouble. (Chrome download for .deb package)
There are some minor differences I discovered in how Firefox and Chrome handle tab management. For example, a page element from Firefox could be stored and accessed even when the page itself was no longer in focus. Chrome doesn't allow that, which resulted in a quick work-around with separate list elements.
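The work-around could look something like this (a hypothetical sketch, not the exact code from the project: since Chrome's element references go stale once a tab loses focus, the data gets copied into plain Python lists while each tab is still active):

```python
def collect_tab_texts(driver, selector):
    """Visit every open tab and eagerly read element texts into lists.

    `driver` is any Selenium WebDriver, `selector` a CSS selector;
    both names are illustrative, not from the original project.
    """
    texts = []
    for handle in driver.window_handles:
        driver.switch_to.window(handle)  # bring the tab into focus first
        elements = driver.find_elements("css selector", selector)
        texts.append([el.text for el in elements])  # copy now, not later
    return texts
```

This way nothing holds a live element reference after the tab switches away, which sidesteps Chrome's behavior entirely.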
TIL: Chrome can be useful after all! (in a very narrow use case)
Thanks for reading. ~Coni