Git scraping: track changes over time by scraping to a Git repository

Last modified on October 12, 2020

Git scraping is the name I’ve given a scraping technique that I’ve been experimenting with for a few years now. It’s really effective, and more people should use it.

The internet is full of interesting data that changes over time. These changes can sometimes be more interesting than the underlying static data. The @nyt_diff Twitter account, for example, tracks changes made to New York Times headlines, which offers a fascinating insight into that publication’s editorial process.

We already have a great tool for efficiently tracking changes to text over time: Git. And GitHub Actions (and other CI systems) make it easy to build a scraper that runs every few minutes, records the current state of a resource and tracks changes to that resource over time in the commit history.

Here’s a recent example. Fires continue to rage in California, and the CAL FIRE website offers an incident map showing the latest fire activity around the state.

Firing up the Firefox Network pane, filtering to requests triggered by XHR and sorting by size, largest first, reveals this endpoint:

https://www.fire.ca.gov/umbraco/Api/IncidentApi/GetIncidents

That’s a 241KB JSON endpoint with full details of the various fires around the state.

So... I started running a git scraper against it. My scraper lives in the simonw/ca-fires-history repository on GitHub.

Every 20 minutes it grabs the latest copy of that JSON endpoint, pretty-prints it using jq (for diff readability) and commits it back to the repo if it has changed.

This means I now have a commit log of changes to that data about fires in California. Here’s an example commit showing that last night the Zogg Fire’s percentage contained increased from 90% to 92%, the number of personnel involved dropped from 968 to 798 and the number of engines responding dropped from 82 to 59.

Screenshot of a diff against the Zogg Fire, showing personnel involved dropping from 968 to 798, engines dropping from 82 to 59, water tenders dropping from 31 to 27 and percent contained increasing from 90 to 92.

The implementation of the scraper is entirely contained in a single GitHub Actions workflow. It lives in a file called .github/workflows/scrape.yml which looks like this:

name: Scrape latest data

on:
  push:
  workflow_dispatch:
  schedule:
    - cron:  '6,26,46 * * * *'

jobs:
  scheduled:
    runs-on: ubuntu-latest
    steps:
    - name: Check out this repo
      uses: actions/checkout@v2
    - name: Fetch latest data
      run: |-
        curl https://www.fire.ca.gov/umbraco/Api/IncidentApi/GetIncidents | jq . > incidents.json
    - name: Commit and push if it changed
      run: |-
        git config user.name "Automated"
        git config user.email "actions@users.noreply.github.com"
        git add -A
        timestamp=$(date -u)
        git commit -m "Latest data: ${timestamp}" || exit 0
        git push

That’s not a lot of code!

It runs on a schedule at 6, 26 and 46 minutes past the hour. I like to offset my cron times like this, since I assume that the majority of crons run exactly on the hour, so running not-on-the-hour feels polite.

The scraper itself works by fetching the JSON using curl, piping it through jq . to pretty-print it and saving the result to incidents.json.
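
Pretty-printing is the detail that makes the diffs readable: Git diffs are line-based, so a single-line JSON response would show up as one enormous changed line, while jq puts each field on its own line. You can try the fetch step by hand (assuming curl and jq are installed; the -s just silences curl’s progress meter):

    # Fetch and pretty-print, essentially the workflow's "Fetch latest data" step
    curl -s https://www.fire.ca.gov/umbraco/Api/IncidentApi/GetIncidents \
      | jq . > incidents.json
    # Once the upstream data changes, re-running the above and diffing
    # shows a field-by-field change rather than one giant modified line
    git diff incidents.json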

The “Commit and push if it changed” block uses a pattern that commits and pushes only if the file has changed. I wrote about this pattern in this TIL a few months ago.
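
The trick that makes it work: git commit exits with a non-zero status when there is nothing staged, so appending || exit 0 turns “no changes detected” into a clean success rather than a failed workflow run. Isolated from the workflow above, the pattern looks like this:

    git add -A                    # stage any files that changed
    timestamp=$(date -u)          # e.g. "Fri Oct  9 21:04:05 UTC 2020"
    # "git commit" fails when the staging area is empty;
    # "|| exit 0" converts that failure into a successful no-op
    git commit -m "Latest data: ${timestamp}" || exit 0
    git push                      # only reached if a commit was made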

I now have a whole bunch of repositories running git scrapers. I’ve been labeling them with the git-scraping topic so they show up in one place on GitHub (other people have started using that topic as well).

I’ve written about a few of these in the past:

  • Scraping hurricane Irma back in September 2017 is when I first came up with the idea of using a Git repository in this way.
  • Changelogs to help understand the fires in the North Bay from October 2017 describes an early attempt at scraping fire-related data.
  • Generating a commit log for San Francisco’s official list of trees remains my favorite application of this technique. The City of San Francisco maintains a regularly updated CSV file of 190,000 trees in the city, and I have a commit log of changes to it stretching back over more than a year. This project uses my csv-diff tool to generate human-readable commit messages; a minimal csv-diff invocation is sketched after this list.
  • Monitoring PG&E outages by scraping to a git repo paperwork my makes an strive to hint the impression of PG&E’s outages remaining yr by scraping their outage draw. I aged the GitPython library to expose the values recorded inside the commit historic previous true into a database that permit me bustle visualizations of changes over time.
  • Tracking FARA by deploying a scraper to Google Cloud Run
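
csv-diff, mentioned in the trees item above, compares two versions of a CSV file and summarizes the added, removed and changed rows. A minimal sketch of how it might be invoked against two snapshots of the tree list; the filenames are hypothetical and the --key column name is a guess at the city’s schema:

    # Compare yesterday's snapshot against today's; --key names the
    # column that uniquely identifies each row (column name is a guess)
    csv-diff trees-yesterday.csv trees-today.csv --key=TreeID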
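
The PG&E project used the GitPython library for the replay step. The same idea can be sketched in plain shell, extracting every historical version of a scraped file from the commit log; this example uses the incidents.json file from the repo above, and the snapshots/ directory is an arbitrary choice:

    # Walk every commit that touched the scraped file, oldest first,
    # and save each historical version under its commit timestamp
    mkdir -p snapshots
    git log --reverse --format='%H %cI' -- incidents.json |
    while read -r sha committed_at; do
      git show "$sha:incidents.json" > "snapshots/${committed_at}.json"
    done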
