The Practicalities of OER Web Scraping

Was about to scrape pages from a CC-licensed website (BY-NC-SA). Which made me think of the practice of webscraping. In my experience (as a non-coder), webscraping is something of a craft.

  • Who in the OER community has developed solid competencies in webscraping?
  • Which tools/techniques could you share for doing it properly? (e.g. do you use Beautiful Soup? Goose?)
  • Has there been work done on the harvesting side of the OER scene which connects to webscraping?
  • Is there a proper way to “fork” content this way, keeping some kind of trace of how content diverges?

For background… The site I’m about to scrape is a Google Site containing lists upon lists of tools that we can use in learning contexts. (So, a bit more “meta” than true OERs.) The site hasn’t been updated since 2014.
One thing I’m thinking of doing in the longer term is to use these lists as the basis for actual OERs by connecting the tools to competencies that learners are expected to develop.

Thanks!

–Alex

It’s definitely an alchemy-like craft, though you run the range from scraping for data collection (e.g. harvesting data from multiple sites, or from different time slots of the same site, to create something new) to more of an archival approach (things like Webrecorder or Archive.is).

I am not raising my hand for having solid competencies, more of a flyby curiosity. I have used SiteSucker, an OSX app, for making static HTML archives of WordPress sites that I archived, but this does not sound like where you are fishing.

But I am curious about what you are aiming to do.

Thanks for the hints, Alan!

Ditto! :wink:

There’s a simple part: I’m thinking about this site:

(CC BY-NC-SA 2.5 CA)

It contains several pages with information on software used in teaching. A bunch of tables, basically.

There are other tables like these. In fact, after posting my query, I was notified of a Community of Practice’s list of Free/Libre Open Source Software by discipline.

What I’m trying to do, there, is a compilation of all of these things.

Doing it by hand would probably not take that long. I just get the “urge” to automate the process. Call it the xkcd effect.
Should I Automate This? | Hackaday

Actually, I perceive it as an opportunity to develop skills.

Where things get murkier is that my own “flyby curiosity” leads me to Jupyter Notebooks.
On Not Hating Jupyter… – OUseful.Info, the blog…

…and the potential connection to PreTeXt.
PreTeXt (pretextbook.org)

…and ways to integrate all of this in PressBooks (for High Tech OERing!).

So…

I’m in an exploratory phase for a workflow in which we can package interactive content, equations, musical transcriptions, Open Data, etc.

The webscraping part can work quite well in Jupyter Notebooks. I’ll probably prototype it in Google Colab…
Get started with Google Colaboratory (Coding TensorFlow) - YouTube

Just noticed that the link to the CoP’s list was public. So I feel free to share it. (There’s a description in English.)

Free software and freeware in the Collegial Network - Google Spreadsheet

I also realize that freeware and freemium services are mixed in with the F/LOSS. Same with the site I want to scrape.

Also for context… I’ve recently joined the board of Adte.ca, an organization promoting F/LOSS in HigherEd. Something we’ve discussed this week is using Free Software to create OERs… and using OER approaches to document Free Software.

Thanks, I am still digesting all of this!

It might not be related, but this idea of git scraping is intriguing
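As I understand it, git scraping means fetching the same page on a schedule and committing each copy to a Git repository, so the commit history records what changed and when. That would also give a trace of how a “forked” copy diverges from its source. The sketch below is not git scraping itself: it uses Python’s standard `difflib` as a stand-in for the diff a commit would show. The function name `describe_changes` and the sample snapshots are made up for illustration.

```python
import difflib


def describe_changes(old_text, new_text, label="page"):
    """Return unified-diff lines between two snapshots of the same page.

    Hypothetical helper: a stand-in for what `git diff` would show
    between two commits of a scraped file.
    """
    return list(
        difflib.unified_diff(
            old_text.splitlines(),
            new_text.splitlines(),
            fromfile=f"{label} (earlier)",
            tofile=f"{label} (later)",
            lineterm="",
        )
    )


# Invented snapshots of a tool list, taken at two different times.
earlier = "Tool A\nTool B\nTool C"
later = "Tool A\nTool B (renamed)\nTool C\nTool D"

for line in describe_changes(earlier, later):
    print(line)
```

Lines starting with `-` were in the earlier snapshot only and lines starting with `+` are new. Two identical snapshots give an empty list, which is a handy signal that nothing needs saving.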

Thanks!

As you might notice by now, some (many?) of my questions are half-baked thoughts and/or “messages in bottles”. Thankfully, it often generates something useful when it goes in a different direction.

This one might help with some data literacy work I’m planning. Which leads back to the connections between Open Data and Open Education.

So many Opens! So little time!

I’ve done a decent amount of scraping with Google Sheets’ IMPORTXML function and Google Script/JavaScript (although I always hope for an API instead). I’ve dabbled some in Python and PHP to do it as well.

I often rely on nice little tools like the browser plugin Scraper for lightweight stuff. IMPORTHTML in Google Sheets is a handy way to get tables.

I came across Spatula the other day and will likely give it a shot the next time something comes up.

Sorry for the lack of links. I can only use two as a new user.

Thanks!
And welcome aboard.

Those solutions based on Google Sheets do sound quite interesting. Haven’t done much in Google Sheets for a while but these sound very appropriate (especially since scraping tables is the main thing I originally wanted to do).

I’ve gone back to Python for this type of thing. Since I’m not a coder, it takes me quite a bit of time and I need to be in the right “mode” to do it. Once I do get something together, it’s a lot more flexible and powerful than most off-the-shelf solutions. I currently do it in Google Colab as Jupyter Notebooks with pandas, Beautiful Soup, etc.
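For anyone curious what the table-scraping step can look like, here is a minimal sketch. It is not the notebook described above: it uses only Python’s standard `html.parser`, so it runs without installing anything, whereas Beautiful Soup or `pandas.read_html` would do the same job with less code. The names `TableCollector`, `extract_tables`, and `to_records`, and the sample table, are invented for illustration. It assumes simple tables with a header row and no nesting.

```python
from html.parser import HTMLParser


class TableCollector(HTMLParser):
    """Collect each <table> as a list of rows, each row a list of cell texts."""

    def __init__(self):
        super().__init__()
        self.tables = []    # finished tables
        self._table = None  # rows of the table currently being read
        self._row = None    # cells of the row currently being read
        self._cell = None   # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._table = []
        elif tag == "tr" and self._table is not None:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Join fragments and collapse runs of whitespace.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self._table.append(self._row)
            self._row = None
        elif tag == "table" and self._table is not None:
            self.tables.append(self._table)
            self._table = None


def extract_tables(html):
    """Return every table found in an HTML string."""
    parser = TableCollector()
    parser.feed(html)
    parser.close()
    return parser.tables


def to_records(table):
    """Treat the first row as headers and return one dict per remaining row."""
    header, *rows = table
    return [dict(zip(header, row)) for row in rows]


# Invented sample standing in for one page of the site.
SAMPLE = """
<table>
  <tr><th>Tool</th><th>Category</th></tr>
  <tr><td>Example Tool A</td><td>Audio</td></tr>
  <tr><td>Example Tool B</td><td>Mind  mapping</td></tr>
</table>
"""

records = to_records(extract_tables(SAMPLE)[0])
print(records)
```

Once every page is reduced to a list of dicts like this, compiling several lists into one is mostly a matter of concatenating them and agreeing on column names.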

Though these sorts of technical questions may not bring the OE Global community together, it sounds to me like a bit of “tooling” could enhance our work.

Thanks again!

Hey, great to see you here, Tom. Thanks for chiming in. I think I know your path here, so thanks for reading.

For those who are not scrapers (aka me), it might help to know what you are scraping and what for?

I came for the Cog and stayed for the Dog. :slight_smile:

My scraping has been driven by random faculty trying to get data for analysis, with occasional personal departures that tend to be for amusement. I also have done a number of stackoverflow questions on this over the years. That tends to be sports/stocks type of things with people wanting to do it in Google Sheets.

In general, it’s about data analysis about 95% of the time. The other pieces tend to be about improving or altering what I can do with the data. I’ll often pull out big tables and put them into a spreadsheet to manipulate them more easily. Stuff like that.

Stuff I have scraped.

  • various things from Instagram related to public health after they shut down most of the API (vaping, ecigs and views, hashtags etc.)
  • CDC comments on Facebook on posts related to Ebola for racism analysis
  • a giant discussion board on getting green cards
  • twitter messages from an educational charter school (for past data when they didn’t set up TAGS ahead of time)
  • youtube video views, likes, dislikes for a series of videos that were not owned by the faculty member
  • Gangnam Style views on youtube
  • sexual assault in higher ed page
  • Snoop Dogg’s instagram followers
  • a bookmarklet to let me play Alan Lomax songs without going through an unpleasant archive interface
  • an export bookmarklet to get content out of Top Hat so it could be put in PressBooks

Appreciate it Enkerli.

I often suggest Google Sheets for people who aren’t interested in coding or don’t have time for it, but you sound like you manage quite well. There is a good guide here if you end up looking into Sheets more.

I feel like there’s a scraping guide for faculty out there that I’ve seen. I’ll see if I can dig it up.

TAGS!
Haven’t heard of that in a while. It was my primary reason to use Google Sheets, at some point.
For idiosyncratic reasons, it reminds me of NodeXL…

And while I fully realize this may all be tangential to OEG, I perceive a path to something very useful about the connections between Open Data, Open Pedagogy, and Open Educational Resources (with a hint of Free/Libre Open Source Software, though Google services are notoriously very much nonfree, especially if you’ve ever heard RMS talk in public).

Basically, there’s something about dynamic resources for learning. A bit like Adaptive Learning (with Learning Analytics and such). Yet much more about opening the spigot from Open Data firehoses and having your learning material accommodate that data influx.

Since it’s early Summer, here, I’m in the most actively divergent part of my design thinking journey. It involves disparate dimensions from curation and annotation (last week was #ianno21) all the way to forking H5P from a content bank integrated in a federated LMS.

Sooooo… Thanks Tom for the thought juice.

Twitter TAGS is remarkable and Martin Hawksey a genius. There was more in it than met the eye.

My flip on analytics in co-teaching Networked Narratives, where student activities happened in Twitter and hypothes.is, was asking them to evaluate and reflect on what their activity within the larger network reflected.

A trick Martin shared came in handy, because in a large, Death Star-shaped visualization of an active hashtag, how does one find oneself? Just tack &name=twittername on to that URL, and it becomes a link to one person’s node: say, for the sake of others’ privacy, my activity.
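If you need such links for a whole class, the tacking-on can be scripted. A small sketch, assuming the visualization URL already carries its own query string; the `name` parameter comes from the trick above, while the function name `link_to_node` and the example URL are placeholders, not the real explorer address.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def link_to_node(explorer_url, twitter_name):
    """Return explorer_url with a name=<twitter_name> query parameter.

    Any existing `name` parameter is replaced, so the same base link
    can be reused for different people.
    """
    parts = urlsplit(explorer_url)
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "name"
    ]
    query.append(("name", twitter_name))
    return urlunsplit(parts._replace(query=urlencode(query)))


# Placeholder URL, not a real visualization.
print(link_to_node("https://example.org/explorer/?key=abc123", "twittername"))
```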

So we would ask students to characterize their activity and their “place” in the larger network. This turns analytics inside out (maybe), where the machine just takes the data. For even more mind-blowing, for any single user’s node, click the “Replay Tweets” button. I am not sure anyone else even saw that, but it plays back your history in time.

Asking students to share their link in the network, and another link to say, their hypothes.is annotations, provided a view into their activity. Better than counting comments.

(I should say, twitter was never required; we offered students other ways to participate if it was not a safe or viable space for them.)