I was about to scrape pages from a CC-licensed website (BY-NC-SA), which got me thinking about the practice of webscraping. In my experience (as a non-coder), webscraping is something of a craft.
- Who in the OER community has developed solid competencies in webscraping?
- Which tools/techniques could you share for doing it properly? (e.g. do you use Beautiful Soup? Goose?)
- Has any work on the harvesting side of the OER scene connected to webscraping?
- Is there a proper way to “fork” content this way, keeping some kind of trace of how content diverges?
For background: the site I’m about to scrape is a Google Site containing lists upon lists of tools that we can use in learning contexts. (So, a bit more “meta” than true OERs.) The site hasn’t been updated since 2014.
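To make the questions above a bit more concrete, here is a rough sketch of what such a scrape might look like. It is only an illustration under several assumptions: it uses Python's built-in `html.parser` rather than Beautiful Soup or Goose (so it runs without installing anything), it assumes the tool lists are plain, non-nested `<li>` items, and the names `ListItemParser`, `scrape_with_provenance`, the sample HTML, and the `example.org` URL are all made up for the example. The idea is to store each scrape together with its source URL, license, retrieval date, and a hash of the original page, so there is at least a minimal trace of where "forked" content came from.

```python
import hashlib
import json
from datetime import date
from html.parser import HTMLParser


class ListItemParser(HTMLParser):
    """Collect the visible text of every <li> element (assumes flat lists)."""

    def __init__(self):
        super().__init__()
        self.items = []
        self._in_item = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self._in_item = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_item:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "li" and self._in_item:
            # Join the text fragments and collapse runs of whitespace.
            text = " ".join("".join(self._buffer).split())
            if text:
                self.items.append(text)
            self._in_item = False


def scrape_with_provenance(html, source_url, license_name, retrieved=None):
    """Return the list items plus a small provenance record (hypothetical format)."""
    parser = ListItemParser()
    parser.feed(html)
    parser.close()
    return {
        "source_url": source_url,
        "license": license_name,
        "retrieved": retrieved or date.today().isoformat(),
        # Hash of the page as retrieved; re-hashing later shows if the source changed.
        "sha256": hashlib.sha256(html.encode("utf-8")).hexdigest(),
        "items": parser.items,
    }


# Made-up sample standing in for one page of the real site.
SAMPLE_HTML = """<ul>
  <li><a href="https://example.org/a">Tool A</a>: mind mapping</li>
  <li>Tool B:   shared whiteboard</li>
</ul>"""

record = scrape_with_provenance(
    SAMPLE_HTML,
    source_url="https://example.org/tools",  # placeholder, not the real site
    license_name="CC BY-NC-SA",
    retrieved="2024-01-01",
)
print(json.dumps(record, indent=2))
```

Fetching the live pages is left out on purpose; the sketch only covers parsing HTML that has already been downloaded. Comparing the stored `sha256` against a fresh download would show whether the source has changed since the scrape, though it says nothing about how a derived copy has diverged, which probably calls for something like version control.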
One thing I’m thinking of doing in the longer term is to use these lists as the basis for actual OERs by connecting the tools to competencies that learners are expected to develop.