Wednesday, 29 May 2013

The Next Generation of Aggregation and RSS Readers

I've been thinking ( again ) about RSS readers, what they are, what they are for, where ( if anywhere ) they're going next and lastly, how I might maybe make my own ( on the cheap ) .

What is an RSS reader anyway?

Avoiding the technical specification of what RSS is... it is just a way of collecting explicit subscriptions to news from various sites. It is markedly NOT email.

Along the way, RSS readers forgot, or were told to forget, that they were also aggregators. Aggregators were cousins of RSS readers in that they collected lots of news together and re-published it, normally as a web page. In the olden days ( 2005ish ) there were lots of aggregators. There were tools to make your own aggregators, and aggregators were important, in my opinion, because they did the hard curatorial work of selecting related news sources and making them available, normally in a format you could subscribe to.

As more and more people came online, producing news, consuming news, many aggregators - who were normally making no money whatsoever - went to the wall. Most desktop-based RSS software tools simply failed under the demands of people who had thousands and thousands of subscriptions. Google Reader, one of the few tools that seemed to handle this scale, slowly wiped out both other RSS readers ( online or desktop based ) AND aggregator tools and sites.


The Next Generation of RSS Readers

The next generation of RSS readers weren't really RSS readers at all. Tools like Flipboard and Feedly looked to broaden your news reading reach to bring you both the good stuff and also include your personal connections. One of the big problems with RSS was its subscribing mechanism, which to this day is way too geeky really. Just explaining to someone what to look for if they wanted to subscribe to a site was a usability nightmare. "It will be called RSS or Atom or Latest Entries, or maybe it's in the source code, oh, it might have an orange icon or be called XML".

Removing the "subscription" aspect from news reading kills it dead. The loathsome Summly is a good example, bringing you the sort of news that is worse than what you find in a free newspaper scattered all over the bus floor.

Feedly tries to extend your news reading range, but in my opinion, does it poorly. One of the BIG problems with new subscriptions is that after a while you get bored of them. Your interests change and migrate and a good news reader tool should allow this to happen naturally and not try to do it, with AI magic, for you.



Where They're All Going Wrong


In an article called Produce Organise Consume ( Jan 2011 ) I pointed out the strange division between "bookmarking" or tagging, blogging and reading, and argued that for me they were essentially different aspects of the same thing.

These similar activities start to really come into their own when they start feeding the others. When what you are writing about, or bookmarking starts affecting what you are brought to read, for example.

The hidden glue in these three blobs is Connection. Those connections might be explicit, based on another connection or deduced or part of some fantastic artificial back end joining mysterious pieces together ( although probably not ).




The 2009ish "holy trinity" of Delicious ( for organising ), Google Reader ( for reading ) and Blogger ( for writing ) is looking like a dead threesome in the water, with Delicious foundering and not knowing what it is anymore, Reader on death row and Blogger slowly dying from neglect at an unknown address.


Production and Consumption have moved to Twitter and Facebook ( Google+ maybe ) and Organisation lurks in Twitter hashtags and trending topics, loosely joining things together. Bringing these three activities together shouldn't be too hard. I don't even think it needs one uber-tool to dominate how they happen ( people are very pernickety about how they work ) but it does need to be thought about.

And it's clear that Google aren't thinking on these simple basic activity lines. Where Google Keep fits into my model is obvious, it's an Organising thing - but it doesn't fit well with the Production or Consumption thing, and it really could. The same is true of Google+ which fails awfully as a tool of Production or Organisation ( it'd be nice to see a page of the hashtags I'd used for example ).



Can I Make My Own Aggregator?

And so, given that the world isn't dancing to my tune, or even in the same beat, I wonder if, on a small scale I could make my own aggregator cum reader cum writing platform cum organiser using parts that already exist. 

There were remarkably few open source tools for running aggregators that I liked, Wp-o-matic and Planet Python being exceptions in that they were easily adaptable into something else, but still far from ideal. To be fair, it's years since I looked, but I recall trying HUNDREDS of them.

My aggregator would need to be:

  • Small - Not dealing with a huge amount of data
  • Shareable - browsable - subscribeable ( offering search term feeds is great )
  • Accessible via more than the obvious "what's new" interface, including tags and connections etc
  • Run in the cloud
  • Integrate with a Production and Collation process somehow.

The parts that already exist


You've probably guessed that the parts I want to work with are Google Spreadsheets, Google Docs and Apps Script. I don't think I can use a ScriptDB or Fusion table because there would be just too many read/writes to their databases even though they deal with large amounts of data very well.

It would be possible to create an aggregator that collected, say, fewer than a thousand feeds and saved the articles in a Google Spreadsheet. But it's worth paying attention to Google Spreadsheet limits and quotas.

  • Spreadsheets: 400,000 cells, with a maximum of 256 columns per sheet
  • Number of tabs: 200 sheets per workbook
  • ImportHtml functions: 50 per spreadsheet ( from @mhawksey )


So, if I used Google Spreadsheets I might have to have monthly rollover, creating a new spreadsheet for each month. And so that connections between items ( via tags ) worked across months, I might need to render them as HTML and use Google Drive hosting to serve them.
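
To get a feel for whether a monthly rollover is about right, here is a rough back-of-envelope sketch. Only the 400,000 cell limit comes from the figures above. The six columns per article and the 2,000 articles a day are assumptions I've made up for illustration, and the function names are hypothetical.

```python
# Back-of-envelope capacity check for one spreadsheet.
CELL_LIMIT = 400_000        # cells per spreadsheet, from the limits quoted above
COLUMNS_PER_ARTICLE = 6     # assumption: e.g. date, feed, title, link, summary, tags


def max_articles(cell_limit=CELL_LIMIT, columns=COLUMNS_PER_ARTICLE):
    """Number of one-row articles that fit before the cell limit is hit."""
    return cell_limit // columns


def days_until_full(articles_per_day, cell_limit=CELL_LIMIT, columns=COLUMNS_PER_ARTICLE):
    """Whole days of collecting before a new spreadsheet is needed."""
    return max_articles(cell_limit, columns) // articles_per_day


if __name__ == "__main__":
    print(max_articles())          # 66666 articles per spreadsheet
    print(days_until_full(2000))   # 33 days at an assumed 2,000 articles a day
```

Under those assumptions one spreadsheet lasts roughly a month, but a busier set of feeds or a wider row would fill it sooner.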

I'm just mulling over whether making a very simple aggregator with Google parts is sensible or not. I would still like to be able to show a TagCloud of news concepts from all the various social media corners of the University of York, like the one shown below ( from my PPPeoplePPPowered project in 2010 ).



Most interesting ( to me ) was the ability to connect people and concepts in a network. This was achieved by processing each news article with Open Calais to extract the concepts it contained.



I think what I'm trying to say is that, in order to convince people that there is real value in working with blogs ( or whatever means of online connected Production ), you need to "hook them" and show them an explicit example of how everything fits together, but that example can't be one you have chosen; it has to fit their world model.

A real-life example I've worn out a little from overuse is when, showing a lecturer how blogging works, I suggested they added a tag to their post. They added the tag "ahrc" to a post about working on a funding bid. They clicked the tag and found another lecturer in another department working on the same bid. He said he'd immediately go for a chat and see if there was an opportunity for collaboration.

This stunningly simple example can be made to happen over and over if only we can connect the ephemera of writing and reading and organising from wherever people are doing it. I think an aggregator of tweets and wiki edits and Google Site updates and blog posts from York staff would be a start in letting a lot more people into an area that can seem, at times, just too geeky for some.








Friday, 24 May 2013

Google Apps, New Possibilities for Old Tools?

Last week Google announced a number of new Apps Script features that have been added to Google Documents and Forms and Spreadsheets.

The features themselves may not seem worth shouting that loudly about, but the ability to add Sidebars to documents and add menus and arbitrary user interface items that run Apps Script code means you can start to dream about how you could extend and combine these really powerful objects in new ways.

Google Apps was already a collection of powerful objects ( Documents, Spreadsheets, Drive Files, Forms, Calendars, Sites ) that could be easily combined with Apps Script to create really useful applications, but with the arrival of these new features, the ability to combine them can be more elegant. And because you can create tools and interfaces within the documents you can extend the tools - rather than just combine them.

We've already seen a demo of Bibsto, an Apps Script Bibliography manager that changes Google Documents into Research papers - with added tools for adding references and citations.

We are going to see a flourishing of new custom-made add ons to Google Apps, mark my words. For example, Martin Hawksey has already been trying to create a "Document Map" feature, à la Word. There's a feature I'd love... and if only Martin and I want it, so be it.

More importantly, there's a high chance you might create some innovative tool yourself that makes Google Documents work the way you want them to. We'll see add ons for writers or for educators or for marketers or cheese makers or just you.

Ooooh, and it makes me wonder...

One of my long-standing criticisms of Google, from a UX perspective, is that many of their innovations seem isolated from the other ones - meaning simple features get implemented in one product but not another, creating an overall usability glitch of nothing working quite as you'd expect. For example, how search works, drag & drop, document ownership models, commenting and API access are just some of the things that work one way in one product and differently in another.

The future is already here in Google Apps land, it's just not evenly distributed.

And so, whilst it's great that Google are starting to make scripting features available across two or three products, it does make you notice where the gaps are.

Google, being Google, often fail to see what they're sitting on and cock things up. For example, Google Wave should not have been a product, it should have been the real-time commenting system for all Google products. If delivered well, you would have barely noticed its existence ( except maybe you'd notice how appalling other systems were in this regard ).

So, What Should Google Do?

Mind The Gaps

The "small pieces loosely joined" idea has legs. Originally it applied, I think, to blogs and RSS and other web2.0 tools that you could "wire together" to create newer, bigger, more complex things. It also applies to the ideas behind Unix and maybe Gall's Law ( "A complex system that works is invariably found to have evolved from a simple system that worked" ). Google seem to be creating a compelling landscape of tools wired together using Apps Script.

Google simply need to look at what they have and ask "What if the good stuff was everywhere?".

So, where are the Presentations and Drawings apps in the "small pieces loosely joined" mix? They're noticeably missing. Now I imagine most people don't really use Google Presentations and Drawings much. I see most people still using Powerpoint ( uploading .pptx files into Google Drive ) and traditional desktop based graphics tools.

Except, what if you could add scripts to shapes in Google Presentations? Scripts that took the user to the next slide, for example, or went to a slide based on which button you clicked. You'd be able to create a narrative experience, or a quiz or a mini learning object that branched in all sorts of directions. Kids could use it.

And what if, like you can in Google Sites, you could insert videos and documents, or get data via Apps Script and put it into fields on screen? You would have an interface builder anyone could use. It would be like an online HyperCard - a tool with which people could pull various resources together and make them work the way they want ( without serious programming ).

If you could script the Drawing app you might be able to create simple animations, or maybe simple visualisations.

And the thing is, all that functionality is sort of sitting there already. It doesn't need much in the way of design, it just needs someone to connect the dots.

And no, Google Presentations or Drawings might not be the *best* tool for creating presentations in, but if they were scriptable, like Documents, they'd suddenly become a new thing, loaded with possibilities, rather than an old thing hanging about being slightly embarrassing.

I think this process of making sure that your innovations touch all parts of your product range is a cheap one to conceive and implement - it's dealing with lots of known knowns.

There are dozens of "Mind The Gap" innovations Google could make that I can guarantee would initiate a huge wave of creativity using Google Tools. How do I know? Well, people are just like that.












Thursday, 23 May 2013

Getting RSS from a site that doesn't offer RSS using Dapper and Yahoo Pipes

I was asked if it was possible to get RSS from a site that doesn't offer RSS.

One site whose content I was interested in was "Community & Networks Connection" - it aggregates lots of "community and collaboration software" news.

Although the site offers RSS feeds, the news in the feed looks like this ( below ): all of the articles are chunked into daily digests, forcing you to click through to the site and never, ever catching your eye.


Of course it would be possible to screen-scrape the data from the site and republish it as RSS, maybe using a scripting language or the excellent ScraperWiki tool, but I really wanted something that anyone could use... in seconds.


Dapper To The Rescue

I began by visiting Dapper, a tool that lets you point and click and select which bits of a page you want to scrape. I started by clicking on the images of the news articles at the top.



After a little fiddling, you can choose whether you want that data in RSS or CSV or even as a Google Map. ( It really does take some fiddling and pruning to work out what you do here. Dapper is an astonishingly wonderful tool, I've never seen anything that does what it does with such elegance, but it does work once you've got your head around it. )

I could then choose to add my new RSS feed to my RSS Reader, but I actually made another Dapp that got the articles lower down the page. That now leaves me with two RSS feeds which I don't really want.

One of the "dapps" I created is here:
http://open.dapper.net/dapp-howto-use.php?dappName=CommunitiesandNetworkConnectionDapperVersion2



Yahoo Pipes To The Rescue

Yahoo Pipes is a wonderful visual tool for "piping" together different information sources and republishing them. The pipe I created ( shown below ) takes the two RSS feeds ( at the top ) from Dapper, joins them together ( Union ), strips out any duplicates ( Unique ) and lastly filters out any junk posts.
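
For anyone who prefers to see the three steps written down, here is a minimal sketch of the same union, unique and filter idea in plain Python. It isn't how Yahoo Pipes does it internally, just the logic. The feed items, the example.com links, the "sponsored" junk word and the function names are all made up for illustration.

```python
def union(*feeds):
    """Join several lists of feed items into one list, keeping their order."""
    merged = []
    for feed in feeds:
        merged.extend(feed)
    return merged


def unique(items, key="link"):
    """Drop items whose key field has already been seen."""
    seen = set()
    kept = []
    for item in items:
        if item[key] not in seen:
            seen.add(item[key])
            kept.append(item)
    return kept


def drop_junk(items, junk_words):
    """Filter out items whose title contains any of the junk words."""
    return [
        item for item in items
        if not any(word in item["title"].lower() for word in junk_words)
    ]


# Hypothetical items standing in for the two Dapper feeds.
top_feed = [
    {"title": "Community managers meet up", "link": "http://example.com/a"},
    {"title": "Sponsored: buy widgets", "link": "http://example.com/b"},
]
lower_feed = [
    {"title": "Community managers meet up", "link": "http://example.com/a"},
    {"title": "Wikis in the workplace", "link": "http://example.com/c"},
]

combined = drop_junk(unique(union(top_feed, lower_feed)), junk_words=["sponsored"])
for item in combined:
    print(item["title"], item["link"])
```

De-duplicating on the link rather than the title is a choice on my part: two different articles can share a headline, but they rarely share a URL.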



The RSS feed that Yahoo Pipes creates is here:
http://pipes.yahoo.com/pipes/pipe.run?_id=10c40fa02b113c58042af74deead0c1a&_render=rss

And it looks a bit like this:


After a few minutes configuring using point and click tools, I can now keep in touch with the news from the site from my news reader. 

Thursday, 16 May 2013

Document Sidebar and Menus in Google Docs

For a long time I've been saying that Google need to make sure that their left hand knows what the right is doing. There are pockets of innovation in some products that are notably missing from other products. And whilst I don't advocate insisting that each department should consult with each other department before a beautiful carbuncle can be brought into the world, it's nice when you feel that different Google teams are at least on speaking terms.

One example ( and there are dozens and dozens ) of innovation insularity is the features that are in a Google Spreadsheet. With a little code and ingenuity you can add menus, and pop ups and whole new interfaces to a spreadsheet, but this can't be done in a Google Document. "Why not?" you might ask. And rightly so.

So, I was very pleased when Google announced at Google I/O ( their big developers conference happening right now ) that the features from Spreadsheets were being added to Google Documents. Take a look at this document. It has custom menus and a custom sidebar.


Imagine...

… what you could use those menus for. Imagine what content, or links, or even more clever stuff you might have in the sidebar. It could be pre-created content, by you the lecturer or it could be content that is derived from whatever you are typing. The menus might remind you when the essay deadline was, or email the document to random peers. I've already made some code to save a Google Document to Blogger that saves the images into a publicly hosted GDrive folder. This article was written in Google Docs, not Blogger which is a bit shonky nowadays. Imagine how easy it will be to add a "Save to Blogger" menu item. This is exciting stuff.


Except hold your excitable horses.

The only problem is that the code to do these things in a Google Spreadsheet is completely different from the code you need to do them in a Google Document. The code below shows how to add a menu and a sidebar to your Google Document ( go to the Tools >> Script Editor menu, paste it in, then run onOpen. You will need to authorize it ).

function onOpen() {
  // Display a sidebar with custom UiApp content.
  var uiInstance = UiApp.createApplication()
      .setTitle('My UiApp Sidebar')
      .setWidth(250);
  uiInstance.add(uiInstance.createLabel('This side bar can contain content that is pre-defined by the lecturer.'));
  uiInstance.add(uiInstance.createLabel('It might contain some regular links to useful stuff.'));
  uiInstance.add(uiInstance.createAnchor('It might contain a link to plagiarism', 'http://www.york.ac.uk/integrity/plagiarism.html'));
  uiInstance.add(uiInstance.createLabel('The really interesting part, for me, is that it might contain dynamically generated links to stuff of relevance to what you are writing.'));
  uiInstance.add(uiInstance.createLabel('What if a document had links to required reading, or a link to your mentor?'));
  uiInstance.add(uiInstance.createImage('https://si0.twimg.com/profile_images/3457614642/d6b665b4c213fe02ec28dc3e94db6e0b.jpeg').setHeight(100).setWidth(100));

  DocumentApp.getUi().showSidebar(uiInstance);

  // Add a custom menu, with a submenu, to the document's menu bar.
  DocumentApp.getUi()
      .createMenu('My Menu')
      .addItem('Do something...', 'do_something')
      .addSeparator()
      .addSubMenu(DocumentApp.getUi().createMenu('My Submenu')
          .addItem('One Submenu Item', 'do_something_else')
          .addItem('Another Submenu Item', 'do_nothing'))
      .addToUi();
}

// Handlers for the menu items; each one just shows an alert.
function do_something() {
  DocumentApp.getUi().alert('This could do something');
}

function do_something_else() {
  DocumentApp.getUi().alert('This does something too. It could go and get live weather data and insert it into the document. Ha ha!');
  /* THIS DOESN'T WORK in a Document ( Browser.msgBox is a Spreadsheet thing )
  Browser.msgBox("This does something too. It could go and get live weather data and insert it into the document. Ha ha!")
  */
}

function do_nothing() {
  DocumentApp.getUi().alert('This does something. It could be anything. It might mail this doc to someone, or add it as a reference to a spreadsheet.');
}



I tried hacking the code into a Spreadsheet. It did sort of work. The sidebar appears as a popup dialog, but the menus don't work at all.



Whilst initially really encouraged that Google were "pulling the strands together" at last and making cool behaviours consistent across their product range, I am a little disappointed that it isn't completely idiot proof ( I have a stakeholder interest in things being idiot proof ).

My guess is that this is a sign of things to come, that the process of innovation consistency is just starting and one day I will be able to add Apps Script to my Drawings and Presentations and whatever else from Google.

Thursday, 9 May 2013

Information Freedom Fighting


My eye caught the City of York Council announcing that they publish all the "Freedom of Information" requests as PDFs ( here ).

The sharp-eyed amongst you will spot that the requests are organised by weeks. Each week's requests are stored in a PDF for that week. Each PDF would need clicking through to that week, then clicking through to that page and then downloading separately (using the handy "Download Now" link ) and then reading. The search engine is pretty hopeless and can't just return FOI requests and so gives you hundreds of results for any query.

Organising FOI requests by week is completely ridiculous, almost as ridiculous as ordering them by the number of words used or alphabetically. Now of course it probably makes sense from the point of view of compiling the requests - it sounds like a "once a week" job for somebody, but to then publish them once a week seems madness.

One of my pet hates is information that is made available but totally impossible to use. It's like saying, "Yes, of course you can have all the data we keep on you, we have written it on mist on these eggshells - would you like us to post it to you?". It's exactly like that.

So, one evening, I wondered if I could retrospectively do something more useful with their data. It is open, so why not.

First I made a crawler with ScraperWiki ( what an excellent tool this is ) that follows all those links and grabs the text from the PDFs. I based this on Martin's Scraper ( thanks Martin ).

I then downloaded the data collected as CSV and tried using the free statistical tool Sci2 for stemming and combining the words. Whilst this approach worked, I didn't like the resulting stemmed words, like glaz, because they just look so unfriendly.

Next, I wrote a Python script to strip out stopwords like "the" and "where", count the popularity of each remaining word in the downloaded file while keeping track of the URLs it came from, and save the result to a new .csv file.

import re
from html.parser import HTMLParser

CSV_PATH = '/Users/tomsmith/Downloads/pdfextractor_1.csv'


class MLStripper(HTMLParser):
    """Collects only the text content of an HTML document."""
    def __init__(self):
        HTMLParser.__init__(self)
        self.fed = []

    def handle_data(self, d):
        self.fed.append(d)

    def get_fed_data(self):
        return ''.join(self.fed)


def strip_tags(html):
    # Helper kept from my scraper ( not used below ).
    # Warning: the text inside script and style elements is kept too.
    x = MLStripper()
    x.feed(html)
    return x.get_fed_data()


def match(s, reg):
    """Return every match of the regular expression reg in s."""
    p = re.compile(reg, re.IGNORECASE | re.DOTALL)
    return p.findall(s)


with open(CSV_PATH) as f:
    lines = f.readlines()
with open('stopwords-en.txt') as f:
    stopwords = set(f.read().split())

d = {}         # word -> number of times it appears
urls_for = {}  # word -> URLs of the PDFs it appears in

for line in lines:
    # Each line is "url,text"; the first two comma-separated fields are used.
    fields = line.split(",")
    url = fields[0].strip()
    text = fields[1].strip()
    for word in text.split(" "):
        # Keep only the leading run of letters, which also throws away numbers.
        word = match(word.lower(), "[a-z]*")[0]
        if len(word) > 1 and word not in stopwords:
            d[word] = d.get(word, 0) + 1
            # Remember each URL once per word, in one pass over the file.
            if url not in urls_for.setdefault(word, []):
                urls_for[word].append(url)

# Most popular words first.
finalFreq = sorted(d.items(), key=lambda t: t[1], reverse=True)

with open("tagcloud.csv", 'w') as out:
    out.write("word,frequency,urls\n")
    for word, count in finalFreq:
        urls_str = "|".join(urls_for[word])
        out.write(word + "," + str(count) + "," + urls_str + "\n")
        print(word, count, urls_str)

I then uploaded it as a Google Spreadsheet and turned it into a Web Application.

Problems along the way


I found that displaying 1,000 words was a bit of a struggle for jQuery ( maybe I made it wrong ) so I ended up just showing 250 at a time.

I found, of course, that building the tag cloud by word popularity only revealed that the most popular words, like "foi" and "february" and "council", were fairly meaningless in this context. To do this properly you also need a list of contextual stopwords... some human intervention.
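
As a rough illustration of what I mean by contextual stopwords, here is a small sketch that counts words after removing both a general list and a context-specific one. The three contextual words are the ones I mention above. The tiny general list, the sample sentence and the count_words function are made up for the example and aren't part of the script I actually ran.

```python
import re
from collections import Counter

# Assumption: a tiny sample list; a real one would come from a stopwords file.
GENERAL_STOPWORDS = {"the", "where", "and", "of", "for", "to", "in"}
# Words that are only meaningless in the context of council FOI requests.
CONTEXTUAL_STOPWORDS = {"foi", "february", "council"}


def count_words(text, stopwords):
    """Count lower-cased alphabetic words of two or more letters, skipping stopwords."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words if len(w) > 1 and w not in stopwords)


sample = "FOI request to the council for the cost of road gritting and road repairs in February"
counts = count_words(sample, GENERAL_STOPWORDS | CONTEXTUAL_STOPWORDS)
print(counts.most_common(3))
```

Keeping the two lists separate means the contextual one can be swapped out when the same script is pointed at a different pile of documents.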

Another glitch was that occasionally an FOI request would consist of massive tables of suppliers ( in a PDF, remember ) which would skew the words towards "B&Q" and "Wickes" or "building materials" etc. I had to avoid those.


The result


And here it is. A list of words that you can browse to find direct links to the PDFs from which they came. In the end, it seems that showing 250ish words is easier on the technology and the eye. The result is something that you could maybe browse and find a link to something relevant to you.



https://script.google.com/macros/s/AKfycbxAFOZjPBNnpg60MHzFQQjb2TkTsFUSDP_oPrRdNJDg83e3eCc/exec