Thursday, December 26, 2013

Creating a Twitter 'Bot on Google App Engine in Python

I've been running a Twitter 'bot from my laptop using Windows Task Scheduler for the past several months, and finally decided that it's time to upload it to a server to run from the cloud. @BountyBot tweets new and interesting bounty questions from Stack Overflow several times per day. By following the instructions below, you can set up your own Twitter 'bot that runs on Google App Engine.

What you'll need to get started:
  1. Python 2.7
  2. A Google App Engine Account
  3. The Google App Engine SDK
  4. A Twitter account (with authentication credentials)
  5. A Python library for accessing the Twitter API
  6. A source of information to tweet about

What is Google App Engine?

Google App Engine is Google's cloud computing platform. It allows you to create web applications and run them on Google's existing infrastructure. GAE supports web applications written in Python, Java, PHP, and the Go programming language.  You can find a lot more information on the Google App Engine page, or keep reading for a quick guide to getting set up and posting a Python app on Google App Engine.

Setting up a Google App Engine Account

You can set up a Google App Engine Account for free.  You're only charged for the resources that your application uses (that is, if you even surpass the free service quotas), so it can be a very good low-cost alternative to traditional web hosting providers that charge a flat monthly rate, particularly for a low-resource application like a Twitter 'bot.

Go to https://accounts.google.com to sign in to the GAE dashboard with your Google account credentials. From there you can create a new application. You'll need to provide an application identifier that will be used in your application's configuration later.

Installing Python 2.7

If you don't already have Python installed, you can just download version 2.7 from the official Python download page. If you already have Python on your machine, the Google App Engine SDK installer (see next section) will check for the correct version for you. If it reports an error, you may need to download the correct version of Python, or re-install Python 2.7 in the default location for your operating system.

Installing the Google App Engine SDK

Google App Engine has its own software development kit (SDK) available for free download that allows you to  quickly get started developing your app. Choose Python from the Downloads page, then download and run the installer for your operating system. This should place a shortcut on your desktop to a program called the Google App Engine Launcher.

Deploying and Testing

Before we get to the Twitter 'bot, let's create a quick test page and deploy it to Google App Engine to make sure everything we've done so far is working. The Google App Engine Hello, World! documentation already shows how to create, test, and deploy an application from the command line. I'm going to show how to do the same simple application from the Google App Engine Launcher desktop program that we downloaded and installed in the last section.

Launch the Google App Engine Launcher program and choose Create New Applicaton... from the File menu.


You can provide a new name, or the same name you provided as an Application Identifier when you created your Google App Engine account earlier. Also provide a parent directory on your system for the project files to be stored, then click the Create button. If you go to the project directory, you'll see that several files were created for you.

  • app.yaml - The configuration file that maps URLs to handler scripts. This file also contains your unique application identifier and a version number that allows you to roll your app back to specific versions from the GAE admin console.
  • favicon.ico - The icon that will be displayed in browser tabs when your app is viewed. The Google App Engine icon is the default.
  • index.yaml - A configuration file that specifies which indexes your app uses in the App Engine datastore. Not used in this application.
  • main.py - A Python script that handles requests using the webapp2 framework.

Select the newly created application in GAE Launcher and click the Run button. If you open the log console you'll be able to see what commands are run.  When the application is running, you'll be able to go to http://localhost:8080/ in your browser and see the program's output. (If instead of a running app you get an error message at this point, you may need to go to Edit > Preferences and set the correct Python path.)

Next, click the Deploy button in GAE Launcher. You'll be prompted for your email address and password, then all of the application files will be uploaded to Google App Engine.  Once this is done, you can visit your application's public URL to view the program's output again. Congratulations, the app is now online!

Getting your Twitter Account Authentication Credentials

Before you can post tweets from your Google App Engine project, you'll need to set up some authentication credentials with your Twitter account.  Sign in to the Twitter Developers page, choose My applications from the menu at the top right, then create a new application.  You'll need to change the application type to Read and Write on the settings tab in order to give the new application access to post tweets to your Twitter account.

You'll need the Consumer key and Consumer secret from the OAuth settings section, and you'll need to create an Access token and Access token secret. Be careful! You want to keep these values secret so that other people can't use them to post status updates to your Twitter account.  I keep them in a separate properties file so that they stay out of my source code, and don't accidentally get published where people can access them. You'll see how these values are used in a later section, when we look at the BountyBot code.

Tweeting in Python

You'll need a Twitter API wrapper in order to post tweets in Python. I used tweepy when creating BountyBot because it makes posting a tweet as simple as possible. Once a status message is composed, posting it on Twitter can be done in four lines of code using tweepy, and three of those are for authentication. It doesn't get much simpler. Conveniently, tweepy is also compatible with Python 2.7.

You can download tweepy by following the instructions on the GitHub project linked above. Since it needs to be uploaded to Google App Engine in order to be used by the web app, I just copied the entire tweepy directory into the project directory for my GAE application.

What does a Twitter 'bot tweet about?

Even if it's just for your own personal amusement, you're going to want to give your 'bot something interesting to tweet about. Fortunately, there are a lot of 'bots already on Twitter to look to for inspiration.  There are 'bots that tweet weather updates, breaking news headlines, stock quotes, the price of Bitcoin, and other seemingly random facts.You're really only limited by your imagination and Twitter's 140-character post limit. Check out sites like ProgrammableWeb, Data.gov, and World Bank for thousands of data sets and APIs to use.

Stack Overflow and Bounties

The Twitter 'bot I'm going to use for this demonstration gets its information from Stack Overflow using the Stack Exchange API. Stack Overflow is a question and answer site for programmers. Professional programmers and students post questions about code that they're writing for other programmers in the community to answer. The best answers get voted up, earning reputation points for the person who posted the answer. If a question doesn't get a good answer for some time, a "bounty" of bonus reputation can be placed on the question by anyone who wants to get an answer (provided they have the extra reputation to spend on a bounty).

Bounties last for seven days, unless the person who placed the bounty awards it early. You can view all of the questions that have open bounties on the featured questions tab.  Since Stack Overflow is a very active site (thousands of questions are posted every day), it sees about 60 bounties posted per day on average. This is why it's convenient to have a Twitter 'bot that posts links to only the most interesting bounty questions (as determined by the amount of the bounty and the number of upvotes the question receives).

All of the questions and answers posted on Stack Overflow are accessible through the Stack Exchange API, including a method for returning information about questions with active bounties. The Python code we'll look at in the next section will call this API method to get all of the bounties posted in the past 8 hours.

Putting it all together

Now that all the pieces are in place, we can see how they all fit together. You can take a look at the full code for BountyBot on GitHub, and I'll explain several key points here.

The tweet_bounty.py file contains all of the updated code for BountyBot to run on Google App Engine. It follows the same basic structure as the "Hello, World!" example that we looked at earlier. The script contains a class named TweetBounty that extends the webapp2.RequestHandler class. The get method of this class is configured to handle requests.

The get method queries the Stack Exchange API for the most recently posted bounties, finds the most interesting bounty in that list, formats it into a 140-character (maximum) message, then posts that message as a status update to Twitter.

  • request_bounties - Requests a list of bounty questions from the Stack Exchange API. The most recent bounties are those that will expire in one week, so the time stamps passed to this method form an eight hour window that ends one week from the current time and date.
  • find_max - Loops through the list of bountied questions and returns the one with the highest bounty amount. Upvotes on the questions are used to break ties.
  • format_status_msg - Takes the maximum bounty question and formats it into a 140-character message for posting to Twitter. (Question title, short link to the question, bounty amount, and most relevant tags that will fit in the 140-character limit.)
  • tweet - Takes the formatted status message and posts it to the Twitter account whose authentication credentials are supplied in the settings.cfg file.
The tweet function is where the magic happens, so it's worth taking a closer look at it here.

# Update the Twitter account authorized
# in settings.cfg with a status message.
def tweet(status):
    config = ConfigParser.RawConfigParser()
    config.read('settings.cfg')
    
    # http://dev.twitter.com/apps/myappid
    CONSUMER_KEY = config.get('Twitter OAuth', 'CONSUMER_KEY')
    CONSUMER_SECRET = config.get('Twitter OAuth', 'CONSUMER_SECRET')
    # http://dev.twitter.com/apps/myappid/my_token
    ACCESS_TOKEN_KEY = config.get('Twitter OAuth', 'ACCESS_TOKEN_KEY')
    ACCESS_TOKEN_SECRET = config.get('Twitter OAuth', 'ACCESS_TOKEN_SECRET')

    auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
    auth.set_access_token(ACCESS_TOKEN_KEY, ACCESS_TOKEN_SECRET)
    api = tweepy.API(auth)
    result = api.update_status(status)

The tweet function takes in status as an argument and posts it to Twitter. The first two lines of the function read a configuration file that contain the Twitter authentication credentials we set up earlier, and the next four lines read those values from the file. If we were going to reuse these values it would be worth it to pull this part of the code out into a separate function, but since this script only accesses Twitter once, we can do it all in one place.

The last four lines use the authentication credentials loaded from the file to prove to Twitter that the script has permission to post status messages on the associated Twitter account, then updates the status of that account with the status message passed in as an argument. That's all there is to it.

Scheduling tweets with cron

Now that we've got a web app up and running that can post to Twitter, all that's left is to set up a schedule for when those tweets should be posted using cron. You do this on Google App Engine by creating a cron.yaml file that specifies when you want tasks to be executed. Tasks for BountyBot in cron.yaml have the following format:

- description: daily 1PM tweet
  url: /tweet_bounty
  schedule: every day 13:00
  timezone: America/New_York

The first line is just a description and doesn't change the way cron behaves. The second line is the URL of the script that needs to be run on a schedule. It's a relative URL from the base of the application on GAE. The next line tells cron when to run the script, and the last one specifies what timezone that schedule is based in. If you leave out the timezone, GAE will assume UTC. Since I want BountyBot to post tweets three times per day, I have three entries in my cron.yaml file, one each for 5AM, 1PM, and 9PM in my timezone.

Important: Since scheduled tasks usually do things that you only want to be done on a schedule and not by users visiting the URL of the script (like posting status updates to your Twitter account, for one example), it's important to secure those scripts so that they can only be run by a site administrator (you) and the task scheduler (cron). You can secure a script by adding login: admin to its entry in your app.yaml file.

- url: /tweet_bounty
  script: tweet_bounty.app
  login: admin

You can read more about formatting tasks in cron.yaml in the GAE article Scheduled Tasks With Cron for Python.

Finally...

 If you made it this far, congratulations, you're a Google App Engine expert! Just kidding. This article really only scratches the surface when it comes to Google App Engine, but it does serve as a quick start guide that you can use to get something up and running in a weekend. Be sure to visit the Google App Engine developer's guide to find much more in-depth tutorials, sample code, and videos that explain all the features of the platform. If you run into problems, remember that you can always search or ask the experts on Stack Overflow for some help. Good luck!

Tuesday, April 30, 2013

SICP 2.66: Sets and information retrieval

From SICP section 2.3.3 Sets and information retrieval

The final part of section 2.3.3 asks us to consider a database that contains records, each of which has a key and some data. If the database is represented as an unordered list, a record can be looked up by its key using the following procedure.

(define (lookup given-key set-of-records)
  (cond ((null? set-of-records) false)
        ((equal? given-key (key (car set-of-records)))
         (car set-of-records))
        (else (lookup given-key (cdr set-of-records)))))

We can define simple procedures for making a record out of a key and its data, and for extracting the key and data from an existing record in order to test the procedure above.

(define (key record) (car record))
(define (data record) (cdr record))
(define (make-record key data) (cons key data))

(define database
  (list (make-record 1 'Bill)
        (make-record 2 'Joe)
        (make-record 3 'Frank)
        (make-record 4 'John)))
  
> (lookup 3 database)
'(3 . Frank)
> (data (lookup 1 database))
'Bill

Exercise 2.66 asks us to implement the lookup procedure for the case where the set of records is structured as a binary tree, ordered by the numerical values of the keys.

We can start by including the list->tree and partial-tree procedures given for exercise 2.64, along with a few required supporting procedures.

(define (entry tree) (car tree))
(define (left-branch tree) (cadr tree))
(define (right-branch tree) (caddr tree))
(define (make-tree entry left right)
  (list entry left right))
  
(define (list->tree elements)
  (car (partial-tree elements (length elements))))

(define (partial-tree elts n)
  (if (= n 0)
      (cons '() elts)
      (let ((left-size (quotient (- n 1) 2)))
        (let ((left-result (partial-tree elts left-size)))
          (let ((left-tree (car left-result))
                (non-left-elts (cdr left-result))
                (right-size (- n (+ left-size 1))))
            (let ((this-entry (car non-left-elts))
                  (right-result (partial-tree (cdr non-left-elts)
                                              right-size)))
              (let ((right-tree (car right-result))
                    (remaining-elts (cdr right-result)))
                (cons (make-tree this-entry left-tree right-tree)
                      remaining-elts))))))))

This makes it easier to convert the existing database to one structured as a binary tree.

> (define tree-db (list->tree database))
> tree-db
'((2 . Joe) ((1 . Bill) () ()) ((3 . Frank) () ((4 . John) () ())))

Finally, we can write the new implementation of lookup using element-of-set? as a guide.

(define (lookup given-key set-of-records)
  (cond ((null? set-of-records) #f)
        ((= given-key (key (car set-of-records)))
         (car set-of-records))
        ((< given-key (key (car set-of-records)))
         (lookup given-key (left-branch set-of-records)))
        ((> given-key (key (car set-of-records)))
         (lookup given-key (right-branch set-of-records)))))

> (lookup 3 tree-db)
'(3 . Frank)
> (lookup 1 tree-db)
'(1 . Bill)
> (lookup 5 tree-db)
#f
> (data (lookup 2 tree-db))
'Joe

For links to all of the SICP lecture notes and exercises that I've done so far, see The SICP Challenge.

Sunday, March 24, 2013

SICP 2.63 - 2.65: Sets as binary trees

From SICP section 2.3.3 Example: Representing Sets

So far in this section we've looked at two ways of representing sets. First as unordered lists, then as ordered lists. Now we'll look at how we can represent sets as binary trees, and we'll see what advantages this representation has over ordered lists.

Each node of a tree holds one element of the set, called the "entry" at that node, and a link to each of two other (possibly empty) nodes. The "left" link points to elements smaller than the one at the node, and the "right" link to elements greater than the one at the node. The same set may be represented by a number of different trees. The only requirements for a valid representation is that all elements in the left subtree must be smaller than the node entry and all elements in the right subtree be must be larger than the node entry. Figure 2.16 in the text shows several valid tree representations of the same set of values.



Recall that for an ordered set of n elements, we had to search through (on average) n/2 elements to locate a particular value. We do this by searching through the elements in order. The advantage of a tree representation is that we can cut this effort down to log n if the tree is balanced.


Exercise 2.63 asks us if the following two procedures produce the same results for every tree, and if not how they differ.

(define (tree->list-1 tree)
  (if (null? tree)
      '()
      (append (tree->list-1 (left-branch tree))
              (cons (entry tree)
                    (tree->list-1 (right-branch tree))))))
     
(define (tree->list-2 tree)
  (define (copy-to-list tree result-list)
    (if (null? tree)
        result-list
        (copy-to-list (left-branch tree)
                      (cons (entry tree)
                            (copy-to-list (right-branch tree)
                                          result-list)))))
  (copy-to-list tree '()))

The tree->list-1 procedure checks to see if the tree passed in is null, and if so returns an empty list. If the tree is not null, it creates a list by appending the left branch of the tree, the element at the root node, and the right branch of the tree. Elements of the left and right branches are flattened into lists using recursive calls to tree->list-1. The tree->list-2 procedure defines a helper function copy-to-list that takes the tree and a result-list as arguments. When the tree is null, it returns the result-list that was passed in. The copy-to-list helper function also uses recursive calls to the left and right branches of the tree while building the final result list. These two procedures will produce the same results for every tree.

We're asked to test the two procedures on the trees in figure 2.16.

(define tree1 '(7 (3 (1 () ()) (5 () ())) (9 () (11 () ()))))
(define tree2 '(3 (1 () ()) (7 (5 () ()) (9 () (11 () ())))))
(define tree3 '(5 (3 (1 () ()) ()) (9 (7 () ()) (11 () ()))))

> (tree->list-1 tree1)
'(1 3 5 7 9 11)
> (tree->list-2 tree1)
'(1 3 5 7 9 11)
> (tree->list-1 tree2)
'(1 3 5 7 9 11)
> (tree->list-2 tree2)
'(1 3 5 7 9 11)
> (tree->list-1 tree3)
'(1 3 5 7 9 11)
> (tree->list-2 tree3)
'(1 3 5 7 9 11)

We can see from these results that both procedures return an in-order traversal for every tree.

We're also asked if the two procedures have the same order of growth for a balanced tree, and if not, which one grows more slowly?

We can see from the results above and from inspecting the two procedures that each node of the tree is visited one time by each algorithm. What happens at each of those n steps is subtly different though. The second procedure simply calls cons at each step, which we'll assume is a constant-time operation, so the tree->list-2 procedure has a time complexity of $O(n)$. The first procedure calls append at each step, which we saw in section 2.2.1 has the following definition:

(define (append list1 list2)
  (if (null? list1)
      list2
      (cons (car list1) (append (cdr list1) list2))))

From this definition we can see that the order of growth of append is proportional to the first list argument that's passed in. In the case of tree->list-1, the first list argument is the left branch of the tree, which is about half of a node's elements for a balanced tree. This means that for each recursive call, approximately half of the number of nodes will be in the first list argument as in the previous call. Since the number of elements is cut in half on each of the n calls to append, the tree->list-1 procedure has a complexity of $O(n log n)$ for a balanced tree.


Exercise 2.64 introduces the list->tree procedure, which converts an ordered list to a balanced binary tree using the helper procedure partial-tree that takes as arguments an integer n and list of at least n elements and constructs a balanced tree containing the first n elements of the list.

(define (list->tree elements)
  (car (partial-tree elements (length elements))))

(define (partial-tree elts n)
  (if (= n 0)
      (cons '() elts)
      (let ((left-size (quotient (- n 1) 2)))
        (let ((left-result (partial-tree elts left-size)))
          (let ((left-tree (car left-result))
                (non-left-elts (cdr left-result))
                (right-size (- n (+ left-size 1))))
            (let ((this-entry (car non-left-elts))
                  (right-result (partial-tree (cdr non-left-elts)
                                              right-size)))
              (let ((right-tree (car right-result))
                    (remaining-elts (cdr right-result)))
                (cons (make-tree this-entry left-tree right-tree)
                      remaining-elts))))))))

First we're asked to explain how partial-tree works, then draw the tree produced by list->tree for the list (1 3 5 7 9 11).

The partial-tree procedure works by dividing the list into three parts, a center element (the root node of the tree), everything before the center element, and everything after the center element. All the elements before the center element are then passed to a recursive call to partial-tree to create the left branch of the tree, and all the elements after the center element are passed recursively to partial-tree to create the right branch. These recursive call continue until no elements are remaining, and the balanced binary tree is assembled.

The tree produced by list->tree for the list (1 3 5 7 9 11) is:



To verify this, we can simply call the procedure.

> (list->tree '(1 3 5 7 9 11))
'(5 (1 () (3 () ())) (9 (7 () ()) (11 () ())))

Next we're asked what is the order of growth in the number of steps required by list->tree to convert a list of n elements? The procedure only needs to visit each element of the list once, and it only performs a cons for each element it visits, so the number of steps is proportional to the size of the list, or $O(n)$.


Exercise 2.65 asks us to use the results of the previous two exercises to give $O(n)$ implementations of union-set and intersection-set for sets implemented as (balanced) binary trees.

We implemented union-set for the unordered list representation of sets back in exercise 2.59. This implementation had to check all elements of one set for each element of the other, so it's complexity was $O(n^2)$, quite poor. We improved on this in exercise 2.62 when we wrote an implementation of union-set for the ordered list representation of sets, which was $O(n)$. The text supplied a similar implementation of intersection-set that was also $O(n)$. We could use these ordered set implementations as a guide to writing efficient implementations of union-set and intersection-set for balanced binary trees, but that wouldn't require the results of the previous two exercises. Instead, we can use the $O(n)$ implementations of all of the procedures we've built so far to perform the following steps:

  • Convert the balanced binary trees to ordered lists.
  • Perform the desired operation (union-set or intersection-set).
  • Convert the resulting ordered set back to a balanced binary tree.

(define (union-set tree1 tree2)
  (define (union-list set1 set2)
    (cond ((null? set1) set2)
          ((null? set2) set1)
          ((= (car set1) (car set2))
           (cons (car set1) (union-list (cdr set1) (cdr set2))))
          ((< (car set1) (car set2))
           (cons (car set1) (union-list (cdr set1) set2)))
          (else (cons (car set2) (union-list set1 (cdr set2))))))
  (list->tree (union-list (tree->list-2 tree1)
                          (tree->list-2 tree2))))

(define (intersection-set tree1 tree2)
  (define (intersection-list set1 set2)
    (if (or (null? set1) (null? set2))
        '()    
        (let ((x1 (car set1)) (x2 (car set2)))
          (cond ((= x1 x2)
                 (cons x1
                       (intersection-list (cdr set1)
                                          (cdr set2))))
                ((< x1 x2)
                 (intersection-list (cdr set1) set2))
                ((< x2 x1)
                 (intersection-list set1 (cdr set2)))))))
  (list->tree (intersection-list (tree->list-2 tree1)
                                 (tree->list-2 tree2))))

In the implementations above, I've just defined the earlier ordered set implementations of union-set and intersection-set as helper functions named union-list and intersection-list. With these helper functions, all union-set and intersection-set need to do is convert from tree to list and back from list to tree. We can define a few balanced trees to test that these new implementations work as expected.

> (define evens (list->tree '(0 2 4 6 8 10)))
> (define odds (list->tree '(1 3 5 7 9)))
> (define primes (list->tree '(2 3 5 7 11 13 17 19)))
>
> evens
'(4 (0 () (2 () ())) (8 (6 () ()) (10 () ())))
> odds
'(5 (1 () (3 () ())) (7 () (9 () ())))
> primes
'(7 (3 (2 () ()) (5 () ())) (13 (11 () ()) (17 () (19 () ()))))
>
> (union-set odds evens)
'(5
  (2 (0 () (1 () ())) (3 () (4 () ())))
  (8 (6 () (7 () ())) (9 () (10 () ()))))
> (union-set odds odds)
'(5 (1 () (3 () ())) (7 () (9 () ())))
> (intersection-set evens primes)
'(2 () ())
> (intersection-set odds primes)
'(5 (3 () ()) (7 () ()))
> (intersection-set odds evens)
'()


For links to all of the SICP lecture notes and exercises that I've done so far, see The SICP Challenge.