Adding a custom library and a semi-real application – Indicators notebook(s)

So the indicators notebooks/platform is on github. The one and only bit of analysis is almost completely useless and still requires a fair bit of set up code. The aims in this post are

  1. Add in a custom library for connecting to the data source.
  2. Add an indicator/notebook that does something kind of useful.

Hopefully, this will lay the ground work to start converting old Perl-based stuff into this new environment and start doing something more useful.

Custom library

The aim here is to replace all of the following

import json
 
with open('../config.json') as f:
    conf = json.load(f)
 
from sqlalchemy.engine.url import URL 
from sqlalchemy import create_engine
 
engine = create_engine(URL(**conf))

To something like

import Indicators
 
Indicators.connect();

or something slightly more in the correct Python idiom.

That’s done. Will still need to be refined

  1. Better way to specify the path of the config file.

    Hard coded in a file in git is not a great start.

  2. Does it fit with the Python idiom/approach?

Something a little real

Aim here is to do something a little useful to people (or at least me) and to start playing with the visualisation options.

A need I’ve identified in my new role is to have some overall idea of the number of courses, number of teaching staff, number of students etc at my institution. There doesn’t seem to be any easy way to find out and nobody I talk to knows (with a few exceptions).

Aim here is to develop a notebook that shows the number of courses in Moodle per semester.

Lesson learned: In Python, when doing SQL using like and a wildcard – typically % – you need to use %%. As Python reads % as string formatting i.e.

shortname LIKE 'EDC3100_2015_%%'

Years and terms

The first question is how to capture the individual years and terms that I might want to capture individual data for.

I could hard-code this into the notebook, but it will be different at another insitution – or a different data set. So I’m going to try a kludge, add the data to the JSON config file. Like this

  "allYears" : [ 2012, 2013, 2014, 2015 ],
  "allTerms" : [ "2012_1", "2012_2", "2012_3", 
                 "2013_1", "2013_2", "2013_3",
                 "2014_1", "2014_2", "2014_3",
                 "2015_1", "2015_2", "2015_3" ]

This is ugly and will need to be revised, but I’m in a hurry.

Though this raises the question as to whether or not I can access the data now it’s in the Indicators module.

That exploration leads to an additional function in Indicators module to get this variable. This is probably how the problem with moodle prefixes will get fixed.

Yep done. Able to include a prefix in queries. The value is defined in a new config file lms.conf which look slike

{
  "allYears" : [ 2012, 2013, 2014, 2015 ],
  "allTerms" : [ "2012_1", "2012_2", "2012_3",
                 "2013_1", "2013_2", "2013_3",
                 "2014_1", "2014_2", "2014_3",
                 "2015_1", "2015_2", "2015_3" ],
  "mdl_prefix" : "moodle.mdl_"
}

Using the prefix in code looks like

import Indicators
import pandas as pd
 
engine = Indicators.connect()
configuration = Indicators.config()
prefix = configuration['mdl_prefix']
 
query = "select id,username,firstname,lastname from " + prefix + "user where id<30 "
df = pd.read_sql(query,engine)

Segue – groking data frames

I’m still very new to Python and tend to bring my Perl/PHP frames to programming. Need to spend some time groking “the Python way”. In writing this script it’s become obvious I haven’t yet grokked data frames. Hence reading this on data frames and the following.

Actually, I found this from datacamp.com not at all easily accessible, but there are some nuggets there.

Indexing of data frames has a number of different ways to access elements. iloc is the standard array approach i.e. based on position. indexes can also be more hash like.

Mmmm, more work to do.

The kludgy solution

Have added a Course Offerings notebook that includes code like the following that will produce a simple histogram showing number of courses for each year/term within the database.

This code is the year portion. The term graph is almost identical

yearCounts = {}
 
for year in configuration['allYears']:
    query = "select count(id) from " + prefix + "course where " +\
             " shortname like '%%_" + str(year) + "_%%'"
    df = pd.read_sql( query, engine)
    yearCounts[year] = df.loc[0]
 
counts = pd.DataFrame.from_dict( yearCounts,orient='index')
 
counts.plot(kind='bar')

The code for terms generates output like the following

Course per term

Still quite ugly, there are ways to improve the output. A later task. Along with much more learning about Python etc.

Leave a Reply

Your email address will not be published. Required fields are marked *

css.php