September 25, 2018 by Roberto Di Remigio

Redesigning Getkw _or_ reinventing the wheel: 1. The data structure

This is part 1 of a series of posts on designing a general input parsing library. By general input parsing library we mean that the format of the input is not fixed by the library itself.

In this post, we will describe the data structure we will use for the input.

In the first installment in this series, we said:

The structure of the input revolves around the concept of keywords and sections. Keywords are the basic entity, sections collect keywords, or other sections. Loosely speaking, keywords are the bottom layer, while sections can be recursive. We’ll detail this point later.

It is now time to give details.

The data structure

We assume that the input is structured as a collection of sections and keywords. Keywords are 2-tuples (pairs) of a key and a corresponding value:

class Keyword:
    def __init__(self, key, value):
        self.key = key
        self.value = value

Anticlimactic eh? We could have used a tuple! Or a namedtuple! And I agree, but this is our first attempt at implementing a data structure.

In first instance, sections are collections of keywords. Hence a dictionary would suffice to represent them. The keys in the dictionary would be the the key of the keyword, while the value would the keyword itself:

class Section:
    def __init__(self, name, keywords={}):
        self.name = name
        self.keywords = keywords

Question 0 Why not just collect keys and values in the section dictionary? Because the keyword data type might be richer than just a simple 2-tuple. It might contain units of measure, for example. Or type information.

However, we soon come up with the brilliant idea that sections might have subsections. And subsections, sub-subsections! Ad libitum! So really the input is a sort of multi-way (or rose) tree. The root is a section, with multiple sections as nodes and keywords as leaves. And so forth.

A possible implementation of section has a name, a dictionary of keywords, and a dictionary of sections:

class Section:
    def __init__(self, name, keywords={}, sections={}):
        self.name = name
        self.keywords = keywords
        self.sections = sections

In addition, we want to serialize to a standard text format. We start with JSON since Python has a standard library module for it. For the moment we implement a toJSON class method for both Keyword and Section, later we might decide to do something fancier. Thus bringing it all together:

import json


class Section:
    def __init__(self, name, keywords={}, sections={}):
        self.name = name
        self.keywords = keywords
        self.sections = sections

    def toJSON(self):
        return json.dumps(self, default=lambda o: o.__dict__, indent=4)

class Keyword:
    def __init__(self, key, value):
        self.key = key
        self.value = value

    def toJSON(self):
        return json.dumps(self, default=lambda o: o.__dict__, indent=4)

    def __str__(self):
        return 'Key {} with value {} and type {}'.format(self.key, self.value, type(self.value))

where we added a custom __str__ method to the Keyword class.

The data structure in action

For keywords:

kw = Keyword('fuffa', 1)
print(kw)
dump_kw = kw.toJSON()
print('Dump a keyword\n{}\n'.format(dump_kw))

kw_vect = Keyword('mollo', [1, 2, 3])
print(kw_vect)
dump_kw_vect = kw_vect.toJSON()

which outputs:

Key fuffa with value 1 and type <class 'int'>
Dump a keyword
{
    "key": "fuffa",
    "value": 1
}

Key mollo with value [1, 2, 3] and type <class 'list'>
Dump a section
{
    "name": "bar",
    "keywords": {
        "mollo": {
            "key": "mollo",
            "value": [
                1,
                2,
                3
            ]
        }
    },
    "sections": {}
}_vect.toJSON()

For sections:

sect2 = Section('bar', {kw_vect.key : kw_vect}, {})
dump_sect2 = sect2.toJSON()
print('Dump a section\n{}\n'.format(dump_sect2))

sect1 = Section('foo', {kw.key : kw}, {sect2.name : sect2})

Note how sect2 is a subsection of sect1 but does not have subsections itself. The corresponding JSON for sect2 looks as follows:

Dump a section
{
    "name": "bar",
    "keywords": {
        "mollo": {
            "key": "mollo",
            "value": [
                1,
                2,
                3
            ]
        }
    },
    "sections": {}
}

Finally, the whole input is built simply as yet another section:

inpt = Section('inpt', {kw.key : kw}, {sect1.name : sect1})

dumped = inpt.toJSON()
print('Dump whole input\n{}\n'.format(dumped))

and it JSON shows no particular suprises:

Dump whole input
{
    "name": "inpt",
    "keywords": {
        "fuffa": {
            "key": "fuffa",
            "value": 1
        }
    },
    "sections": {
        "foo": {
            "name": "foo",
            "keywords": {
                "fuffa": {
                    "key": "fuffa",
                    "value": 1
                }
            },
            "sections": {
                "bar": {
                    "name": "bar",
                    "keywords": {
                        "mollo": {
                            "key": "mollo",
                            "value": [
                                1,
                                2,
                                3
                            ]
                        }
                    },
                    "sections": {}
                }
            }
        }
    }
}

Get in touch! We are Roberto Di Remigio and Radovan Bast.