September 25, 2018 by Roberto Di Remigio
This is part 1 of a series of posts on designing a general input parsing library. By general input parsing library we mean that the format of the input is not fixed by the library itself.
In this post, we will describe the data structure we will use for the input.
In the first installment in this series, we said:
The structure of the input revolves around the concept of keywords and sections. Keywords are the basic entity, sections collect keywords, or other sections. Loosely speaking, keywords are the bottom layer, while sections can be recursive. We’ll detail this point later.
It is now time to give details.
We assume that the input is structured as a collection of sections and keywords. Keywords are 2-tuples (pairs) of a key and a corresponding value:
class Keyword:
    def __init__(self, key, value):
        self.key = key
        self.value = value
Anticlimactic, eh? We could have used a tuple! Or a namedtuple! And I agree, but this is our first attempt at implementing a data structure.
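For comparison, the namedtuple alternative could look roughly like the sketch below. It is not what the rest of this post uses, and the name KeywordTuple is made up for the illustration.

```python
from typing import Any, NamedTuple


class KeywordTuple(NamedTuple):
    """Hypothetical tuple-based stand-in for the Keyword class."""
    key: str
    value: Any


kw = KeywordTuple(key='fuffa', value=1)
key, value = kw          # unpacks like a plain 2-tuple
print(kw.key, kw.value)  # fields are also reachable by name
```

One trade-off to keep in mind: a named tuple is immutable, so a value cannot be reassigned after construction, whereas attributes of the plain class can.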
In the first instance, sections are collections of keywords. Hence a dictionary would suffice to represent them. The keys in the dictionary would be the keys of the keywords, while the values would be the keywords themselves:
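class Section:
    def __init__(self, name, keywords=None):
        self.name = name
        # Default to None: a literal {} default would be shared by all instances
        self.keywords = keywords if keywords is not None else {}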
Question 0 Why not just collect keys and values in the section dictionary? Because the keyword data type might be richer than just a simple 2-tuple. It might contain units of measure, for example. Or type information.
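To make the answer concrete, here is a minimal sketch of what a richer keyword might carry. The class name RichKeyword, the unit and type attributes, and the 'cutoff' example are all assumptions made for illustration; the post itself sticks to the plain key/value pair.

```python
class RichKeyword:
    """Hypothetical richer keyword: key, value, and an optional unit."""

    def __init__(self, key, value, unit=None):
        self.key = key
        self.value = value
        self.unit = unit                  # e.g. 'angstrom'; None if unitless
        self.type = type(value).__name__  # record the value's type by name


cutoff = RichKeyword('cutoff', 2.5, unit='angstrom')
# The section dictionary still maps key -> keyword object,
# so the extra information travels with the keyword.
section_keywords = {cutoff.key: cutoff}
print(section_keywords['cutoff'].unit, section_keywords['cutoff'].type)
```

Storing bare values in the dictionary would leave no natural place for the unit or the type.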
However, we soon come up with the brilliant idea that sections might have subsections. And subsections, sub-subsections! Ad libitum! So really the input is a sort of multi-way (or rose) tree. The root is a section, with multiple sections as nodes and keywords as leaves. And so forth.
A possible implementation of section has a name, a dictionary of keywords, and a dictionary of sections:
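class Section:
    def __init__(self, name, keywords=None, sections=None):
        self.name = name
        # Default to None: literal {} defaults would be shared by all instances
        self.keywords = keywords if keywords is not None else {}
        self.sections = sections if sections is not None else {}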
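Since the input is a tree, a recursive function is the natural way to visit it. The sketch below repeats minimal versions of the two classes so that it runs on its own; the walk helper is hypothetical and not part of the design described here.

```python
class Keyword:
    def __init__(self, key, value):
        self.key = key
        self.value = value


class Section:
    def __init__(self, name, keywords=None, sections=None):
        self.name = name
        self.keywords = keywords if keywords is not None else {}
        self.sections = sections if sections is not None else {}


def walk(section, path=()):
    """Yield (path, value) for every keyword below `section`, depth first."""
    here = path + (section.name,)
    # Keywords are the leaves of the tree
    for kw in section.keywords.values():
        yield here + (kw.key,), kw.value
    # Subsections are inner nodes: recurse into each of them
    for sub in section.sections.values():
        yield from walk(sub, here)


inner = Section('bar', {'mollo': Keyword('mollo', [1, 2, 3])})
root = Section('foo', {'fuffa': Keyword('fuffa', 1)}, {inner.name: inner})
for path, value in walk(root):
    print('/'.join(path), '=', value)
```

Each keyword is reported together with the chain of section names leading to it, which makes the nesting explicit.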
In addition, we want to serialize to a standard text format. We start with JSON since Python has a standard library module for it. For the moment we implement a toJSON method for both Keyword and Section; later we might decide to do something fancier.
Thus bringing it all together:
import json


class Section:
    def __init__(self, name, keywords=None, sections=None):
        self.name = name
        # Default to None: literal {} defaults would be shared by all instances
        self.keywords = keywords if keywords is not None else {}
        self.sections = sections if sections is not None else {}

    def toJSON(self):
        # `default` is called for objects json cannot serialize natively
        return json.dumps(self, default=lambda o: o.__dict__, indent=4)


class Keyword:
    def __init__(self, key, value):
        self.key = key
        self.value = value

    def toJSON(self):
        return json.dumps(self, default=lambda o: o.__dict__, indent=4)

    def __str__(self):
        return 'Key {} with value {} and type {}'.format(self.key, self.value, type(self.value))
where we added a custom __str__ method to the Keyword class.
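The default argument of json.dumps does the heavy lifting in toJSON, so it may be worth seeing in isolation. The sketch below uses a made-up Point class and a named to_dict function in place of the lambda; both are assumptions for illustration only.

```python
import json


class Point:
    """Hypothetical class, used only to illustrate the `default` hook."""

    def __init__(self, x, y):
        self.x = x
        self.y = y


def to_dict(obj):
    # json.dumps calls this only for objects it cannot serialize natively
    print('default hook called for', type(obj).__name__)
    return obj.__dict__


try:
    json.dumps(Point(1, 2))  # no hook: json does not know about Point
except TypeError as err:
    print('without the hook:', err)

# With the hook, the Point nested inside the dict is converted on the fly
dumped = json.dumps({'origin': Point(0, 0)}, default=to_dict)
print(dumped)
```

Because the dictionary returned by the hook is itself encoded, and the hook is invoked again for any object found inside it, the same one-liner handles sections nested to any depth.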
For keywords:
kw = Keyword('fuffa', 1)
print(kw)
dump_kw = kw.toJSON()
print('Dump a keyword\n{}\n'.format(dump_kw))
kw_vect = Keyword('mollo', [1, 2, 3])
print(kw_vect)
dump_kw_vect = kw_vect.toJSON()
which outputs:
Key fuffa with value 1 and type <class 'int'>
Dump a keyword
{
    "key": "fuffa",
    "value": 1
}
Key mollo with value [1, 2, 3] and type <class 'list'>
For sections:
sect2 = Section('bar', {kw_vect.key : kw_vect}, {})
dump_sect2 = sect2.toJSON()
print('Dump a section\n{}\n'.format(dump_sect2))
sect1 = Section('foo', {kw.key : kw}, {sect2.name : sect2})
Note how sect2 is a subsection of sect1 but does not have subsections itself. The corresponding JSON for sect2 looks as follows:
Dump a section
{
    "name": "bar",
    "keywords": {
        "mollo": {
            "key": "mollo",
            "value": [
                1,
                2,
                3
            ]
        }
    },
    "sections": {}
}
Finally, the whole input is built simply as yet another section:
inpt = Section('inpt', {kw.key : kw}, {sect1.name : sect1})
dumped = inpt.toJSON()
print('Dump whole input\n{}\n'.format(dumped))
and its JSON shows no particular surprises:
Dump whole input
{
    "name": "inpt",
    "keywords": {
        "fuffa": {
            "key": "fuffa",
            "value": 1
        }
    },
    "sections": {
        "foo": {
            "name": "foo",
            "keywords": {
                "fuffa": {
                    "key": "fuffa",
                    "value": 1
                }
            },
            "sections": {
                "bar": {
                    "name": "bar",
                    "keywords": {
                        "mollo": {
                            "key": "mollo",
                            "value": [
                                1,
                                2,
                                3
                            ]
                        }
                    },
                    "sections": {}
                }
            }
        }
    }
}
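Going the other way, json.loads returns plain nested dictionaries, so the tree has to be rebuilt by hand. The following is only a sketch of how that might be done: section_from_dict is a hypothetical helper, and it assumes the exact layout produced by toJSON (name, keywords and sections entries, with each keyword stored as a key/value pair).

```python
import json


class Keyword:
    def __init__(self, key, value):
        self.key = key
        self.value = value


class Section:
    def __init__(self, name, keywords=None, sections=None):
        self.name = name
        self.keywords = keywords if keywords is not None else {}
        self.sections = sections if sections is not None else {}


def section_from_dict(data):
    """Rebuild a Section tree from the dictionary layout shown in the dumps."""
    keywords = {k: Keyword(v['key'], v['value'])
                for k, v in data['keywords'].items()}
    # Subsections have the same layout, so recurse
    sections = {k: section_from_dict(v)
                for k, v in data['sections'].items()}
    return Section(data['name'], keywords, sections)


text = '''
{"name": "foo",
 "keywords": {"fuffa": {"key": "fuffa", "value": 1}},
 "sections": {"bar": {"name": "bar",
                      "keywords": {"mollo": {"key": "mollo", "value": [1, 2, 3]}},
                      "sections": {}}}}
'''
restored = section_from_dict(json.loads(text))
print(restored.sections['bar'].keywords['mollo'].value)
```

The recursion mirrors the structure of the data: keywords are rebuilt directly, while each subsection is handed back to the same function.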
Get in touch! We are Roberto Di Remigio and Radovan Bast.