This is progress report #3 for Project Grumpy.

Since report two there has been a big change of focus in development: we
decided to drop our beloved and also greatly hated NoSQL approach (MongoDB)
and instead go forward with a regular RDBMS, which in our case is good old
PostgreSQL.

Although there were some compelling arguments for MongoDB (ease of use being
my favorite), the biggest nail in its coffin was the lack of "support" for it
from Gentoo's infra team. For them it was just another application they would
have to take care of, and around the interwebs there are lots of 'MongoDB ate
my data' reports on how error-prone MongoDB actually is (although the data
volumes in most of these cases were so high that I cannot really imagine
Grumpy running into these problems). Still, I can really understand their
concerns. Besides, if you take a look at the list of commits in MongoDB's
official development repository [1], you can see why people are a bit
concerned ;)

[1] http://github.com/mongodb/mongo/commits

Therefore we switched over to PostgreSQL, using SQLAlchemy as the glue layer
between the database and the application. SQLAlchemy is a blessing because,
using its object-relational mapper, you do not actually have to write any SQL
(just take a peek at the 'grumpy_sync' utility).
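
To give a rough idea of what "no SQL" looks like in practice, here is a
minimal sketch. It is not taken from 'grumpy_sync': the Package model, its
columns and the in-memory SQLite engine are assumptions made up for
illustration (Grumpy itself targets PostgreSQL), and it assumes SQLAlchemy
1.4 or newer.

```python
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()


class Package(Base):
    """Hypothetical table for illustration; not Grumpy's actual schema."""
    __tablename__ = "packages"

    id = Column(Integer, primary_key=True)
    category = Column(String, nullable=False)
    name = Column(String, nullable=False)


# In-memory SQLite keeps the sketch self-contained.
engine = create_engine("sqlite://")
Base.metadata.create_all(engine)  # emits CREATE TABLE for us
Session = sessionmaker(bind=engine)

session = Session()
session.add_all([
    Package(category="dev-lang", name="python"),
    Package(category="dev-db", name="postgresql"),
])
session.commit()

# Querying is done through Python objects, not hand-written SQL.
found = session.query(Package).filter_by(category="dev-lang").all()
names = [pkg.name for pkg in found]
print(names)
session.close()
```

Swapping the engine URL for a PostgreSQL one should be the only change needed
to point the same model at a real database.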

Progress so far
===============

So far I have implemented a portage -> database sync utility that is used to
keep the database in sync with portage content. Although it seems to handle most
of the various portage quirks (like package moves via 'profiles/updates'), it
still might run into issues in some corner cases, and error recovery is
minimal: it is currently designed to crash with a RuntimeError when it
detects anything out of the ordinary.
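
The crash-early handling of package moves can be sketched roughly like this.
This is not the actual syncer code: the apply_moves function is hypothetical,
and it assumes move entries of the form 'move <old> <new>' and a plain set of
'category/name' strings instead of the real database.

```python
def apply_moves(packages, update_lines):
    """Return a new set of 'category/name' strings with moves applied.

    Raises RuntimeError on anything unexpected instead of trying to
    recover, mirroring the crash-early behaviour described above.
    """
    result = set(packages)
    for lineno, line in enumerate(update_lines, 1):
        fields = line.split()
        if not fields or fields[0] != "move":
            continue  # this sketch ignores blank lines and other entry types
        if len(fields) != 3:
            raise RuntimeError(
                "line %d: malformed move entry: %r" % (lineno, line))
        old, new = fields[1], fields[2]
        if new in result:
            raise RuntimeError(
                "line %d: move target already exists: %s" % (lineno, new))
        if old in result:
            # Unknown sources are skipped: the package may already be gone.
            result.remove(old)
            result.add(new)
    return result


moved = apply_moves(
    {"dev-util/foo", "app-misc/bar"},
    ["move dev-util/foo dev-vcs/foo"],
)
print(sorted(moved))
```

Failing loudly keeps a half-applied move from silently corrupting the
database, at the cost of needing a manual look whenever something odd shows up.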

Of course, the data model is far from complete: there is no proper handling
of keywords, and I do not even store ebuild depends, rdepends and licenses in
the database, mainly because I currently don't have any use cases for these.

The syncer can be found under the 'utils' directory of the project.

Future plans
============

As the model and controller are ready, the next stop is to write a
rudimentary web app for browsing portage contents, so people can finally see
that I actually haven't slacked all this time.. :)

Also, during the portage import I noticed some really simple QA issues, like
invalid herd names in 'metadata.xml'. The plan is to write a 'herdcheck'
plugin and implement database storage for these QA issues. And as I cannot
let just anyone write to the database, I need to implement an API to let
plugins interact with the app.
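
The core of such a herd check could look something like the sketch below.
None of this exists yet: the find_invalid_herds function, the sample XML and
the hard-coded set of valid herds are assumptions for illustration only, and
a real plugin would report its findings through the planned API.

```python
import xml.etree.ElementTree as ET


def find_invalid_herds(metadata_xml, valid_herds):
    """Return herd names from a metadata.xml string not in valid_herds."""
    root = ET.fromstring(metadata_xml)
    invalid = []
    for herd in root.iter("herd"):
        name = (herd.text or "").strip()
        if name not in valid_herds:
            invalid.append(name)
    return invalid


# Made-up sample with one misspelled herd name.
SAMPLE = """<pkgmetadata>
  <herd>python</herd>
  <herd>pyhton</herd>
</pkgmetadata>
"""

problems = find_invalid_herds(SAMPLE, valid_herds={"python", "kde"})
print(problems)
```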

Having an API means that I can start integrating with other QA tools out
there, mainly tinderbox.

And finally, testing. I currently have a simple doctesting and auditing (via
PyFlakes) framework in place, but general unit testing is still missing.
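
For those who haven't met doctests: the examples live in the docstring and
are run as tests. A minimal sketch follows; the split_cp helper is made up
for this report and is not part of Grumpy.

```python
import doctest


def split_cp(cp):
    """Split a 'category/package' string into its two parts.

    >>> split_cp("dev-lang/python")
    ('dev-lang', 'python')
    >>> split_cp("python")
    Traceback (most recent call last):
        ...
    ValueError: expected 'category/package', got 'python'
    """
    category, sep, package = cp.partition("/")
    if not sep or not category or not package:
        raise ValueError("expected 'category/package', got %r" % cp)
    return category, package


if __name__ == "__main__":
    # Runs every docstring example in this module; silent on success.
    doctest.testmod()
```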

As you can see, I'm lagging a bit behind my proposed timeline: I still
haven't actually started looking at how to create the 30-day stabilisation
and upstream version checkers, but hopefully I can pick up speed because I
can now say that I have passed the biggest hurdle.. :)

And I have also dropped my 'secret agenda' of documenting my experience with
NoSQL databases as a series of articles written during this project...

Project info
============

The Grumpy Git repository is available from [2].

[2] http://git.overlays.gentoo.org/gitweb/?p=proj/grumpy.git;a=summary

The project's semi-official IRC channel is #gentoo-grumpy on the Freenode
network. If you run into trouble when testing out this project, just ping me
with a message.

PS. Bonus points for those who noticed that I dropped 'weekly' ;)