This is weekly progress report no. 2 for Project Grumpy.

As reported previously, I am building a system that indexes Portage packages
and related metadata to make package maintainership a bit easier for
developers.

First, a few words about the document metadata storage. For this project, the
plan is to use a document-oriented and schema-free database (MongoDB) instead
of a regular relational database system (like SQLite or PostgreSQL).

This also means that we can use a single document collection that holds the
whole ebuild tree, with one document per "category/package".

Each document in the collection is just a JSON-formatted dictionary with the
following structure (beware, this is a work in progress, so some things are
still missing)::

	{
		# "category/package" (primary index, unique)
		'_id'			: string,

		# Version of the schema, used internally (just in case)
		'schema_ver'	: integer,

		# Package category
		'cat'			: string,

		# Package name
		'pkg'			: string,

		## Data from metadata.xml
		# List of herds maintaining this package
		'herds'			: [ string, ... ],
		# Long description of the package
		'ldesc' 		: string,
		# List of maintainers (by email addresses)
		'maintainers' 	: [ string, ... ],

		## Data from ebuilds itself (but should be general)
		# Description
		'desc'			: string,
		# Upstream url(s) (FIXME: Do we need list here?)
		'homepage'		: string,

		# Array of all the package versions and their specific info
		'ebuilds'		: [
			{
				# Package version (from category/package-version)
				'version'	: string,

				# EAPI version
				'eapi'		: integer,
				# List of USE flags supported by this ebuild
				'iuse'		: [ string, ... ],
				# Package keywords ("x86", "~amd64", ...)
				'keywords'	: [ string, ... ],
				# Licenses
				'licence'	: [ string, ... ],
				# Package slot
				'slot'		: string,

				# Need to figure out proper structure for these, so we can
				# also map out USE flags ;)
				'depend'	: TODO!!!,
				'rdepend'	: TODO!!!
			},
			...
		]
	}
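
To make the structure above more concrete, here is a minimal Python sketch that
builds one such document as a plain dictionary. The helper name
``make_package_doc``, the ``SCHEMA_VER`` value and every field value in the
sample are assumptions made up for illustration; they do not describe a real
package or Grumpy's actual code.

```python
# Hypothetical sketch: build a package document shaped like the structure above.
# All names and values here are invented for illustration only.
SCHEMA_VER = 1  # assumed value; the report does not state the real one


def make_package_doc(cat, pkg, ebuilds, **meta):
    """Return a dictionary keyed by "category/package"."""
    return {
        "_id": f"{cat}/{pkg}",
        "schema_ver": SCHEMA_VER,
        "cat": cat,
        "pkg": pkg,
        # Data that would come from metadata.xml
        "herds": meta.get("herds", []),
        "ldesc": meta.get("ldesc", ""),
        "maintainers": meta.get("maintainers", []),
        # General data that would come from the ebuilds
        "desc": meta.get("desc", ""),
        "homepage": meta.get("homepage", ""),
        # One embedded dictionary per package version
        "ebuilds": list(ebuilds),
    }


doc = make_package_doc(
    "dev-python",
    "example",
    ebuilds=[
        {"version": "1.0", "eapi": 0, "iuse": [],
         "keywords": ["x86", "amd64"], "licence": ["BSD"], "slot": "0"},
        {"version": "1.1", "eapi": 2, "iuse": ["doc"],
         "keywords": ["~x86", "~amd64"], "licence": ["BSD"], "slot": "0"},
    ],
    herds=["python"],
    maintainers=["someone@example.org"],
    desc="Hypothetical package used only for illustration",
)

print(doc["_id"])
print([e["version"] for e in doc["ebuilds"]])
```

Note that a single package document carries several ebuilds, each with its own
EAPI, which matters for the queries that follow.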

So how about querying the data? That's easy (note that we are using the
MongoDB shell here). Say a developer wants to know which packages he or she is
supposedly maintaining::

	> db.ebuilds.find({'maintainers' : '...@gentoo.org' })
	{... document data ...} # (Too much info :) )
	> db.ebuilds.find({'maintainers' : '...@gentoo.org' }).count()
	7
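
The query above works because MongoDB matches a scalar value against a list
field when any element of the list equals it. The following is a rough
in-memory Python equivalent, not the MongoDB implementation; the function name
``find_by_maintainer`` and the sample documents are hypothetical.

```python
# Hypothetical in-memory sketch of "find packages by maintainer".
# A document matches if the address appears anywhere in its 'maintainers' list.
def find_by_maintainer(docs, email):
    return [d for d in docs if email in d.get("maintainers", [])]


# Invented sample data, for illustration only.
sample_docs = [
    {"_id": "dev-python/a", "maintainers": ["alice@example.org"]},
    {"_id": "dev-python/b", "maintainers": ["bob@example.org", "alice@example.org"]},
    {"_id": "dev-python/c", "maintainers": []},
]

mine = find_by_maintainer(sample_docs, "alice@example.org")
print(len(mine))
print([d["_id"] for d in mine])
```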

And the results come back fast. I mean really fast.

OK, how about checking how many packages under 'dev-python' use a specific
EAPI version::

	> db.ebuilds.find({'cat' : 'dev-python', 'ebuilds.eapi' : 0}).count()
	202
	> db.ebuilds.find({'cat' : 'dev-python', 'ebuilds.eapi' : 1}).count()
	3
	> db.ebuilds.find({'cat' : 'dev-python', 'ebuilds.eapi' : 2}).count()
	255
	> db.ebuilds.find({'cat' : 'dev-python', 'ebuilds.eapi' : 3}).count()
	125
	> db.ebuilds.find({'cat' : 'dev-python', 'ebuilds.eapi' : 4}).count()
	0
	> db.ebuilds.find({'cat' : 'dev-python' }).count()
	504
	> 202+3+255+125 - 504
	81

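Ahem... the per-EAPI counts add up to 81 more than the number of packages. A
document matches whenever any one of its ebuilds uses the given EAPI, so a
package whose versions span several EAPIs is counted once for each of them.
Looks like we have a "design issue" with our document structure. So back to the
drawing board.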
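
The mismatch can be reproduced without a database. The sketch below is plain
Python over invented sample data; ``count_packages_with_eapi`` is a
hypothetical stand-in for the ``find(...).count()`` calls above. Counting
individual ebuilds instead of packages is shown only as one possible way to get
totals that add up, not as the structure this project will end up using.

```python
from collections import Counter

# Invented sample data: three packages, five ebuilds in total.
packages = [
    {"_id": "dev-python/a",
     "ebuilds": [{"version": "1.0", "eapi": 0}, {"version": "2.0", "eapi": 2}]},
    {"_id": "dev-python/b",
     "ebuilds": [{"version": "0.5", "eapi": 2}]},
    {"_id": "dev-python/c",
     "ebuilds": [{"version": "1.0", "eapi": 2}, {"version": "1.1", "eapi": 3}]},
]


def count_packages_with_eapi(docs, eapi):
    """Count packages where at least one ebuild uses the given EAPI."""
    return sum(1 for d in docs if any(e["eapi"] == eapi for e in d["ebuilds"]))


# Per-EAPI package counts: a package with mixed EAPIs is counted repeatedly.
per_eapi = {n: count_packages_with_eapi(packages, n) for n in range(5)}
overcount = sum(per_eapi.values()) - len(packages)

# Counting ebuilds instead of packages: every ebuild lands in exactly one bucket.
ebuilds_per_eapi = Counter(e["eapi"] for d in packages for e in d["ebuilds"])
total_ebuilds = sum(len(d["ebuilds"]) for d in packages)

print(per_eapi, overcount)
print(dict(ebuilds_per_eapi), total_ebuilds)
```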

Last week's progress report
===========================

Last week's progress has been a bit slow. I have mostly experimented with the
document structure and with pkgcore's internals. Although I now have the
Portage contents inside the database, the document structure itself is far
from ideal (as you can see from the example with EAPI counts given earlier).

I have committed some of the stuff I have been working on to Grumpy's repo,
so in case you are interested, check it out from [1].

[1] http://git.overlays.gentoo.org/gitweb/?p=proj/grumpy.git;a=summary

First, a warning: the portage->mongodb syncer is slow. I mean really slow: it
takes about 3 hours (or even more) on my laptop to fully scan the contents of
Portage and store the data in the database.

Plans for current week 
======================

1) Speed up the portage syncer
2) Improve document structure