wlee at wikia-inc
Dec 5, 2011, 11:08 AM
Post #10 of 12
Re: Proposal for new table image_metadata
Thanks to everyone for your feedback about this plan.
After careful consideration, we have decided to discontinue our plan. It
does not go far enough to support the XMP standard. Instead, we will use
the field Image.img_metadata for the time being.
On Thu, Dec 1, 2011 at 8:49 PM, bawolff <bawolff+wn [at] gmail> wrote:
> > Message: 7
> > Date: Thu, 1 Dec 2011 12:36:02 -0500
> > From: Chad <innocentkiller [at] gmail>
> > Subject: Re: [Wikitech-l] Proposal for new table image_metadata
> > To: Wikimedia developers <wikitech-l [at] lists>
> > Message-ID: <CADn73rNuSX8RegdUBCeSYG8Mz1qg5SA49VAmB5eD_Y-vB-L4dw [at] mail>
> > Content-Type: text/plain; charset=UTF-8
> > On Thu, Dec 1, 2011 at 12:34 PM, William Lee <wlee [at] wikia-inc> wrote:
> > > I'm a developer at Wikia. We have a use case for searching through
> > > image metadata. This task is challenging now, because the field
> > > Image.img_metadata is a blob.
> > >
> > > We propose expanding the metadata field into a new table. We propose
> > > the name image_metadata. It will have three columns: img_name, attribute
> > > (varchar) and value (varchar). It can be joined with Image on img_name.
> > >
> > > On the application side, LocalFile's load* and decodeRow methods will
> > > need to be changed to support the new table.
> > >
> > > One issue to consider is the file archive. Should we replicate the
> > > table for file archive? Or serialize the data and store it in a new
> > > field (something like fa_metadata)?
> > >
> > > Please let us know if you see any issues with this plan. We hope that
> > > it will be useful to the MediaWiki project, and a candidate to merge back.
> > >
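A minimal sketch of the three-column table proposed above, using Python's sqlite3 module purely for illustration (MediaWiki itself is PHP and normally runs on MySQL). The simplified image table, the index, the file name and the attribute values are all assumptions made for the example, not the real schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Simplified stand-in for the real image table (assumption).
    CREATE TABLE image (img_name TEXT PRIMARY KEY, img_metadata BLOB);
    -- The proposed table: one row per (file, attribute) pair.
    CREATE TABLE image_metadata (
        img_name  VARCHAR(255) NOT NULL,
        attribute VARCHAR(255) NOT NULL,
        value     VARCHAR(255) NOT NULL
    );
    -- Hypothetical index so lookups by attribute/value avoid a full scan.
    CREATE INDEX im_attr_value ON image_metadata (attribute, value);
""")
conn.execute("INSERT INTO image VALUES (?, ?)", ("Example.jpg", b""))
conn.executemany(
    "INSERT INTO image_metadata VALUES (?, ?, ?)",
    [("Example.jpg", "Model", "Hypothetical Camera"),
     ("Example.jpg", "ExposureTime", "1/110")],
)

def find_images(conn, attribute, value):
    """Return names of images whose metadata has attribute == value."""
    rows = conn.execute(
        "SELECT i.img_name FROM image AS i "
        "JOIN image_metadata AS m ON m.img_name = i.img_name "
        "WHERE m.attribute = ? AND m.value = ? ORDER BY i.img_name",
        (attribute, value),
    )
    return [name for (name,) in rows]
```

With this layout, find_images(conn, "ExposureTime", "1/110") becomes an indexed join instead of unserializing every img_metadata blob.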
> > That was part of bawolff's plan last summer for GSoC when he overhauled
> > our metadata support. He got a lot of his project done, but never quite
> > got to this point. Something we'd definitely like to see though!
> > -Chad
> Chad beat me to writing essentially what I was going to say. Basically
> my project ended up being more about extracting more information, and
> I didn't really touch what we did with it after we extracted it.
> However, it should be noted that storing the image metadata nicely is
> a little more complicated than it appears at first glance (and that's
> mostly my fault due to stuff I added during gsoc ;)
> Basically there's 4 different types of metadata values we store (in
> terms of the types of metadata you think of when you think EXIF et al.
> We stuff other stuff into img_metadata for extra fun)
> *Normal values - Things like Shutter speed = 1/110
> *Unordered array - For example, we can extract a "tags" field that's an
> arbitrary list of tags; the subject field (from XMP) is an unordered
> list, etc.
> *Ordered array - Not used for a whole lot. The most prominent example is
> the XMP author field, which is supposed to be an ordered list of authors,
> in order of importance. Honestly, we could just ditch caring about this,
> and probably nobody would notice.
> *Language array - XMP and PNG text chunks support a special value
> where you can specify language alternatives. In essence this looks
> like an associative array of "lang-code" => "translation of field into
> that lang", plus a special fallback "x-default" dummy lang code.
> *Also Contact info and software fields are stored kind of weirdly....
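To make the four kinds of value above concrete, here is a rough sketch of how they might look as plain Python data. This is not how MediaWiki stores them (it uses serialized PHP arrays); the field names and sample values are made up for illustration.

```python
# Hypothetical in-memory shapes for the four kinds of value listed above.
# Field names and sample values are illustrative only.
metadata = {
    "ExposureTime": "1/110",                       # normal value
    "subject": {"cat", "garden"},                  # unordered array
    "author": ["First Author", "Second Author"],   # ordered array
    "description": {                               # language array
        "x-default": "A cat",
        "en": "A cat",
        "fr": "Un chat",
    },
}

def kind_of(value):
    """Classify a value as one of the four kinds described above."""
    if isinstance(value, dict):
        return "language array"
    if isinstance(value, (set, frozenset)):
        return "unordered array"
    if isinstance(value, list):
        return "ordered array"
    return "normal value"
```

Only the first kind maps directly onto a single (attribute, value) row; the other three need either several rows or extra columns.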
> Thus, just storing a table of key/value pairs is kind of problematic:
> how do you store an "array" value? Additionally, you have to consider
> finding info. You probably want to be able to search efficiently
> through lang values in a specific language, or for a specific property
> regardless of language.
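One possible way around the "array" problem, sketched under assumptions: add a lang column and a position column to the key/value table and store one row per array element. The extra columns and the flatten/search helpers are hypothetical, not something proposed in this thread, and sqlite3 is used only so the example runs on its own.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE image_metadata (
        img_name  TEXT NOT NULL,
        attribute TEXT NOT NULL,
        lang      TEXT,      -- NULL unless the value is a language alternative
        position  INTEGER,   -- NULL unless the value is in an ordered array
        value     TEXT NOT NULL
    )
""")

def flatten(img_name, attribute, value):
    """Turn one metadata value into (img_name, attribute, lang, position, value) rows."""
    if isinstance(value, dict):                  # language array
        return [(img_name, attribute, lang, None, text)
                for lang, text in value.items()]
    if isinstance(value, list):                  # ordered array
        return [(img_name, attribute, None, i, item)
                for i, item in enumerate(value)]
    if isinstance(value, (set, frozenset)):      # unordered array
        return [(img_name, attribute, None, None, item)
                for item in sorted(value)]
    return [(img_name, attribute, None, None, value)]  # normal value

rows = flatten("Example.jpg", "description", {"x-default": "A cat", "fr": "Un chat"})
rows += flatten("Example.jpg", "author", ["First Author", "Second Author"])
conn.executemany("INSERT INTO image_metadata VALUES (?, ?, ?, ?, ?)", rows)

def search(conn, attribute, text, lang=None):
    """Find images by attribute value, optionally restricted to one language."""
    sql = ("SELECT DISTINCT img_name FROM image_metadata "
           "WHERE attribute = ? AND value = ?")
    params = [attribute, text]
    if lang is not None:
        sql += " AND lang = ?"
        params.append(lang)
    return [name for (name,) in conn.execute(sql, params)]
```

Passing lang="fr" searches one language only; leaving it out searches the property regardless of language, which covers both cases mentioned above.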
> Also consider how big a metadata field can get. Theoretically it's not
> really limited. While I don't expect it to be huge, > 255 bytes of
> utf-8 seems a totally reasonable size for a value of a metadata field.
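A small illustration of the size concern, assuming the column limit is counted in bytes as the remark above implies (whether a varchar limit counts bytes or characters depends on the database and column type). The helper name and sample strings are made up.

```python
def fits_in_varchar(value, limit=255):
    """Check whether value fits in `limit` bytes once encoded as UTF-8."""
    return len(value.encode("utf-8")) <= limit

ascii_value = "a" * 200      # 200 characters, 200 bytes in UTF-8
cyrillic_value = "я" * 200   # 200 characters, 400 bytes in UTF-8
```

Here ascii_value fits but cyrillic_value does not, even though both are 200 characters long, so a non-Latin description can overflow a 255-byte column well before it looks long.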
> Last of all, you have to keep in mind all sorts of stuff is stored in
> the img_metadata. This includes things like the text layer of Djvu
> files (although arguably that shouldn't be stored there...) and other
> handler specific things (OggHandler stores some very complex
> structures in img_metadata). Of course, we could just keep the
> img_metadata blob there, and simply stop using it for "exif-like"
> data, but continue using it for handler specific ugly metadata that's
> generally invisible to the user [probably a good idea. The two types of
> data are actually quite different].
> > One issue to consider is the file archive. Should we replicate the
> > table for file archive? Or serialize the data and store it in a new
> > field (something like fa_metadata)?
> Honestly, I wouldn't worry about that, especially in the beginning. As
> far as i know, the only place fa_metadata/oi_metadata is used, is that
> you can request it via api (I suppose it's copied over during file
> reverts as well). I don't think anyone uses that field on archived
> images really. (maybe one day bug 26741 will be fixed and this would
> be less of a concern).
> Anyhow, I do believe it would be awesome to store this data better. I
> can definitely think of many uses for being able to efficiently query
> it. (While I'm on the subject, making lucene index it would also be
> p.s. If it's helpful - some of my ideas from last year for making a new
> metadata table are at
> http://www.mediawiki.org/wiki/User:Bawolff/metadata_table and the
> . However, they're probably over-complicated/otherwise not ideal (I
> was naive back then ;). They also try and be able to encode anything
> encodable by XMP, which is most definitely a bad idea, since XMP is
> very complicated...