
brion at pobox
Jul 17, 2013, 1:22 PM
Post #5 of 6
Re: [Wikisource-l] Proofread extension "extraction" of OCR text in Djvu
Yeah, Linus is kind of an asshole too. I don't see that as something to emulate.

-- brion

On Wed, Jul 17, 2013 at 1:10 PM, David Cuenca <dacuetu [at] gmail> wrote:

> Now that you mention it...
>
> http://linux.slashdot.org/story/13/07/15/2316219/kernel-dev-tells-linus-torvalds-to-stop-using-abusive-language
>
> Micru
>
> On Wed, Jul 17, 2013 at 11:36 AM, Brion Vibber <brion [at] pobox> wrote:
>
> > I'm not sure his attitude will encourage people to work with him to his
> > specifications.
> >
> > -- brion
> >
> > On Wed, Jul 17, 2013 at 8:12 AM, David Cuenca <dacuetu [at] gmail> wrote:
> >
> > > I'm forwarding this message by George Orwell III on en-ws [1]. I think
> > > it is extremely important as it offers an insight about what is wrong
> > > with Djvu handling on Wikisource.
> > >
> > > "We/you are losing the X-min, Y-min, X-Max & Y-max (mapping coordinates)
> > > because the original PHP contributing a-hole for the DjVu routine on our
> > > servers never bothered to finish the part where the internal DjVu text
> > > layer is converted to a (coordinate rich) XML file using the existing
> > > DjVuLibre software package because, at the time, the software had issues.
> > >
> > > "That faulty DjVuLibre version was the equivalent of 4,317 versions ago
> > > and the issue has been long fixed now EXCEPT that the .DTD file needed
> > > to base the plain-text to XML conversion on still has the wrong 'folder
> > > path' on local DjVuLibre installs (if this is true on server installs as
> > > well, I cannot say for sure). Once I copied the folder to the [wrong]
> > > folder path, I was able to generate the XMLs all day long. These XMLs
> > > are just like the ones IA generates during their process (in addition to
> > > the XML that ABBYY generates for them).
> > >
> > > "So it's not that we as a community decided not to follow through with
> > > (coordinate rich) XML generation but got stuck with the plain-text dump
> > > workaround due to a DjVuLibre problem that no longer exists. Plus, the
> > > guy who created the beginnings of this fabulous disaster was like a tick
> > > with an attention span deficit and moved on to conjuring up some other
> > > blasted thing or another instead of following up on his own workaround &
> > > finishing the XML coding portion once the DjVuLibre glitch was fixed.
> > > -- 15:16, 15 July 2013 (UTC)
> > >
> > > [1]
> > > http://en.wikisource.org/wiki/Wikisource:Scriptorium#EPUB.2FHTML_to_Wikitext
> > >
> > > On Wed, Jul 17, 2013 at 6:57 AM, Alex Brollo <alex.brollo [at] gmail>
> > > wrote:
> > >
> > > > Just a brief comment about the djvu text layer, using IA files to dig
> > > > deeper into the topic.
> > > >
> > > > FineReader OCR stores incredibly detailed information in a proprietary
> > > > format; then, various FineReader versions export part of this
> > > > extremely rich set of information into different outputs - one of them
> > > > being the djvu text layer. It's worth noting that even if any
> > > > information stored in the djvu text layer can be extracted and used,
> > > > the set of information wrapped into the djvu text layer (both in
> > > > lisp-like format and in xml format) is only a minor subset of the
> > > > original OCR information.
> > > >
> > > > If someone is interested in getting much more information, they can
> > > > find it in the abbyy.xml output; Internet Archive offers it as
> > > > abbyy.gz in the list of exportable files.
> > > > It's a very heavy and complex xml structure, but it is possible to
> > > > parse it, and to extract from it any information wrapped into the djvu
> > > > text layer and much more - most interestingly, wordPenalty, that is,
> > > > word by word, a summary of the degree of uncertainty of the OCR
> > > > recognition of the whole word.
> > > >
> > > > We (Aarti and I) are digging into this mess, with fast preliminary
> > > > results; you can see in [[it:w:Utente:Alex brollo/Sandbox]] some brief
> > > > pieces of text extracted from abbyy.gz, where doubtful words (in the
> > > > opinion of the OCR software) are red. They can be easily managed by
> > > > VisualEditor, since the marking comes from a simple span tag.
> > > >
> > > > Now, I'm waiting for Aarti's work; as soon as a VisualEditor for
> > > > nsPage runs, it will be possible to extract text by bot from abbyy.gz
> > > > (if the work comes from IA) and to upload such text as OCR.
> > > >
> > > > Alex
> > > >
> > > > 2013/7/16 David Cuenca <dacuetu [at] gmail>
> > > >
> > > >> Hi Aubrey,
> > > >> Thanks for the heads-up, I have CC'ed Sébastien from fr-ws, he worked
> > > >> on the djvu text extraction/merging and he was interested in
> > > >> following up on that. Maybe he has some fresh ideas about it.
> > > >>
> > > >> Micru
> > > >>
> > > >> On Tue, Jul 16, 2013 at 10:24 AM, Andrea Zanni
> > > >> <zanni.andrea84 [at] gmail> wrote:
> > > >>
> > > >>> Hi David, Aarti, thibaud and Tpt,
> > > >>> please look at this thread:
> > > >>>
> > > >>> http://en.wikisource.org/wiki/Wikisource:Scriptorium#EPUB.2FHTML_to_Wikitext
> > > >>> especially the last message.
> > > >>>
> > > >>> It seems George Orwell III knows his stuff about Djvu and the
> > > >>> Proofread extension,
> > > >>> and it's probably worth digging into this "layer text" djvu thing.
> > > >>>
> > > >>> Even if I might dream of an ideal solution (a "layered structure"
> > > >>> for wikisource, in which text can be marked up several times in
> > > >>> different layers), that is probably very far away.
> > > >>>
> > > >>> But it's still important to pave the way for further improvements, I
> > > >>> guess:
> > > >>> losing all the information from a formatted, mapped IA djvu is not a
> > > >>> good thing to do, IMHO.
> > > >>> And the Visual Editor could help us, in the future, to keep some of
> > > >>> that information (italics, bold, etc.)
> > > >>>
> > > >>> I know Aarti spoke with Alex about abbyy.xml: is it possible to do
> > > >>> something with it?
> > > >>>
> > > >>> Aubrey
> > > >>>
> > > >>
> > > >> --
> > > >> Etiamsi omnes, ego non
> > > >> _______________________________________________
> > > >> Wikisource-l mailing list
> > > >> Wikisource-l [at] lists
> > > >> https://lists.wikimedia.org/mailman/listinfo/wikisource-l
> > >
> > > --
> > > Etiamsi omnes, ego non
> > > _______________________________________________
> > > MediaWiki-l mailing list
> > > MediaWiki-l [at] lists
> > > https://lists.wikimedia.org/mailman/listinfo/mediawiki-l
>
> --
> Etiamsi omnes, ego non
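
For anyone who wants to try the idea Alex describes in the quoted message (reading a per-word penalty out of the ABBYY XML and wrapping doubtful words in a span tag), here is a minimal sketch in Python. Everything specific in it is an assumption for illustration: the tiny inline sample is not a real abbyy.gz file (real files are gzip-compressed, namespaced and much larger), the attribute names wordStart and wordPenalty follow the naming used in the thread and may differ between FineReader versions, and the threshold of 20 and the class name ocr-doubtful are hypothetical. It is not the code Alex and Aarti are using.

```python
import html
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified sample (assumption, not real IA output).
SAMPLE = """<document>
  <page>
    <line>
      <charParams wordStart="true" wordPenalty="0">T</charParams>
      <charParams wordStart="false" wordPenalty="0">h</charParams>
      <charParams wordStart="false" wordPenalty="0">e</charParams>
      <charParams wordStart="false"> </charParams>
      <charParams wordStart="true" wordPenalty="40">c</charParams>
      <charParams wordStart="false" wordPenalty="40">a</charParams>
      <charParams wordStart="false" wordPenalty="40">t</charParams>
    </line>
  </page>
</document>"""


def local_name(tag):
    """Strip an XML namespace prefix such as '{uri}line' down to 'line'."""
    return tag.rsplit("}", 1)[-1]


def words_with_penalty(xml_text):
    """Return one list per <line>, each holding (word, penalty) tuples.

    The penalty of a word is taken as the largest wordPenalty value seen
    on any of its characters (an assumption made for this sketch).
    """
    root = ET.fromstring(xml_text)
    lines = []
    for line in root.iter():
        if local_name(line.tag) != "line":
            continue
        words, chars, penalty = [], [], 0
        for cp in line.iter():
            if local_name(cp.tag) != "charParams":
                continue
            ch = cp.text or ""
            starts_word = cp.get("wordStart") == "true"
            # A new word start or a whitespace character closes the current word.
            if (starts_word or ch.isspace()) and chars:
                words.append(("".join(chars), penalty))
                chars, penalty = [], 0
            if not ch or ch.isspace():
                continue
            chars.append(ch)
            penalty = max(penalty, int(cp.get("wordPenalty", "0")))
        if chars:
            words.append(("".join(chars), penalty))
        lines.append(words)
    return lines


def mark_doubtful(words, threshold=20):
    """Join words with spaces, wrapping those above the threshold in a span."""
    out = []
    for word, penalty in words:
        text = html.escape(word)
        if penalty > threshold:
            text = f'<span class="ocr-doubtful">{text}</span>'
        out.append(text)
    return " ".join(out)


for line_words in words_with_penalty(SAMPLE):
    print(mark_doubtful(line_words))
```

Taking the maximum penalty per word is only one possible choice; if the real files repeat the same value on every character of a word, reading it from the first character would give the same result.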