Mailing List Archive: Wikipedia: Mediawiki

Re: [Wikisource-l] Proofread extension "extraction" of OCR text in Djvu

 

 



dacuetu at gmail

Jul 17, 2013, 8:12 AM

Post #1 of 6 (115 views)
Re: [Wikisource-l] Proofread extension "extraction" of OCR text in Djvu

I'm forwarding this message by George Orwell III on en-ws [1]. I think it
is extremely important as it offers insight into what is wrong with
Djvu handling on Wikisource.


"We/you are losing the X-min, Y-min, X-Max & Y-max (mapping coordinates)
because the original PHP contributing a-hole for the DjVu routine on our
servers never bothered to finish the part where the internal DjVu text
layer is converted to a (coordinate rich) XML file using the existing
DjVuLibre software package because, at the time, the software had issues.

"That faulty DjVuLibre version was the equivalent of 4,317 versions ago and
the issue has been long fixed now EXCEPT that the .DTD file needed to base
the plain-text to XML conversion on still has the wrong 'folder path' on
local DjVuLibre installs (if this is true on server installs as well, I
cannot say for sure). Once I copied the folder to the [wrong] folder path,
I was able to generate the XMLs all day long. These XMLs are just like the
ones IA generates during their process (in addition to the XML that ABBYY
generates for them).

"So its not that we as a community decided not to follow through with
(coordinate rich) XML generation but got stuck with the plain-text dump
workaround due to a DjVuLibre problem that no longer exists. Plus, the guy
who created the beginnings of this fabulous disaster was like tick with an
attention span deficit and moved on to conjuring up some other blasted
thing or another instead of following up on his own workaround & finish the
XML coding portion once DjVuLibre glitch was fixed. -- 15:16, 15 July 2013
(UTC)


[1]
http://en.wikisource.org/wiki/Wikisource:Scriptorium#EPUB.2FHTML_to_Wikitext
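
For anyone who wants to poke at this locally, here is a minimal sketch of the
coordinate-rich extraction George describes, driving DjVuLibre's djvutoxml from
Python. This is not the code running on the Wikimedia servers; "page.djvu" and
"page.xml" are placeholder file names, and the WORD element and coords
attribute follow the DjVuXML DTD shipped with DjVuLibre, so check them against
your own output:

import subprocess
import xml.etree.ElementTree as ET

# djvutoxml (part of DjVuLibre) writes an XML rendition of the DjVu file,
# including the hidden text layer with per-word bounding boxes.
subprocess.run(["djvutoxml", "page.djvu", "page.xml"], check=True)

tree = ET.parse("page.xml")
for word in tree.iter("WORD"):
    # each WORD carries a coords attribute holding the X-min/Y-min/X-max/Y-max
    # box that the current plain-text dump throws away
    print(word.get("coords"), word.text)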

On Wed, Jul 17, 2013 at 6:57 AM, Alex Brollo <alex.brollo [at] gmail> wrote:

> Just a brief comment about the djvu text layer, using IA files to dig
> deeper into the topic.
>
> FineReader OCR stores incredibly detailed information in a proprietary
> format; various FineReader versions then export some of this extremely
> rich set of information into different outputs - one of them being the
> djvu text layer. It's worth noting that even though all the information
> stored in the djvu text layer can be extracted and used, the set of
> information wrapped into the djvu text layer (in either lisp-like or xml
> format) is only a minor subset of the original OCR information.
>
> Anyone interested in much more information can find it in the abbyy.xml
> output, which the Internet Archive provides as abbyy.gz in its list of
> downloadable files. It's a very heavy and complex xml structure, but it is
> possible to parse it and to extract from it everything wrapped into the
> djvu text layer and much more - most interestingly wordPenalty, that is,
> word by word, a summary of how uncertain the OCR recognition of the whole
> word is (a rough parsing sketch follows below the quoted messages).
>
> We (Aarti and I) are digging into this mess, with quick preliminary
> results; in [[it:w:Utente:Alex brollo/Sandbox]] you can see some brief
> pieces of text extracted from abbyy.gz, where words that the OCR software
> considers doubtful are shown in red. They can be easily managed by
> VisualEditor, since they come simply from a simple span tag.
>
> Now I'm waiting for Aarti's work; as soon as VisualEditor for nsPage runs,
> it will be possible to extract text by bot from abbyy.gz (if the work
> comes from IA) and to upload that text as OCR.
>
> Alex
>
>
>
> 2013/7/16 David Cuenca <dacuetu [at] gmail>
>
>> Hi Aubrey,
>> Thanks for the heads-up. I have CC'ed Sébastien from fr-ws; he worked on
>> the djvu text extraction/merging and was interested in following up on
>> that. Maybe he has some fresh ideas about it.
>>
>> Micru
>>
>> On Tue, Jul 16, 2013 at 10:24 AM, Andrea Zanni <zanni.andrea84 [at] gmail>wrote:
>>
>>> Hi David, Aarti, thibaud and Tpt,
>>> please look at this thread:
>>>
>>> http://en.wikisource.org/wiki/Wikisource:Scriptorium#EPUB.2FHTML_to_Wikitext
>>> especially the last message.
>>>
>>> It seems George Orwell III knows his stuff about Djvu and the Proofread
>>> extension, and it's probably worth digging into this djvu "text layer"
>>> thing.
>>>
>>> Even if I might dream of an ideal solution (a "layered structure" for
>>> wikisource, in which text can be marked up several times in different
>>> layers), that is probably very far away.
>>>
>>> But it's still important to pave the way for further improvements, I
>>> guess: losing all the information from a formatted, mapped IA djvu is not
>>> a good thing to do, IMHO.
>>> And the Visual Editor could help us, in the future, to keep some of that
>>> information (italics, bold, etc.).
>>>
>>> I know Aarti spoke with Alex about abbyy.xml: is it possible to do
>>> something with it?
>>>
>>> Aubrey
>>>
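
As a rough illustration of what Alex describes above (and not the code he and
Aarti are actually working on), something along these lines pulls the
per-character boxes and the wordPenalty attribute out of an IA abbyy.gz file.
"book_abbyy.gz" is a placeholder file name, and the charParams element and its
attributes are taken from the ABBYY FineReader XML schema as Alex describes it,
so verify them against a real file:

import gzip
import xml.etree.ElementTree as ET

# abbyy.gz is just the gzipped FineReader XML, so it can be parsed directly
with gzip.open("book_abbyy.gz", "rb") as f:
    tree = ET.parse(f)

for elem in tree.iter():
    # the ABBYY XML is namespaced, so match on the local tag name only
    if elem.tag.endswith("charParams"):
        penalty = elem.get("wordPenalty")
        if penalty is not None:
            # l/t/r/b give the character box; elem.text is the character itself
            print(elem.get("l"), elem.get("t"), elem.get("r"), elem.get("b"),
                  repr(elem.text), "wordPenalty:", penalty)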


--
Etiamsi omnes, ego non
_______________________________________________
MediaWiki-l mailing list
MediaWiki-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/mediawiki-l


brion at pobox

Jul 17, 2013, 8:36 AM

Post #2 of 6 (108 views)
Re: [Wikisource-l] Proofread extension "extraction" of OCR text in Djvu [In reply to]

I'm not sure his attitude will encourage people to work with him to his
specifications.

-- brion




On Wed, Jul 17, 2013 at 8:12 AM, David Cuenca <dacuetu [at] gmail> wrote:

> I'm forwarding this message by George Orwell III on en-ws [1]. I think it
> is extremely important as it offers insight into what is wrong with
> Djvu handling on Wikisource.
_______________________________________________
MediaWiki-l mailing list
MediaWiki-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/mediawiki-l


dacuetu at gmail

Jul 17, 2013, 1:10 PM

Post #3 of 6 (104 views)
Re: [Wikisource-l] Proofread extension "extraction" of OCR text in Djvu [In reply to]

Now that you mention it...
http://linux.slashdot.org/story/13/07/15/2316219/kernel-dev-tells-linus-torvalds-to-stop-using-abusive-language

Micru

On Wed, Jul 17, 2013 at 11:36 AM, Brion Vibber <brion [at] pobox> wrote:

> I'm not sure his attitude will encourage people to work with him to his
> specifications.
>
> -- brion



--
Etiamsi omnes, ego non
_______________________________________________
MediaWiki-l mailing list
MediaWiki-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/mediawiki-l


wjhonson at aol

Jul 17, 2013, 1:21 PM

Post #4 of 6 (104 views)
Re: [Wikisource-l] Proofread extension "extraction" of OCR text in Djvu [In reply to]

I had to moderate my response out of the belief that I would be banned from yet another wiki list by some tyrant.

By the way, "Kernel Dev" didn't state this; one Intel developer named Sarah Sharp stated it.

But those against frank speech have to make it out to be extreme in an effort to once again silence opposition, instead of embracing this sort of speech as the only thing that drives the wagon.







-----Original Message-----
From: David Cuenca <dacuetu [at] gmail>
To: MediaWiki announcements and site admin list <mediawiki-l [at] lists>
Sent: Wed, Jul 17, 2013 1:12 pm
Subject: Re: [MediaWiki-l] [Wikisource-l] Proofread extension "extraction" of OCR text in Djvu


Now that you mention it...
http://linux.slashdot.org/story/13/07/15/2316219/kernel-dev-tells-linus-torvalds-to-stop-using-abusive-language

Micru

On Wed, Jul 17, 2013 at 11:36 AM, Brion Vibber <brion [at] pobox> wrote:

> I'm not sure his attitude will encourage people to work with him to his
> specifications.
>
> -- brion



_______________________________________________
MediaWiki-l mailing list
MediaWiki-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/mediawiki-l


brion at pobox

Jul 17, 2013, 1:22 PM

Post #5 of 6 (104 views)
Re: [Wikisource-l] Proofread extension "extraction" of OCR text in Djvu [In reply to]

Yeah, Linus is kind of an asshole too. I don't see that as something to
emulate.

-- brion


On Wed, Jul 17, 2013 at 1:10 PM, David Cuenca <dacuetu [at] gmail> wrote:

> Now that you mention it...
>
> http://linux.slashdot.org/story/13/07/15/2316219/kernel-dev-tells-linus-torvalds-to-stop-using-abusive-language
>
> Micru
_______________________________________________
MediaWiki-l mailing list
MediaWiki-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/mediawiki-l


wjhonson at aol

Jul 25, 2013, 3:48 PM

Post #6 of 6 (89 views)
Re: [Wikisource-l] Proofread extension "extraction" of OCR text in Djvu [In reply to]

More on the great Sarah Sharp fiasco :)

http://www.itwire.com/opinion-and-analysis/open-sauce/60866-female-devs-outburst-against-torvalds-was-planned?utm_source=iTWire+Update&utm_campaign=86ee84691c-2012100810_8_2012&utm_medium=email&utm_term=0_0ab978d1b5-86ee84691c-35459309









-----Original Message-----
From: Wjhonson <wjhonson [at] aol>
To: mediawiki-l <mediawiki-l [at] lists>
Sent: Wed, Jul 17, 2013 1:21 pm
Subject: Re: [MediaWiki-l] [Wikisource-l] Proofread extension "extraction" of OCR text in Djvu


I had to moderate my response out of the belief that I would be banned from yet another wiki list by some tyrant.

By the way, "Kernel Dev" didn't state this; one Intel developer named Sarah Sharp stated it.

But those against frank speech have to make it out to be extreme in an effort to once again silence opposition, instead of embracing this sort of speech as the only thing that drives the wagon.











_______________________________________________
MediaWiki-l mailing list
MediaWiki-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/mediawiki-l
