
Mailing List Archive: Interchange: users

How would you search, store, and display documents

 

 



paul at gishnetwork

Aug 12, 2011, 12:23 AM

Post #1 of 4 (278 views)
How would you search, store, and display documents

I'm tasked with building a pretty complex training/educational system for
one of my clients. This would bring their paper manual and assets, HTML
newsletters, FAQ, videos, how-tos, etc. all into one intuitive
"knowledgebase", if you will.

I know how I want it to work and look, but what I don't know ATM is what is
the best format to use. My main concern is the searchability and storage of
the main body of each article. This text will arbitrarily contain html for
formatting, images, div's for quotes, or tables for data, and the like
(everything will be styled with css of course)

It seems to me there are several paths...

#1 Store the text of the page, with any HTML needed for the article, in a
table and, assuming HTML doesn't play well with fulltext searches, work
around that by saving a text-only copy into a second field used only for
searching.

#2 Delve into xml/xsl.

#3 Create some sort of wiki parser to use in conjunction with IC. I *really*
would have liked to use Kevin's system, and improved upon that, but that
doesn't seem likely.

#4 Have a parser like Kevin's made, extend it to handle images, and create a
simple online "editor" for it.

It should be noted that I don't use anything but IC and MySQL. I don't like
the headaches, worries, or distractions of some other platform. IC can do
it, so I'd rather just have IC do it.

Thank you for any advice you can lend.

Paul Jordan


_______________________________________________
interchange-users mailing list
interchange-users [at] icdevgroup
http://www.icdevgroup.org/mailman/listinfo/interchange-users


racke at linuxia

Aug 12, 2011, 12:01 AM

Post #2 of 4 (274 views)
Re: How would you search, store, and display documents [In reply to]

On 08/12/2011 10:23 AM, Paul Jordan wrote:
>
> I'm tasked with building a pretty complex training/educational system for one of my clients. This would bring their paper manual and assets, HTML newsletters, FAQ, videos, how-tos, etc. all into one intuitive "knowledgebase", if you will.
>
> I know how I want it to work and look, but what I don't know ATM is what is the best format to use. My main concern is the searchability and storage of the main body of each article. This text will arbitrarily contain html for formatting, images, div's for quotes, or tables for data, and the like (everything will be styled with css of course)
>
> It seems to me there are several paths...
>
> #1 Store the text of the page, with any HTML needed for the article, in a table and, assuming HTML doesn't play well with fulltext searches, work around that by saving a text-only copy into a second field used only for searching.

Fulltext search engines like Lucene are able to parse HTML and adjust the weight according to the position of the words (HTML title, etc).
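
As a rough illustration of position-based weighting (this is not Lucene's actual scoring, which is considerably more involved), the Python sketch below counts each word with a weight that depends on the tag enclosing it. The weights in FIELD_WEIGHTS, the function name, and the sample document are all assumptions made up for this example.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Hypothetical weights: a word in the title counts five times as much
# as the same word in ordinary body text.
FIELD_WEIGHTS = {"title": 5, "h1": 3, "h2": 2}
DEFAULT_WEIGHT = 1


class WeightedTerms(HTMLParser):
    """Accumulates a per-word score weighted by the enclosing tag."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.stack = []          # currently open tags that carry a weight
        self.scores = Counter()  # word -> accumulated weight

    def handle_starttag(self, tag, attrs):
        if tag in FIELD_WEIGHTS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

    def handle_data(self, data):
        weight = FIELD_WEIGHTS[self.stack[-1]] if self.stack else DEFAULT_WEIGHT
        for word in re.findall(r"\w+", data.lower()):
            self.scores[word] += weight


def score_terms(html):
    """Return a Counter mapping each word to its weighted score."""
    parser = WeightedTerms()
    parser.feed(html)
    parser.close()
    return parser.scores


doc = (
    "<html><head><title>Printer setup</title></head>"
    "<body><h1>Setup</h1><p>Connect the printer and run setup.</p></body></html>"
)
scores = score_terms(doc)
```

Here "setup" scores higher than "connect" because it also appears in the title and the heading, which is the effect described above.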

>
> #2 Delve into xml/xsl.
>
> #3 Create some sort of wiki parser to use in conjunction with IC. I *really* would have liked to use Kevin's system, and improved upon that, but that doesn't seem likely.

I hacked on some Wiki code based on Wiki::Toolkit; it is in the WellWell repository. If you already have Wiki-formatted text, you can use this for HTML formatting and display:

http://git.icdevgroup.org/?p=wellwell.git;a=blob;f=lib/Vend/Wiki.pm;h=8cb8499ef669ed60d26867db04a41cfba3a641e8;hb=HEAD

>
> #4 Have a parser like Kevin's made, extend it to handle images, and create a simple online "editor" for it.
>

An editor for Wiki text shouldn't be hard to come up with.
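
To give a feel for how little is needed for basic wiki formatting, here is a minimal Python sketch of a wiki-to-HTML converter. The markup it accepts (`== Heading ==`, `'''bold'''`, `''italic''`, blank-line-separated paragraphs) is a hypothetical subset chosen for this example; it is not the syntax of Wiki::Toolkit or of the Vend/Wiki.pm module linked above.

```python
import html
import re


def wiki_to_html(source):
    """Convert a tiny, hypothetical wiki dialect to HTML."""
    blocks = []
    # Paragraphs and headings are separated by blank lines.
    for block in re.split(r"\n\s*\n", source.strip()):
        # Escape &, < and > first so raw HTML in the source is shown literally.
        line = html.escape(block.strip(), quote=False)
        heading = re.fullmatch(r"==\s*(.+?)\s*==", line)
        if heading:
            blocks.append(f"<h2>{heading.group(1)}</h2>")
            continue
        # Bold must be handled before italic, since ''' contains ''.
        line = re.sub(r"'''(.+?)'''", r"<b>\1</b>", line)
        line = re.sub(r"''(.+?)''", r"<i>\1</i>", line)
        blocks.append(f"<p>{line}</p>")
    return "\n".join(blocks)


wiki_source = "== Setup ==\n\nPlug in the '''power''' cable and ''wait''."
rendered = wiki_to_html(wiki_source)
```

Because the wiki source contains no HTML, it can be indexed for search as it is, and the HTML is generated only for display.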

Regards
Racke


--
LinuXia Systems => http://www.linuxia.de/
Expert Interchange Consulting and System Administration
ICDEVGROUP => http://www.icdevgroup.org/
Interchange Development Team




paul at gishnetwork

Aug 12, 2011, 8:25 AM

Post #3 of 4 (274 views)
Re: How would you search, store, and display documents [In reply to]

Thank you Racke

While researching Lucene I ran across this post by someone contemplating
exactly my issue:

http://robrohan.com/2007/02/09/do-you-save-html-in-your-relational-database/

In it he proposes a pretty nifty idea. Here it is in summary:

====================================================
One solution I am kicking around is trying to write / find some sort of
text style markup language that is stored separate from the text data
(This has to exist somewhere, probably an old school Unix format,
but I am not even sure where to start looking). I am thinking it could
work something like this:

The stylesheet, in its most basic form, would be a type and
position-length pair. So for the text:

This <b>is</b> <i>example</i> text, <b>man</b>.

A parser would sniff out the tags, and make a stylesheet that could look
like:

(sheet (bold (5,2), (22,3)), (italic (8,7)) )
====================================================
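
A minimal Python sketch of the parser described in the excerpt above, assuming zero-based character offsets into the plain text and only the two inline tags from the example. The class name, the function name, and the STYLE_NAMES mapping are hypothetical, introduced only for illustration.

```python
from html.parser import HTMLParser

# Hypothetical mapping from tag names to the style names used in the sheet.
STYLE_NAMES = {"b": "bold", "i": "italic"}


class SheetBuilder(HTMLParser):
    """Splits inline-formatted HTML into plain text plus (position, length) spans."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.text = ""       # plain text collected so far
        self.open_tags = []  # stack of (tag, start offset) for unclosed tags
        self.sheet = {}      # style name -> list of (position, length)

    def handle_starttag(self, tag, attrs):
        if tag in STYLE_NAMES:
            self.open_tags.append((tag, len(self.text)))

    def handle_endtag(self, tag):
        # Only properly nested tags are handled in this sketch.
        if self.open_tags and self.open_tags[-1][0] == tag:
            _, start = self.open_tags.pop()
            span = (start, len(self.text) - start)
            self.sheet.setdefault(STYLE_NAMES[tag], []).append(span)

    def handle_data(self, data):
        self.text += data


def split_markup(html):
    """Return (plain_text, sheet) for an HTML fragment."""
    builder = SheetBuilder()
    builder.feed(html)
    builder.close()
    return builder.text, builder.sheet


text, sheet = split_markup("This <b>is</b> <i>example</i> text, <b>man</b>.")
```

For the example sentence this yields the same spans as the sheet in the excerpt: bold at (5,2) and (22,3), italic at (8,7).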

I read through the comments, and the only valid issue anyone raised was the
logistics of editing and resyncing. However, my simple solution to that is
to delete and resubmit all of this positional data each time the article is
edited, so there is never any need to "adjust positions".

Not that I can build this kind of thing myself, but I think it would not be
that complicated. In fact, instead of supporting a fixed set of style codes,
why not just store each tag verbatim? This person's example would then look
more like:

5|<b>|,7|</b>|,8|<i>|,15|</i>|

Could this not be stored in a single field, then applied via a regex on
output?
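
One way the "single field" idea could be applied on output is sketched in Python below. It assumes the field format shown above (position, pipe, tag, pipe, comma-separated), zero-based offsets into the plain text, and tags that never contain a pipe character. It uses a regex only to parse the field and plain string slicing to insert the tags; the function names are hypothetical.

```python
import re

# One entry looks like: 5|<b>|
ENTRY = re.compile(r"(\d+)\|([^|]*)\|")


def parse_markup(field):
    """Turn '5|<b>|,7|</b>|' into [(5, '<b>'), (7, '</b>')]."""
    return [(int(pos), tag) for pos, tag in ENTRY.findall(field)]


def apply_markup(text, field):
    """Re-insert the stored tags into the plain text."""
    # sorted() is stable, so tags sharing a position keep their stored order.
    entries = sorted(parse_markup(field), key=lambda entry: entry[0])
    # Insert from the end of the text backwards so that earlier offsets
    # are not shifted by tags that have already been inserted.
    for pos, tag in reversed(entries):
        text = text[:pos] + tag + text[pos:]
    return text


plain = "This is example text, man."
field = "5|<b>|,7|</b>|,8|<i>|,15|</i>|"
formatted = apply_markup(plain, field)
```

Note that the plain text would still need HTML-escaping before the tags are inserted if it can contain characters such as < or &, and the stored offsets would have to account for that.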

My target dataset would be something like the body of a blog post. Anything
interactive would be built by IC on the page itself as the environment.

Paul


peter at pajamian

Aug 12, 2011, 7:15 PM

Post #4 of 4 (272 views)
Re: How would you search, store, and display documents [In reply to]

> While researching Lucene I ran across this post by someone contemplating
> exactly my issue:
>
> http://robrohan.com/2007/02/09/do-you-save-html-in-your-relational-database/
>
> In it he proposes a pretty nifty idea. Here it is in summary:
>
> ====================================================
> One solution I am kicking around is trying to write / find some sort of
> text style markup language that is stored separate from the text data
> (This has to exist somewhere, probably an old school Unix format,
> but I am not even sure where to start looking). I am thinking it could
> work something like this:
>
> The stylesheet, in its most basic form, would be a type and
> position-length pair. So for the text:
>
> This <b>is</b> <i>example</i> text, <b>man</b>.
>
> A parser would sniff out the tags, and make a stylesheet that could look
> like:
>
> (sheet (bold (5,2), (22,3)), (italic (8,7)) )
> ====================================================
>
> I read through the comments, and the only valid issue anyone raised was the
> logistics of editing and resyncing. However, my simple solution to that is
> to delete and resubmit all of this positional data each time the article is
> edited, so there is never any need to "adjust positions".
>
> Not that I can build this kind of thing myself, but I think it would not be
> that complicated. In fact, instead of supporting a fixed set of style codes,
> why not just store each tag verbatim? This person's example would then look
> more like:
>
> 5|<b>|,7|</b>|,8|<i>|,15|</i>|
>
> Could this not be stored in a single field, then applied via a regex on
> output?
>
> My target dataset would be something like the body of a blog post. Anything
> interactive would be built by IC on the page itself as the environment.

This just seems like way too much work and re-inventing the wheel to be
worthwhile. If you decide to just store the HTML, then it is trivially
easy to get a plain-text document from it by simply stripping out the tags
(there are certainly modules that do this for you). That is much preferable,
IMO, to maintaining position data in a separate document, where you have to
hope you can keep it in sync with the text and then cobble it all back
together again in order to display the formatted version.
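
As a minimal sketch of the tag-stripping step, here is one way to do it with Python's standard html.parser module (in an Interchange setup a Perl module would play the same role). The BLOCK_TAGS set, the function name, and the sample article are assumptions for this example, and script or style contents are not filtered out.

```python
from html.parser import HTMLParser

# Tags after which words should not run together in the extracted text.
BLOCK_TAGS = {"p", "div", "br", "li", "tr", "td", "th", "table",
              "h1", "h2", "h3", "h4", "h5", "h6", "blockquote"}


class TextExtractor(HTMLParser):
    """Collects the text of an HTML fragment and discards the tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.parts.append(" ")

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.parts.append(" ")

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(html):
    """Return the plain text of an HTML fragment with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.parts).split())


article_html = "<h2>Setup</h2><p>Plug in the <b>power</b> cable &amp; wait.</p>"
search_text = html_to_text(article_html)
```

The resulting string could be written to a second, search-only column each time the article is saved, so the HTML stays the single source of truth.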

As for searchability, there are search engines (such as swish-e) that
can parse and index the files right from the HTML version. You just set
up an indexing run from a cron job and run the search through the engine.


Peter
