Mailing List Archive: Wikipedia: Wikitech

New Project, Link Hooks... Needing some research

 

 


dan_the_man at telus

Aug 15, 2008, 3:17 AM

Post #1 of 9
New Project, Link Hooks... Needing some research

I've noticed a growing number of extensions extending link syntax
(namely SMW's annotations, and other extensions using Embed:, Video:, or
theoretically even Audio: namespaces for embedding things).

However, all of these implementations have serious issues. The parser
has its own internal link parsing, but when an extension wants to do
something with links it's customary to use a regex rather than duplicate
a small part of the parser. This normally leads to either a limited
syntax that falls short of what the parser does, or a regex so complex
that it causes server errors when the syntax is slightly broken (for
example, a missing trailing ]] ).

For that reason I'm looking into adding a new feature to the parser:
Link Hooks. Basically, this would allow an extension to hook into link
processing for a namespace or a pattern.

I plan to support a number of flags, so it should handle most cases:
- Link/Media callbacks: link modification vs. embedding.
- Namespace/pattern: an ns number, or a special pattern (like SMW's ::).
- Multi-params: pipe separated params rather than one display text.
- Recursive parameters: things like Image:, where links can be inside
  parameters.
- Recursive link text: for patterns which break things up and may
  contain links.
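
To make the flag idea concrete, here is a minimal sketch of what such a
registration could look like. It is written in Python purely for
illustration; nothing here is MediaWiki code. The names LinkHookFlag,
set_link_hook, LINK_HOOKS and embed_video are all hypothetical, and the
flag names simply mirror the options listed above.

```python
from enum import Flag, auto
from typing import Callable, Dict, Tuple


class LinkHookFlag(Flag):
    """Hypothetical option flags, one per option named in the post."""
    MEDIA = auto()             # embed callback rather than link modification
    PATTERN = auto()           # key is a pattern such as "::", not a namespace
    MULTI_PARAMS = auto()      # pipe separated params, not one display text
    RECURSIVE_PARAMS = auto()  # links may appear inside parameters
    RECURSIVE_TEXT = auto()    # link text may itself contain links


# Registry mapping a namespace name or pattern to (callback, flags).
LINK_HOOKS: Dict[str, Tuple[Callable[..., str], LinkHookFlag]] = {}


def set_link_hook(key: str, callback: Callable[..., str],
                  flags: LinkHookFlag = LinkHookFlag(0)) -> None:
    """Register a callback for a namespace or pattern (illustrative only)."""
    LINK_HOOKS[key] = (callback, flags)


def embed_video(title: str, *params: str) -> str:
    """Toy callback: returns a placeholder string instead of real HTML."""
    return f"<embed {title} ({', '.join(params)})>"


# An embedding namespace that takes pipe separated parameters.
set_link_hook("Video", embed_video,
              LinkHookFlag.MEDIA | LinkHookFlag.MULTI_PARAMS)
```

Combining flags with | keeps the registration call short, and the parser
side can test a single option with "LinkHookFlag.MEDIA in flags".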

Unfortunately, I hit a snag in the code when dealing with
[[Embedablens:Page|Content with [[link|displaytext]] inside]]. I can't
provide data to extensions in a sane way. Either plain text is sent to
them and they work with that (albeit breaking things like usual), or I
try to split on the |'s, which doesn't work with nested things, or I
parse the nested links first, in which case extensions get a
hard-to-work-with mess passed to them as their data.
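
The pipe-splitting failure is easy to reproduce. The following small
Python sketch (the helper name naive_params is made up for this
illustration) splits the link from the example on every pipe and shows
how the nested link gets cut in half:

```python
from typing import List


def naive_params(link: str) -> List[str]:
    """Split the inside of [[...]] on every pipe, ignoring nesting."""
    inner = link[2:-2]  # strip the outer [[ and ]]
    return inner.split("|")


text = "[[Embedablens:Page|Content with [[link|displaytext]] inside]]"
parts = naive_params(text)
# parts == ['Embedablens:Page', 'Content with [[link', 'displaytext]] inside']
```

The second and third pieces each contain half of the inner link, so a
callback receiving them cannot tell parameters from link text.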

The nice way the preprocessor works with objects has shown me that the
best approach would probably be to recursively parse the text into link
objects and then do our expansion, while also giving callbacks special
ways to access the tree (extract as WikiText, HTML, or plain text).
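
As a rough illustration of that idea, the Python sketch below parses
text into a small tree of link objects and offers two of the extraction
modes mentioned (WikiText and plain text; HTML is left out). The class
and function names (LinkNode, render, parse_link, parse) are
assumptions for this example and do not correspond to the preprocessor's
real classes. It assumes well-formed input and raises on an unclosed [[.

```python
from dataclasses import dataclass, field
from typing import List, Tuple, Union

Part = Union[str, "LinkNode"]


@dataclass
class LinkNode:
    # params[0] holds the target; later entries are pipe separated params.
    params: List[List[Part]] = field(default_factory=list)

    def to_wikitext(self) -> str:
        return "[[" + "|".join(render(p, "wikitext") for p in self.params) + "]]"

    def to_plain_text(self) -> str:
        # Use the last parameter (display text, or the target if alone).
        return render(self.params[-1], "plain")


def render(parts: List[Part], mode: str) -> str:
    out = []
    for part in parts:
        if isinstance(part, str):
            out.append(part)
        elif mode == "wikitext":
            out.append(part.to_wikitext())
        else:
            out.append(part.to_plain_text())
    return "".join(out)


def parse_link(text: str, pos: int) -> Tuple[LinkNode, int]:
    """Parse one link body, starting just after its opening [[."""
    params: List[List[Part]] = [[]]
    buf = ""
    while pos < len(text):
        if text.startswith("[[", pos):      # nested link: recurse
            if buf:
                params[-1].append(buf)
                buf = ""
            child, pos = parse_link(text, pos + 2)
            params[-1].append(child)
        elif text.startswith("]]", pos):    # end of this link
            if buf:
                params[-1].append(buf)
            return LinkNode(params), pos + 2
        elif text[pos] == "|":              # pipe at this link's own level
            if buf:
                params[-1].append(buf)
                buf = ""
            params.append([])
            pos += 1
        else:
            buf += text[pos]
            pos += 1
    raise ValueError("unclosed [[")


def parse(text: str) -> List[Part]:
    """Split text into plain strings and LinkNode objects."""
    parts: List[Part] = []
    buf, pos = "", 0
    while pos < len(text):
        if text.startswith("[[", pos):
            if buf:
                parts.append(buf)
                buf = ""
            node, pos = parse_link(text, pos + 2)
            parts.append(node)
        else:
            buf += text[pos]
            pos += 1
    if buf:
        parts.append(buf)
    return parts
```

With this shape a callback gets its parameters already separated at the
right nesting level, and can ask each nested node for whichever
representation it needs.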


Researching how the parser handles links gave me good results at first
([[link [[inside of]] link]] nicely gives you a link to "inside of" with
the outside stuff verbatim, just as the processor I have in mind would
do). However, I ran into an ugly, sticky mess with image embedding.
http://dev.wiki-tools.com/wiki/LinkHook#Old_Tests
(Ignore the fact that my examples here don't have the frame option.)
[[Image:File.ext|Caption]] renders as an image with "Caption".
[[Image:File.ext|[[Image:File.ext|Caption]]]] renders an image inside of
another image that has a caption of "Caption".
[[Image:File.ext|[[Image:File.ext|[[link]]]]]] renders [[link]] as a
link; the rest is completely verbatim.

Honestly, the syntax is inconsistent with itself. If we were trying to
stop embeds inside of embeds, then the last one should render as an
image, with a link to [[link]] and the other Image: verbatim as a caption.

I believe there is a bug filed about the 2nd case; if anyone has it handy
I'd love a link. I hunted through Bugzilla but couldn't find it.

Some use cases, and what's expected of them, would be nice.

My issue is that Image links are functionally supposed to be the same as
a setLinkHook using the Media, Multi-params, and Recursive parameters
options (embed, but not when there is a : at the start; pipe separated
parameters; and parameters can have links inside of them).
However, any extension using setLinkHook with the recursive parameters
option would be expecting something different:
[[Embed:Title|[[Otherembed:Title]] and [[link]]]]
would actually render as an embed with two links (since it's inside of
another embed, the 'Otherembed' reverts to a link).
And: [[Embed:Title|[[Otherembed:Title|[[link]]]]]]
would actually render as an embed, with a link to [[link]] and the rest
of the caption verbatim.

--
~Daniel Friesen(Dantman, Nadir-Seen-Fire) of:
-The Nadir-Point Group (http://nadir-point.com)
--It's Wiki-Tools subgroup (http://wiki-tools.com)
--The ElectronicMe project (http://electronic-me.org)
--Games-G.P.S. (http://ggps.org)
-And Wikia ACG on Wikia.com (http://wikia.com/wiki/Wikia_ACG)
--Animepedia (http://anime.wikia.com)
--Narutopedia (http://naruto.wikia.com)


_______________________________________________
Wikitech-l mailing list
Wikitech-l [at] lists
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


brion at wikimedia

Aug 15, 2008, 9:21 AM

Post #2 of 9
Re: New Project, Link Hooks... Needing some research [In reply to]

Daniel Friesen wrote:
> [[Image:File.ext|Caption]] Renders as a image with "Caption"
> [[Image:File.ext|[[Image:File.ext|Caption]]]] Renders an image inside of
> another image that has a caption of "Caption".
> [[Image:File.ext|[[Image:File.ext|[[link]]]]]] Renders [[link]] as a
> link, the rest is completely verbatim.
>
> Honestly, the syntax is inconsistent with itself. If we were trying to
> stop embeds inside of embeds, then the last one should render as an
> image, with a link to [[link]] and the other Image: verbatim as a caption.

Yes, links are not currently fully 'embeddable' in a recursive way. :(
You're currently allowed just one level of 'link embedding' in the
caption area for Image: links, and even that's special-cased.

It was basically stuck in as a hack on the existing link parsing, which
was optimized for doing a single flat pass of links through the entire
page; after we extended link syntax to allow image captions, there was a
need to hack it up to allow links in the captions...

If it can be more cleanly done in a way that, as it happens, lets you do
multiple levels cleanly, that would probably be just great! But
definitely try to keep it clean and consistent. :)


In general, though, I'm not sure we should concentrate on using link
syntax here; the trend these days seems to be to use parser functions
for such things.

-- brion


dan_the_man at telus

Aug 15, 2008, 1:27 PM

Post #3 of 9
Re: New Project, Link Hooks... Needing some research [In reply to]

Yes, I'm trying to keep things consistent. The only place my idea for a
processor differs from the current syntax is in the insane edge cases,
and even then only the two most unlikely of them.
I'd like to track down the bug report related to that odd rendering.
I can't imagine anyone relying on
[[Image:File.ext|[[Image:File.ext|[[link]]]]]] rendering a [[link]] and
the rest verbatim. However, there might be an insane use of
[[Image:File.ext|[[Image:File.ext|Caption]]]].


Well, yes. Link syntax is not a one-stop solution; in fact, compared to
ParserFunctions its features are going to be substandard.
However, there are some nice cases where extending a link fits the syntax
better than using a parser function, especially embedding via the
Image: namespace and how that could be extended to embed things like
Audio: and such.
SMW also uses annotations, which most of the time do fit in as link-like
syntax. SMW could use an #annotate pfunc for the ugly cases, but that's
beside the point here. It would still be good to preserve the old syntax
where possible.

~Daniel Friesen (Dantman, Nadir-Seen-Fire)


rolf.lampa at rilnet

Aug 16, 2008, 12:34 PM

Post #4 of 9
Re: New Project, Link Hooks... Needing some research [In reply to]

Daniel Friesen wrote:
> However there are some nice cases where extending a link fits the syntax
> better than using a parser function. <...>
> SMW Also uses annotations, which do most of the time fit in as link like
> syntax. SMW Could use an #annotate pfunc for the ugly cases, but that's
> beside the case here. It would still be good to preserve the old syntax
> where possible.


MW's and SMW's parsing of links isn't very impressive at the moment.

Due to the limitations of regexes for nested links, and other problems
with overly complex expressions, I've had some fun playing around with
an attempt at a new algorithm which would actually follow the desired
rules.

The algorithm should return matching tags, nested and in perfect
'balance' (which is not the case now), at any complexity, and preferably
without choking the hardware.

So far I have only a Delphi* version (draft) for which I did some
functional and performance tests last night:
http://wiki.rilnet.com/wiki/Pattern_matching_of_SMW_properties/Testdata

For the regexes currently used in SMW, see (from a recent post by Markus
Krötzsch):
http://wiki.rilnet.com/wiki/Pattern_matching_of_SMW_properties

Regards,

// Rolf Lampa

* For Pascal->PHP conversion I extended an existing DelphiToCpp converter:
http://wiki.rilnet.com/wiki/RIL_DelphiToPHP_Converter




nospam at vyznev

Aug 16, 2008, 2:35 PM

Post #5 of 9
Re: New Project, Link Hooks... Needing some research [In reply to]

Rolf Lampa wrote:
>
> MW's and SMW's parsing of links isn't very impressing at the moment.

I might be talking out of my ass here, since I haven't really looked
very much at the relevant parts of the code (except to know that the
current link parsing is indeed hacky), but couldn't we somehow reuse
whatever code we currently use to parse the curly bracket syntax for
transclusion, parser functions and whatnot? After all, from a user's
viewpoint, pretty much the only syntactical difference between linking
and transclusion is the shape of the brackets.

--
Ilmari Karonen



rolf.lampa at rilnet

Aug 16, 2008, 6:30 PM

Post #6 of 9
Re: New Project, Link Hooks... Needing some research [In reply to]

Ilmari Karonen wrote:
> Rolf Lampa wrote:
>> MW's and SMW's parsing of links isn't very impressing at the moment.
>
> After all, from a user's
> viewpoint, pretty much the only syntactical difference between linking
> and transclusion is the shape of the brackets.


Hm, I'd say that it's about more than just keeping braces/brackets in
balance. It's also about rules for the link content, which (should)
determine whether the whole link is skipped or not (on syntax errors).
Some rules require unique logic and awareness of context.

But as I said, I haven't looked into how the logic for braces is
structured either (whether it allows for callbacks, for example, though
those can be risky).

In any case, I'm learning some PHP, so I'll continue "playing around"
with this for a while, if for no other reason than my own interest. :)

Regards,

// Rolf Lampa




dan_the_man at telus

Aug 16, 2008, 10:47 PM

Post #7 of 9
Re: New Project, Link Hooks... Needing some research [In reply to]

TimStarling commented that adding links into the preprocessor would
change the way they are handled and would break cases like
http://sandbox.wiki-tools.com/edit/FakeLink and so change the syntax of
WikiText in an incompatible way.


The actual code for Parser::replaceInternalLinks is a sort of explode
engine. It starts off by exploding on [[, then searches for the ends and
such, also taking into account Image:, which can have recursive stuff.
It's quite ugly, but it works, and I have been able to find good points
to run callbacks in. My issue lies in the |: there is no strict handling
of those, and making them "safe" is done by parsing the links inside of
a link before the | is broken up. That works well for the parser, but
not for anything you want to send to a callback.

So the plan is to actually build an object tree similar to the Frames
and Parts the preprocessor uses. This'll allow for better handling of
things inside of callbacks.

~Daniel Friesen (Dantman, Nadir-Seen-Fire)


rolf.lampa at rilnet

Aug 17, 2008, 8:46 AM

Post #8 of 9
Re: New Project, Link Hooks... Needing some research [In reply to]

Daniel Friesen wrote:
> Timstarling commented that adding links into the preprocessor would
> change they way they are handled and would break cases like
> http://sandbox.wiki-tools.com/edit/FakeLink changing the syntax of
> WikiText in an incompatible way.

Yes, that's a good example.

> <...>However my issue lies in the |, there is no
> strict handling of those and making them "safe" is handled by parsing
> links inside of the links before the | is broken up. Works good for the
> parser, but not for anything you want to send to a callback.

A temporary hint for extension writers (until a final generic solution
is available in the framework) is to count the brackets, and to count
the pipes only while at the "main-link" level, that is:

0. Start a loop examining the string or string fragment.
1. Count UP on [ brackets.          // $BracketsCnt++
2. Count DOWN on ] brackets.        // $BracketsCnt--
3. Count | (pipes) ONLY when
   BracketsCnt equals two.          // if ( $BracketsCnt == 2 )
                                    //     $PipesCnt++
4. Break the loop if more than one  // if ( $PipesCnt > 1 ) break;
   pipe was found.

This would detect the syntax error in the link
"[[ | | ]]" as well as in "[[ | [[ | | ]] ]]" (on the second call, if
called recursively).
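
The counting steps above can be sketched as a short function. This is a
Python illustration rather than extension code, and the function name
has_too_many_pipes is invented for the example; it only checks the
outermost link of the string it is given, so nested links need a
separate call, as noted.

```python
def has_too_many_pipes(link: str) -> bool:
    """Return True if more than one pipe sits at the main-link level."""
    brackets = 0
    pipes = 0
    for ch in link:
        if ch == "[":
            brackets += 1
        elif ch == "]":
            brackets -= 1
        elif ch == "|" and brackets == 2:
            # Only pipes directly inside the outer [[ ... ]] are counted.
            pipes += 1
            if pipes > 1:
                return True
    return False
```

Called on "[[ | [[ | | ]] ]]" this returns False, because the two pipes
of the inner link are seen at a bracket count of four; calling it again
on the inner "[[ | | ]]" returns True.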


> So the plan is to actually build an object tree similar to the Frames
> and Parts the preprocessor uses. This'll allow for better handling of
> things inside of callbacks.


Which PHP file do you recommend I start looking at for this logic?

Regards,

// Rolf Lampa




dan_the_man at telus

Aug 17, 2008, 6:36 PM

Post #9 of 9
Re: New Project, Link Hooks... Needing some research [In reply to]

The logic for links lies in parser/Parser.php's
Parser::replaceInternalLinks (actually, it's now
Parser::replaceInternalLinks2 you should look at). TimStarling also made
some recent changes, so you may also want to look at
parser/LinkHolderArray.php.

For the preprocessor stuff, see parser/Preprocessor_DOM.php; however, the
logic there is actually a fair bit more complex than what we'll need.

I'm trying to find a way to get nested things to work right without
ruining TimStarling's recent improvements to the memory use and speed of
that area of the parser.
I did a small benchmark between:
A) Recursive call: find [[, walk to the closing ]], and do a recursive
call for the stuff in between. (This is similar to what we currently do,
though we limit it to a depth of 2.)
B) Markers and a single hashtable: as we find [['s we build a stack of
offsets; when a ]] is found we pop the last offset, create a new token
in the hashtable with the contents in between, and replace the text
with a marker. (When expanding, though, we need to expand multiple times,
because the content of markers can have markers inside of it as well.)
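
For readers who want to see approach B in miniature, here is a Python
sketch of the marker-and-hashtable idea. It is not the benchmark code
and not MediaWiki's implementation: the marker format and the names
MARKER, MARKER_RE, tokenize and expand are assumptions made for the
example, and the expansion step simply restores the original brackets
instead of rendering anything.

```python
import re
from typing import Dict, List, Tuple

MARKER = "\x7fLINK-{}\x7f"                 # hypothetical marker format
MARKER_RE = re.compile(r"\x7fLINK-\d+\x7f")


def tokenize(text: str) -> Tuple[str, Dict[str, str]]:
    """Replace every balanced [[...]] with a marker, innermost first."""
    table: Dict[str, str] = {}   # marker -> contents between the brackets
    stack: List[int] = []        # offsets (in out) of unmatched "[["
    out = ""
    pos = 0
    while pos < len(text):
        if text.startswith("[[", pos):
            stack.append(len(out))
            out += "[["
            pos += 2
        elif text.startswith("]]", pos) and stack:
            start = stack.pop()
            marker = MARKER.format(len(table))
            table[marker] = out[start + 2:]
            out = out[:start] + marker
            pos += 2
        else:
            out += text[pos]
            pos += 1
    return out, table


def expand(text: str, table: Dict[str, str]) -> str:
    """Expand markers repeatedly, since contents can hold more markers."""
    while True:
        expanded = MARKER_RE.sub(
            lambda m: "[[" + table[m.group(0)] + "]]", text)
        if expanded == text:
            return expanded
        text = expanded
```

Because the inner link is replaced before the outer ]] is reached, each
table entry is flat text plus markers, which is why expansion has to
loop until nothing changes.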

For a really flat setup A) and B) are similar, though A) does have a
slightly lower footprint. (Do note that this test is flat string
replacement recursion; there is no link holder setup, and we don't run
the setup code that we would be running multiple times with the
recursion, so an actual Parser implementation would likely be heavier.)
However, when you get into an insane level of nested brackets, A) starts
to take 10x the time that B) takes. This would be why we limit to a
depth of 2 recursions, but it may actually make using B) to create a
tree possible. Of course, links would never be nested like that, but
when you are using a different order of parsing it does become
necessary.

I'm considering creating another parser (inheriting from Parser) in
order to start experimenting and working on a different order. That
would allow us to use Parser_DiffTest to make sure that syntax remains
the same for all use cases (and would also allow us to benchmark).

~Daniel Friesen (Dantman, Nadir-Seen-Fire)
