Gossamer Forum

Poor Searching in SQL version

I don't know why, but Links 2.0 searches better than the SQL version.

For example, if you search for: auto associates

You get a lot of links, because many links contain either the word "auto" or the word "associates".
The idea is to get only the links that contain both words.

I don't understand why Alex hasn't built a better search system into the SQL version; right now it works very badly.
Re: Poor Searching in SQL version
I think the way you described it is better than exact query searching. Searching for each word, as you said, is the way most search engines do it.
Re: Poor Searching in SQL version
I think Alex is addressing that. Links 2.0 looked for "auto associates" as a single string. In LinkSQL, the search system breaks up the search terms by whitespace, so each term, "auto" and "associates", is searched for separately.

There are two ways around this. One is to use an implicit "and" when generating the search query. This has to be done at the source code level, and might be something Alex is doing with the improved sort/search stuff.

The other way would be in the 'parse' routines, where any terms enclosed in single or double quotes would not be parsed out, or would be stored _both_ as a whitespace-parsed set of terms _AND_ as a whole term. The problem is that unless you hit the whole term, you would not return the whole phrase.
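
To make the quoted-term idea concrete, here is a rough sketch in Python (not the actual Links SQL parse routine, which is Perl). The function name parse_query and the lowercasing are assumptions made only for illustration. Quoted text is kept as a whole phrase and is also split into its words; everything else is split on whitespace.

```python
import re

def parse_query(query):
    """Split a search query into whole phrases and individual words.

    Text inside single or double quotes is kept as one phrase AND
    broken into its words; unquoted text is split on whitespace.
    """
    phrases = []
    words = []
    # Each match is a double-quoted run, a single-quoted run, or a bare word.
    for dq, sq, bare in re.findall(r'"([^"]*)"|\'([^\']*)\'|(\S+)', query):
        quoted = dq or sq
        if quoted:
            phrases.append(quoted.lower())
            words.extend(quoted.lower().split())
        elif bare:
            words.append(bare.lower())
    return phrases, words
```

With this sketch, a query typed as "auto associates" in quotes yields one phrase plus the two words, while the same query without quotes yields only the two words.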

This would increase the complexity of the search routines, but I'm sure that in a short while either Alex or someone else will address it in one of the two ways above. I know it's on my list of things to do -- using my "Keywords" field, the terms in the "Keywords" field will be parsed on ',' not on whitespace so they can be indexed as phrases such as "get well" and 'get' can be tossed out as a 'common' word.

I don't know if there is an SQL limitation on this, but I wouldn't think so, since a character string is a character string. I'm not up to that yet, though :) My old brain only absorbs all this new stuff so fast.

Re: Poor Searching in SQL version
What is the reason to have a search engine that displays every link containing any of the words you search for? People want only the best matches, not all the links that happen to have one of the words. A search engine that works that way is prehistoric.

Imagine a person searching for: auto shop

Why would I want to see all the shop links and all the auto links, if what I want is to find only an auto shop?

Come on Alex, all the new search engines work that way. The idea is also to be able to search for several words in any order, for example:

(Title:)
Auto Shop
(Description:)
We have the best selection of wheels, accessories, radios, and more for your vehicle.
(Link:)
www.carshop.com

Maybe some people want to search using the words: shop wheels, or shop radio.

Or maybe: car wheels, car radio, car accessories.

All the best search engines today look in any of the fields, in any order, to display a good match. I sent the first message about this problem over 2 months ago, and no one has done anything.

It is very important that Alex makes a better search engine if he wants to have one of the best systems around.
Re: Poor Searching in SQL version
SuperSearch,
I agree with you very much on the idea of searching in any order. If someone searches for computer conferences and perl forums, I'd like the matches to show perl conferences and/or computer forums.
I think they are working hard on the next version of links-sql, and on other new scripts they have coming out, so maybe that's why you got no reply or fix.
Re: Poor Searching in SQL version
It's a lot harder to implement than it seems. In Links 2.0 you did a flat-file search of the database: inefficient, but you could do just about anything with it. SQL makes searching efficient, but you have to put a lot into generating the query.

And, not all search engines work that way. I use them a lot and some work better than others, and some return more garbage than others.

Links has a weighted 'search' where 'hits' in different areas of the records carry more weight, and you can order your returns by this.

It works best with keywords, and if someone monitors what's being added to the database, but left alone it can do things such as give more weight to the title and URL and less weight to the description, or vice versa.

Keyword weighting is more of a benefit than any sort of plain search, no matter how complex -- especially if the weights of the 'hits' really mean something.
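
As a rough illustration of weighted hits (not the actual Links scoring code), the sketch below gives each field a weight and adds it up for every search term found in that field. The weights, the field names and the function names score and rank are all assumptions chosen for the example.

```python
# Hypothetical weights: a hit in the title counts more than one in the description.
FIELD_WEIGHTS = {"title": 3, "url": 2, "description": 1}

def score(link, terms):
    """Add the field's weight for every term found as a word in that field."""
    total = 0
    for field, weight in FIELD_WEIGHTS.items():
        field_words = link.get(field, "").lower().split()
        for term in terms:
            if term.lower() in field_words:
                total += weight
    return total

def rank(links, terms):
    """Return the links that scored above zero, best score first."""
    scored = [(score(link, terms), link) for link in links]
    hits = [pair for pair in scored if pair[0] > 0]
    hits.sort(key=lambda pair: pair[0], reverse=True)
    return [link for _, link in hits]
```

Swapping the numbers in FIELD_WEIGHTS is all it takes to favor the description over the title, or the other way around.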

I'm sure more complex search phrases will be added over time as the SQL engine is developed.

If the underlying engine is built right, there is really no limit to the complexity and features of the search that can be built in -- far beyond simple "auto dealer" or "get well" searches.

Re: Poor Searching in SQL version
this is really easy to do for links2.. not sure of a way to do it in sql.. since i'm just a beginner at it..

i do not believe there is a regular expression for this..

i know there is one that does the exact opposite...

Code:
SELECT * FROM links
WHERE CONCAT_WS(' ', title, url, description) NOT REGEXP 'term1|term2';  -- term1|term2: the unwanted terms, joined with |

but i don't think there is one to do what you want..

jerry
Re: Poor Searching in SQL version
Widgetz,
Could you tell me how to do it with links2?
Re: Poor Searching in SQL version
The way the link_index is stored, you'd have to SELECT rows that contained any of the search terms, group them by ID and return only the records that were listed for every search term. Because this is a search on the tables looking for a non-unique key, it might have to be done in a Perl hash.
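
One way the select, group and filter steps might be expressed in a single query is GROUP BY with a HAVING count. The sketch below uses Python and an in-memory SQLite database only so it can be run as-is. The link_index layout (columns word and link_id) and the sample rows are assumptions, not the real Links SQL schema.

```python
import sqlite3

def links_matching_all(conn, terms):
    """Return ids of links whose index rows cover every search term."""
    terms = sorted({t.lower() for t in terms})
    if not terms:
        return []
    placeholders = ",".join("?" for _ in terms)
    sql = (
        "SELECT link_id FROM link_index "
        f"WHERE word IN ({placeholders}) "
        "GROUP BY link_id "
        "HAVING COUNT(DISTINCT word) = ? "  # every term must be present
        "ORDER BY link_id"
    )
    return [row[0] for row in conn.execute(sql, terms + [len(terms)])]

def demo_connection():
    """Build a tiny hypothetical index: (word, link_id) pairs."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE link_index (word TEXT, link_id INTEGER)")
    conn.executemany("INSERT INTO link_index VALUES (?, ?)", [
        ("auto", 1), ("shop", 1),
        ("auto", 2), ("associates", 2),
        ("shop", 3),
    ])
    return conn
```

In this sample data only link 1 has both "auto" and "shop", so it is the only id that survives the HAVING clause for that pair of terms.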

The other way, probably how the larger search engines do it, using persistent files, caching and such, would be to extract each term into a temp table, then JOIN the tables and pick only the full hits, or JOIN the tables and rank the hits, with partial hits lower in the returned results. Then store the query and return a cookie that points to it.

This seems to be how many engines do it now, since only the first hits seem to contain all the search terms, or the whole search term, and subsequent hits are less and less relevant.
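
The temp-table idea might look roughly like this. It is only a sketch in Python over SQLite, with an assumed link_index(word, link_id) table and made-up sample rows: each term's hits go into their own temp table, and an inner JOIN across those tables keeps only the links that appear in all of them. Ranking partial hits and caching the stored query are left out.

```python
import sqlite3

def join_term_tables(conn, terms):
    """Put each term's hits in its own temp table, then JOIN them all."""
    terms = [t.lower() for t in terms]
    if not terms:
        return []
    tables = []
    for i, term in enumerate(terms):
        name = f"hits_{i}"  # generated table name, never user input
        conn.execute(f"DROP TABLE IF EXISTS temp.{name}")
        conn.execute(f"CREATE TEMP TABLE {name} (link_id INTEGER)")
        conn.execute(
            f"INSERT INTO {name} "
            "SELECT DISTINCT link_id FROM link_index WHERE word = ?",
            (term,),
        )
        tables.append(name)
    first = tables[0]
    sql = f"SELECT {first}.link_id FROM {first}"
    for name in tables[1:]:
        # An inner JOIN drops any link missing from one of the term tables.
        sql += f" JOIN {name} ON {name}.link_id = {first}.link_id"
    sql += f" ORDER BY {first}.link_id"
    return [row[0] for row in conn.execute(sql)]

def demo_index():
    """Build a tiny hypothetical index: (word, link_id) pairs."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE link_index (word TEXT, link_id INTEGER)")
    conn.executemany("INSERT INTO link_index VALUES (?, ?)", [
        ("auto", 1), ("shop", 1),
        ("auto", 2), ("associates", 2),
        ("shop", 3),
    ])
    return conn
```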

There might be a way to do it all in the SQL query, but because you are trying to compare multiple records from the same table, it's probably better to yank the hits, or x-number of hits, and then JOIN them.

There might be a slick way to do it just the same... but I'm not that far along yet.

Re: Poor Searching in SQL version
Unfortunately the way Links SQL works, it makes it very difficult to do phrase searching. Links SQL currently takes a link's title, description and any other fields, then puts each word into an index table. You can then do AND or OR searching as well as substring searching.
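
For readers who want to picture the word index, here is a simplified sketch in Python over SQLite. It is not the Links SQL code: the table layout (link_index with word and link_id), the index name and the function names are assumptions. Every distinct word of a link's title and description becomes one row, which supports OR and substring lookups but keeps no record of word order, so a phrase cannot be matched.

```python
import re
import sqlite3

def build_index(conn, links):
    """Store one (word, link_id) row per distinct word of each link."""
    conn.execute("CREATE TABLE link_index (word TEXT, link_id INTEGER)")
    conn.execute("CREATE INDEX idx_word ON link_index (word)")
    for link_id, title, description in links:
        words = set(re.findall(r"\w+", f"{title} {description}".lower()))
        conn.executemany(
            "INSERT INTO link_index VALUES (?, ?)",
            [(word, link_id) for word in sorted(words)],
        )

def search_any(conn, terms):
    """OR search: links that contain at least one of the terms."""
    terms = [t.lower() for t in terms]
    if not terms:
        return []
    placeholders = ",".join("?" for _ in terms)
    sql = (
        "SELECT DISTINCT link_id FROM link_index "
        f"WHERE word IN ({placeholders}) ORDER BY link_id"
    )
    return [row[0] for row in conn.execute(sql, terms)]

def search_substring(conn, fragment):
    """Substring search: links with an indexed word containing the fragment."""
    sql = (
        "SELECT DISTINCT link_id FROM link_index "
        "WHERE word LIKE ? ORDER BY link_id"
    )
    return [row[0] for row in conn.execute(sql, (f"%{fragment.lower()}%",))]
```

Searching this index for the single term "auto shop" finds nothing, because only "auto" and "shop" were stored as separate rows.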

An easy way to do phrase searching is to use SQL and do something like:

SELECT * FROM Links WHERE Title LIKE '%Auto Shop%'

However, this is VERY inefficient and requires a full scan of the database, something you want to avoid. An indexed search in Links SQL is at least 50x faster than a full table scan (on our DMoz database, searches came back in < 2 seconds when indexed, and > 35 seconds when not indexed).
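
The difference between the two access paths can be seen by asking a database for its query plan. The sketch below uses SQLite from Python rather than MySQL, purely because it runs without a server, so the plan wording is SQLite's and the exact text varies by version. The link_index table and idx_word index are assumptions for illustration. It says nothing about the timings quoted above.

```python
import sqlite3

def query_plan(conn, sql):
    """Return SQLite's plan description lines for a query."""
    # The human-readable detail is the last column of each plan row.
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

def demo_plans():
    """Compare a leading-wildcard LIKE with an equality lookup on an index."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Links (ID INTEGER PRIMARY KEY, Title TEXT)")
    conn.execute("CREATE TABLE link_index (word TEXT, link_id INTEGER)")
    conn.execute("CREATE INDEX idx_word ON link_index (word)")
    like_plan = query_plan(
        conn, "SELECT * FROM Links WHERE Title LIKE '%Auto Shop%'"
    )
    index_plan = query_plan(
        conn, "SELECT link_id FROM link_index WHERE word = 'auto'"
    )
    return like_plan, index_plan
```

In SQLite the LIKE query is reported as a SCAN of the Links table, while the word lookup is reported as a SEARCH that uses idx_word.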

For the next version of Links SQL I have included a search-ni.cgi which allows you to perform phrase searches like Links 2.0. It may not be that quick on very large directories, but should be good for < 30,000 links. (This was mainly in response to non-English characters which don't use spaces to separate words, and thus can't be indexed properly.)

Cheers,

Alex
Re: Poor Searching in SQL version
Alex,

I hope you launch the new version of Links very soon, because right now, looking at the poor search results of the SQL version, I can't use it.

The results you get from every search are very poor; you have to browse a lot of links to find the one you need.

I don't think it is too difficult to make a search engine work that way; maybe the people at MySQL can help.
Re: Poor Searching in SQL version
Alex,

What about what I'm doing to my version -- where items inside double quotes or single quotes are treated as a single word _AND_ parsed. If a person enters "two words" in the search box, that is used as a full text search, then, on secondary search each is parsed and tables joined.

This _is_ more complicated, but should work for larger tables since everything is still indexed. The larger the database is, the more likely the use of intermediate tables will speed things up, and in cases where the directory is in excess of 500,000 entries or gets a lot of traffic, caching the popular searches and user-session information becomes more important... but there's room to grow :)

I'm planning on about 400,000 links, but I'm growing into it, since each link is individually entered, and installed. This would be the same as a newspaper or newsletter site (as some are doing) since you want the new stuff on top, but there is no reason to remove your archives with hard disks in the $400 range for 8 gig of server-quality storage, and DVD storage just around the corner.

I'm not pushing. I can see the growth paths, and the more I play with SQL and Perl OOP the more I see the potential for expansion.

I'm just impatiently chillin' my heels, since I don't want to start to modify code that is in transition. I'd rather wait for the 'stable' version, and apply upgrades. Right now I can see large chunks of code changing.

_PLEASE_ keep a change log, so we know how to apply the upgrades as "PATCHES" rather than just replacements.

Line numbers have problems in that when we apply a patch or an upgrade, we may comment it differently for version control, and/or already have changes applied. So block-level changes are better.

I know I've asked this a few times, but my concern is that as new releases come out, I'll have to start over to install the modifications. If there is a list of what has changed, then since I keep notes in my code on which subroutines and files were changed and when, I can easily see if I can safely replace a whole file or subroutine for an upgrade, or if I have to do it line-by-line.

Is the next release still scheduled for this week???

Re: Poor Searching in SQL version
Quote:
I don't think it is too difficult to make a search engine work that way; maybe the people at MySQL can help.

It is very difficult actually. MySQL does not support automatic indexing of fields > 255 characters, so Links SQL has to do it itself.

Quote:
What about what I'm doing to my version -- where items inside double quotes or single quotes are treated as a single word _AND_ parsed. If a person enters "two words" in the search box, that is used as a full text search, then, on secondary search each is parsed and tables joined.

That means though that you can only search on phrases that are actually phrases in the text? So a search of "two words" would match above (as it is in quotes) but a search of "search box" wouldn't match? I'm not sure if this makes sense (or am I not understanding)?

The next release is still planned for Friday although it may be a beta release due to the large number of new features.

Cheers,

Alex
Re: Poor Searching in SQL version
Alex,

Because I have a keywords field, and we enter all our links ourselves, we can control the indexing process somewhat.

The problem with non-phrase searching is the relation of the two words cannot be ascertained. For instance searching for a record that contains both words [search box] would find "After an extensive search..... boxed set ..." The only way to do a phrase search is either on whole text, or by picking your phrases carefully.

We can do the latter -- "Get Well" -- for instance or "Mothers Day"

In our keywords field, we can enter:

greetings, love, mother, mothers day, feelings

and have those parsed out as individual tokens to be indexed.

This is all I mean. If there is a "Keywords" field, that field can be treated differently, in that phrases can be indexed as long as they are separated by ','

Won't work for everyone, but since whole-text searches turn up lots of garbage (as above) being able to control what is returned and the weight of a 'hit' is really important to us.
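
The comma-separated keywords idea described above can be sketched as follows. This is only an illustration in Python, not code from Links SQL, and the function name keyword_tokens and the common_words parameter are assumptions. The field is split on ',' instead of whitespace, so a phrase such as "mothers day" stays one token, and tokens on a common-word list are dropped.

```python
def keyword_tokens(keywords, common_words=frozenset()):
    """Split a Keywords field on commas, keeping multi-word phrases whole."""
    tokens = []
    for raw in keywords.split(","):
        # Trim the token and collapse any internal runs of whitespace.
        token = " ".join(raw.lower().split())
        if token and token not in common_words and token not in tokens:
            tokens.append(token)
    return tokens
```

Each returned token would then be indexed as a single entry, so a search for the phrase hits the phrase itself rather than two unrelated words.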

Re: Poor Searching in SQL version
Alex, I am waiting for the upgrade of Links SQL, today is Friday.
Re: Poor Searching in SQL version
Man, have a little patience. Alex is not a machine but a human being - and inevitably the Internet does map on people's lives. It is not even half way through the day. When he has it ready, he will make it known. Would you rather have it released hurriedly or properly - although Alex usually accomplishes both.

Dan :)
Re: Poor Searching in SQL version
You can get better search results if you follow these instructions:

http://www.gossamer-threads.com/scripts/forum/resources/Forum9/HTML/000122.html