Any good search engine scripts?
Anyone know of any good search engine scripts?
|
Do you want to run a real search engine or a meta search site?
|
Like AdultKing said, your own Google or just spider your own pages?
|
|
Quote:
|
thanks to everyone who participated
|
My own real search engine, country specific. I'm guessing, for what I have in mind, I might need to index up to 10 million pages. Any suggestions?
|
There are several ways to go about this.
Unless you are just making a raw index, you'll need pretty hefty bandwidth and hardware. You would want to cache pages so you can compare them for updates, and you'd also want some kind of cross-word index, which requires fast SQL. You also need a crawler unless you have some other way of obtaining an index.

Lucene is part of the Apache project now: http://lucene.apache.org/java/docs/index.html It's all Java, but you need to supply your own crawler agents. Hardware requirements pretty much depend on how you scale it. Typically you'd build your crawlers on separate boxes from the index; however, if you use tunnels you can run everything from one box and just farm the jobs out to specific boxes by rerouting traffic using iptables.

Webglimpse ( http://webglimpse.net/ ) is good, but it's not free and you really need to know what you are doing; it's probably better as a document management solution or perhaps for data mining. "The search engine (written in C) and webglimpse is the spider and indexer (primarily in Perl)". So it would be handy to know Perl, as crawler agents rarely do what you want out of the box.

Zebra, a tool used by many search engine researchers, is free to use, the source is available, and it can handle huge databases: https://www.indexdata.com/zebra

There are other options if you have Java or C# development capability on a moderate scale. I can provide more options if none of these suit you. It's a really big subject; without knowing exactly what you want to achieve it's hard to say which tool set you should be looking at.

The biggest barrier to entry here is the hardware you will need to do this. Working out an optimal configuration is difficult; however, if you want to cut down on resources and be able to scale, I would suggest having a main server to handle traffic in and out of your search platform, then redirecting particular types of traffic to slave servers not visible to the internet, eg: one to crawl, one to index, one to run your database, and a NAS for your disk storage - which you will need a lot of, terabytes just to start with. |
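Since Lucene keeps coming up, here is a minimal sketch of what feeding one crawled page into a Lucene index looks like in Java. It is only an illustration: the index path, field names and page text are made up, and the exact constructors vary between Lucene versions.

Code:
// Minimal Lucene indexing sketch. Index path and field names are
// placeholders; constructor details differ between Lucene releases.
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

import java.nio.file.Paths;

public class SimpleIndexer {
    public static void main(String[] args) throws Exception {
        try (IndexWriter writer = new IndexWriter(
                FSDirectory.open(Paths.get("/data/index")),        // assumed index location
                new IndexWriterConfig(new StandardAnalyzer()))) {

            // In a real platform the crawler hands pages to the indexer;
            // here one page is added by hand as an example.
            Document doc = new Document();
            doc.add(new StringField("url", "http://example.com/", Field.Store.YES));
            doc.add(new TextField("body", "page text extracted by the crawler", Field.Store.NO));
            writer.addDocument(doc);
        }
    }
}

The query side is a similar amount of code using IndexSearcher and a QueryParser; the crawler still has to be supplied separately, as noted above.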
Here are some others:
http://www.htdig.org/
http://openfts.sourceforge.net/
http://www.namazu.org/

I have plenty more. Tell us a little bit more about your expectations, what the system needs to do and what type of index you want to create. Do you need a crawler, or will you write one? Do you want full text search, or keyword-based cross searching? |
I would use Terrier. http://terrier.org
Lucene is OK but not as easy to extend, although you probably won't be able to get very far with anything unless you have studied IR in depth. Also, links pulled. |
One observation: 10 million pages is not many. If it's country specific, then surely you would need to index more than that? Keep in mind that your published index does not necessarily equal the number of sites you crawl.
We run a porn-only search engine; we index millions of pages, but we are crawling many times that number and discarding pages from the index because they are not porn sites. The web space is really big: Netcraft estimated in 2008 that there were 152 million sites at the start of the year and 182 million later that same year, and the net has grown exponentially since then, so according to Netcraft there would be about 255 million sites at the end of 2010.

Now, that number is sites; some sites have hundreds of pages or more, so the crawling effort is mind-bogglingly huge. You will also need to think about what tools you will need to manage such a huge data set. We use 3D modelling to map out the web space as we see it; it enables us to detect holes in our crawling efforts and then address them with updates to our crawler code.

You will probably find that once you choose a tool set you will want to alter it to suit your specific needs. We started with a code base for a crawler and indexer, then changed it to suit our needs; what we ended up with bears little resemblance to what we started out with. So you'd need programmer support to keep your tools doing what you need them to. |
Quote:
|
Quote:
I want to keep it as simple as possible with the minimum load on a server. At this stage it may be something smaller: a state or a city, or a subject. Not really sure; still in the early stages of this. Any advice is appreciated. Do you know of any examples of anyone using Terrier? |
Quote:
While thinking about architecture, a good way to look at things is that you have various components which go together to form a search platform: a crawler, an indexer, a database management component and the data set itself. One program won't handle the whole thing, so it won't be a script as suggested in your original question; it will be a series of programs running independently within a system, as outlined below. |
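To make that component split concrete, here is a rough outline of the pieces in Java. The interface and method names are invented for illustration; in practice each piece is a separate long-running program, often on its own box, rather than classes in one process.

Code:
// Rough outline of the components described above. Names are illustrative.
import java.util.List;

interface Crawler {
    // Fetch a page and report any newly discovered URLs back to the frontier.
    List<String> fetch(String url) throws Exception;
}

interface Indexer {
    // Turn a fetched page into entries in the inverted index.
    void index(String url, String pageText);
}

interface QueryEngine {
    // Answer an end-user query from the index, best results first.
    List<String> search(String query, int maxResults);
}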
Quote:
Read this; it's probably one of the best comparisons of open source search engines you could read: http://wrg.upf.edu/WRG/dctos/Middleton-Baeza.pdf |
Quote:
|
bookmarked
|
Sorry for the threadjack, but what about a search engine that would only search certain sites (not one's own) or maybe certain file types? Something off the shelf?
|
http://www.sphinxsearch.com is awesome!
|
Quote:
Searching a group of sites is easier; if you know the sites you want to index, then you're not having to discover the sites in the first place. If, however, you mean searching types of sites, that is the effort our search engine is working on: simply indexing porn sites and almost nothing else. The crawling effort is huge, but the indexing effort is less of an issue, as many sites are discarded because they are not porn sites. Off the shelf is where things get difficult, because the differences between search efforts mean that anything off the shelf needs a certain amount of customization. |
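For the "known list of sites" case, the crawler side really can be that simple: check every discovered link against a fixed allow list of hosts and drop everything else, so there is no discovery problem at all. A small sketch in Java, with placeholder host names:

Code:
// Only crawl URLs whose host is on a fixed allow list.
// The host names below are placeholders, not real sites.
import java.net.URI;
import java.util.Set;

public class SiteListFilter {
    private static final Set<String> ALLOWED_HOSTS = Set.of(
            "site-one.example", "site-two.example", "site-three.example");

    public static boolean shouldCrawl(String url) {
        try {
            String host = URI.create(url).getHost();
            return host != null && ALLOWED_HOSTS.contains(host.toLowerCase());
        } catch (IllegalArgumentException e) {
            return false; // malformed URL, skip it
        }
    }
}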
Quote:
|
Well, stay with just searching certain sites. There are cheap scripts that search certain tube sites and embed the videos on your site.
Quote:
|
Quote:
That would be the type of task you could employ Sphinx or Zebra to do quite trivially. |
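As a rough illustration of the Sphinx route: searchd speaks the MySQL wire protocol (SphinxQL), so you can query it from Java with a stock MySQL JDBC driver. The port, the index name ("sites") and the idea that document IDs map back to URLs elsewhere are assumptions made for this sketch, not anything specific from the thread.

Code:
// Hedged sketch of querying a Sphinx index over SphinxQL from Java.
// Assumes searchd is listening on its default SphinxQL port (9306),
// an index called "sites" exists, and MySQL Connector/J is on the
// classpath; some driver versions need extra connection flags.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class SphinxQuery {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection("jdbc:mysql://127.0.0.1:9306/");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "SELECT id FROM sites WHERE MATCH('full text query') LIMIT 10")) {
            while (rs.next()) {
                System.out.println("matched document id: " + rs.getLong("id"));
            }
        }
    }
}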
Yeah, not looking for a programmer atm, just wondering if there is anything off the shelf.
|
Quote:
http://www.ivinco.com/blog/using-wge...search-engine/ |
Thank you again for all this information, AdultKing, you really know your stuff. I was thinking about this and think I'll start with something small as a trial-and-error sort of thing. What I want to do is add the websites to the engine myself, not have any sort of crawler. What do you recommend for that?
|
dazzling, maybe you need something like a directory script then, and just add and organize links. With no sorting and filtering of links you could probably get away with a basic WP install.
|
Google, Yahoo etc. use economies of scale.
You can't search 5 billion pages in 0.02 seconds on one server. The query is split over the whole network: distributed processing. It takes the same total computing power to handle 200 million queries a day, but by using parallel processing you get the results 1000 times quicker. Each server tackles under a million web pages. Ten million pages is about the limit for a single-server search engine. |
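A toy illustration of that fan-out, assuming an invented Shard interface: the front end sends the same query to every shard in parallel, each shard searches only its own slice of the index, and the front end merges the per-shard results into one ranked list.

Code:
// Toy fan-out over index shards. Shard, Hit and the shard count are
// invented for the example; real systems add timeouts, retries, etc.
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class FanOutSearch {
    record Hit(String url, double score) {}

    interface Shard {
        List<Hit> search(String query, int k); // searches this shard's pages only
    }

    static List<Hit> search(List<Shard> shards, String query, int k) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(shards.size());
        try {
            List<Callable<List<Hit>>> tasks = new ArrayList<>();
            for (Shard s : shards) {
                tasks.add(() -> s.search(query, k));
            }
            // Every shard works on the query at the same time.
            List<Hit> merged = new ArrayList<>();
            for (Future<List<Hit>> f : pool.invokeAll(tasks)) {
                merged.addAll(f.get());
            }
            // Keep the best k hits across all shards.
            merged.sort(Comparator.comparingDouble(Hit::score).reversed());
            return merged.subList(0, Math.min(k, merged.size()));
        } finally {
            pool.shutdown();
        }
    }
}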
Quote:
Sphider of course won't scale; you will be able to index a few thousand sites reliably, but it will begin to break down, and Sphider certainly won't index a whole country. At this micro level it would be possible to add affiliate links to the search results, to potentially earn money from the links clicked on from search results. The question remains, though: are people likely to use something so limited at an end-user level? |
Quote:
Quote:
While a query on the index can consume the processing power of 1,000 machines, according to early reports of the Google platform, this is misleading, as the machines are built such that you are really just dealing with component systems which may themselves be built from many "servers". There is a very good paper on using GPUs within a search architecture at http://koala.poly.edu/GPU.pdf |
useful thread...
|
A few years ago I used a really nice search engine script from Fluid Dynamics:
http://www.xav.com/scripts/search/ The script was great but very, very limited in the number of websites you could add, and it used up way too much CPU. It was good for what I was doing at the time, though; the problem was the guy stopped development on the script. I think what I would like to do is move in stages: start with a small project so I can learn, then move into something bigger later on. |
Quote:
|
Quote:
You'll find Sphider comparable, although PHP based rather than Perl. The problem with these types of scripts is that they are scripts, and a script won't become a real search engine. You just can't do the type of things you need to do to run a search engine from a script. You need several programs running with one or more databases; a basic search engine platform will consist of a crawler, an indexer and a query engine at a minimum.

It is possible to create a real search engine on one server, but on a limited scale. One of our test/development servers for PornoBug indexes a realm of web space, approximately 2.5 million sites, on one Xeon server with 6 TB of disk. The machine is running under constant load and only runs as a search engine, but it does have a crawler, indexer and query interface all on the one machine. It crawls 100,000 pages a day, and sites within the realm are typically visited every 2 to 3 days. |
Quote:
A crawler really needs to be able to work quickly; ours are all written in C and are very nimble, and not one line of code is included unless it is absolutely necessary for the crawler to work. |
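For a sense of what the core loop of a crawler agent does, here is a very small sketch. It's in Java rather than the C described above, and it leaves out everything a production crawler needs on top: robots.txt handling, URL deduplication, per-host rate limits, HTML parsing that is more robust than a regex, and so on.

Code:
// Tiny fetch-and-extract-links sketch of a crawler's inner loop.
// Illustration only; a real crawler is far stricter about politeness
// and parsing.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TinyCrawler {
    private static final Pattern LINK = Pattern.compile("href=[\"'](http[^\"']+)[\"']");
    private static final HttpClient CLIENT = HttpClient.newHttpClient();

    static List<String> fetchLinks(String url) throws Exception {
        HttpRequest req = HttpRequest.newBuilder(URI.create(url)).GET().build();
        String body = CLIENT.send(req, HttpResponse.BodyHandlers.ofString()).body();
        List<String> links = new ArrayList<>();
        Matcher m = LINK.matcher(body);
        while (m.find()) {
            links.add(m.group(1)); // newly discovered URL for the frontier
        }
        Thread.sleep(1000); // crude politeness delay between fetches
        return links;
    }
}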