Any good search engine scripts?
Anyone know of any good search engine scripts?
|
Do you want to run a real search engine or a meta search site?
|
Like AdultKing said, your own Google or just spider your own pages?
|
|
Quote:
|
thanks to everyone who participated
|
My own real search engine, country specific. I'm guessing, for what I have in mind, I might need to index up to 10 million pages. Any suggestions?
|
There are several ways to go about this.
Unless you are just making a raw index, you'll need pretty hefty bandwidth and hardware. You would want to cache pages so you can compare them for updates, and you'd also want some kind of cross-word index, which requires fast SQL. You also need a crawler unless you have some other way of obtaining an index.

Lucene is part of the Apache project now: http://lucene.apache.org/java/docs/index.html It's all Java, but you need to supply your own crawler agents. Hardware requirements pretty much depend on how you scale it. Typically you'd build your crawlers on separate boxes from the index; however, if you use tunnels you can run everything from one box and just farm the jobs out to specific boxes by rerouting traffic using iptables.

Webglimpse ( http://webglimpse.net/ ) is good, but it's not free and you really need to know what you are doing; it's probably better as a document management solution or perhaps for data mining. "The search engine (written in C) and webglimpse is the spider and indexer (primarily in Perl)". So it would be handy to know Perl, as crawler agents rarely do what you want out of the box.

Zebra, a tool used by many search engine researchers, is free to use, the source is available, and it can handle huge databases: https://www.indexdata.com/zebra

There are other options if you have Java or C# development capability on a moderate scale. I can provide more options if none of these suit you. It's a really big subject; without knowing exactly what you want to achieve it's hard to say which tool set you should be looking at.

The biggest barrier to entry here is the hardware you will need to do this. Working out an optimal configuration is difficult; however, if you want to cut down on resources and be able to scale, I would suggest having a main server to handle traffic in and out of your search platform, then redirecting particular types of traffic to slave servers not visible to the internet, eg: one to crawl, one to index, one to run your database, and a NAS for your disk storage - which you will need a lot of, terabytes just to start with. |
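Since Lucene keeps coming up, here is a minimal sketch of what feeding one crawled page into a Lucene index looks like in Java. It is only an illustration: the index path, field names and page text are made up, and the exact constructors vary between Lucene versions.

Code:
// Minimal Lucene indexing sketch. Index path and field names are
// placeholders; constructor details differ between Lucene releases.
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

import java.nio.file.Paths;

public class SimpleIndexer {
    public static void main(String[] args) throws Exception {
        try (IndexWriter writer = new IndexWriter(
                FSDirectory.open(Paths.get("/data/index")),        // assumed index location
                new IndexWriterConfig(new StandardAnalyzer()))) {

            // In a real platform the crawler hands pages to the indexer;
            // here one page is added by hand as an example.
            Document doc = new Document();
            doc.add(new StringField("url", "http://example.com/", Field.Store.YES));
            doc.add(new TextField("body", "page text extracted by the crawler", Field.Store.NO));
            writer.addDocument(doc);
        }
    }
}

The query side is a similar amount of code using IndexSearcher and a QueryParser; the crawler still has to be supplied separately, as noted above.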
Here are some others:
http://www.htdig.org/
http://openfts.sourceforge.net/
http://www.namazu.org/

I have plenty more. Tell us a little bit more about your expectations, what the system needs to do and what type of index you want to create. Do you need a crawler, or will you write one? Do you want full text search, or keyword-based cross searching? |
I would use Terrier. http://terrier.org
Lucene is OK but not as easy to extend, although you probably won't be able to get very far with anything unless you have studied IR in depth. Also, links pulled. |
One observation: 10 million pages is not many. If it's country specific, then surely you would need to index more than that? Keep in mind that your published index does not necessarily equal the number of sites you crawl.
We run a porn-only search engine; we index millions of pages, but we are crawling many times that number and discarding pages from the index because they are not porn sites. The web space is really big: Netcraft estimated in 2008 that there were 152 million sites at the start of the year and 182 million later that same year, and the net has grown exponentially since then, so according to Netcraft there would be about 255 million sites at the end of 2010.

Now, that number is sites; some sites have hundreds of pages or more, so the crawling effort is mind-bogglingly huge. You will also need to think about what tools you will need to manage such a huge data set. We use 3D modelling to map out the web space as we see it; it enables us to detect holes in our crawling efforts and then address them with updates to our crawler code.

You will probably find that once you choose a tool set you will want to alter it to suit your specific needs. We started with a code base for a crawler and indexer, then changed it to suit our needs; what we ended up with bears little resemblance to what we started out with. So you'd need programmer support to keep your tools doing what you need them to. |
Quote:
|
Quote:
I want to keep it as simple as possible with the minimum load on a server. At this stage it may be something smaller: a state or a city, or a subject. Not really sure; still in the early stages of this. Any advice is appreciated. Do you know of any examples of anyone using Terrier? |
Quote:
While thinking about architecture, a good way to look at things is that you have various components which go together to form a search platform: a crawler, an indexer, a database management component and the data set itself. One program won't handle the whole thing, so it won't be a script as suggested in your original question; it will be a series of programs running independently within a system, as outlined below. |
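To make that component split concrete, here is a rough outline of the pieces in Java. The interface and method names are invented for illustration; in practice each piece is a separate long-running program, often on its own box, rather than classes in one process.

Code:
// Rough outline of the components described above. Names are illustrative.
import java.util.List;

interface Crawler {
    // Fetch a page and report any newly discovered URLs back to the frontier.
    List<String> fetch(String url) throws Exception;
}

interface Indexer {
    // Turn a fetched page into entries in the inverted index.
    void index(String url, String pageText);
}

interface QueryEngine {
    // Answer an end-user query from the index, best results first.
    List<String> search(String query, int maxResults);
}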
Quote:
Read this; it's probably one of the best comparisons of open source search engines you could read: http://wrg.upf.edu/WRG/dctos/Middleton-Baeza.pdf |
Quote:
|
bookmarked
|
Sorry for the threadjack, but what about a search engine that would only search certain sites (not one's own) or maybe certain file types? Something off the shelf?
|
http://www.sphinxsearch.com is awesome!
|
Quote:
Searching a group of sites is easier; if you know the sites you want to index, then you're not having to discover the sites in the first place. If, however, you mean searching types of sites, that is the effort our search engine is working on: simply indexing porn sites and almost nothing else. The crawling effort is huge, but the indexing effort is less of an issue, as many sites are discarded because they are not porn sites. Off the shelf is where things get difficult, because the differences between search efforts mean that anything off the shelf needs a certain amount of customization. |
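For the "known list of sites" case, the crawler side really can be that simple: check every discovered link against a fixed allow list of hosts and drop everything else, so there is no discovery problem at all. A small sketch in Java, with placeholder host names:

Code:
// Only crawl URLs whose host is on a fixed allow list.
// The host names below are placeholders, not real sites.
import java.net.URI;
import java.util.Set;

public class SiteListFilter {
    private static final Set<String> ALLOWED_HOSTS = Set.of(
            "site-one.example", "site-two.example", "site-three.example");

    public static boolean shouldCrawl(String url) {
        try {
            String host = URI.create(url).getHost();
            return host != null && ALLOWED_HOSTS.contains(host.toLowerCase());
        } catch (IllegalArgumentException e) {
            return false; // malformed URL, skip it
        }
    }
}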
Quote:
|
Well, stay with just searching certain sites. There are cheap scripts that search certain tube sites and embed the videos on your site.
Quote:
|
Quote:
That would be the type of task you could employ Sphinx or Zebra to do quite trivially. |
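As a rough illustration of the Sphinx route: searchd speaks the MySQL wire protocol (SphinxQL), so you can query it from Java with a stock MySQL JDBC driver. The port, the index name ("sites") and the idea that document IDs map back to URLs elsewhere are assumptions made for this sketch, not anything specific from the thread.

Code:
// Hedged sketch of querying a Sphinx index over SphinxQL from Java.
// Assumes searchd is listening on its default SphinxQL port (9306),
// an index called "sites" exists, and MySQL Connector/J is on the
// classpath; some driver versions need extra connection flags.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class SphinxQuery {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection("jdbc:mysql://127.0.0.1:9306/");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "SELECT id FROM sites WHERE MATCH('full text query') LIMIT 10")) {
            while (rs.next()) {
                System.out.println("matched document id: " + rs.getLong("id"));
            }
        }
    }
}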
Yeah, not looking for a programmer atm, just wondering if there is anything off the shelf.
|
Quote:
http://www.ivinco.com/blog/using-wge...search-engine/ |
Thank you again for all this information, AdultKing, you really know your stuff. I was thinking about this and think I'll start with something small as a trial-and-error sort of thing. What I want to do is add the websites to the engine myself, not have any sort of crawler. What do you recommend for that?
|
dazzling, maybe you need something like a directory script then, and just add and organize links. With no sorting and filtering of links you could probably get away with a basic WP install.
|
Google, Yahoo etc. use economies of scale.
You can't search 5 billion pages in 0.02 seconds on one server. The query is split over the whole network: distributed processing. It takes the same total computing power to handle 200 million queries a day, but by using parallel processing you get the results 1000 times quicker. Each server tackles under a million web pages. Ten million pages is about the limit for a single-server search engine. |
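A toy illustration of that fan-out, assuming an invented Shard interface: the front end sends the same query to every shard in parallel, each shard searches only its own slice of the index, and the front end merges the per-shard results into one ranked list.

Code:
// Toy fan-out over index shards. Shard, Hit and the shard count are
// invented for the example; real systems add timeouts, retries, etc.
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class FanOutSearch {
    record Hit(String url, double score) {}

    interface Shard {
        List<Hit> search(String query, int k); // searches this shard's pages only
    }

    static List<Hit> search(List<Shard> shards, String query, int k) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(shards.size());
        try {
            List<Callable<List<Hit>>> tasks = new ArrayList<>();
            for (Shard s : shards) {
                tasks.add(() -> s.search(query, k));
            }
            // Every shard works on the query at the same time.
            List<Hit> merged = new ArrayList<>();
            for (Future<List<Hit>> f : pool.invokeAll(tasks)) {
                merged.addAll(f.get());
            }
            // Keep the best k hits across all shards.
            merged.sort(Comparator.comparingDouble(Hit::score).reversed());
            return merged.subList(0, Math.min(k, merged.size()));
        } finally {
            pool.shutdown();
        }
    }
}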
Quote:
Sphider of course won't scale; you will be able to index a few thousand sites reliably, but it will begin to break down, and Sphider certainly won't index a whole country. At this micro level it would be possible to add affiliate links to the search results, to potentially earn money from the links clicked on from search results. The question remains, though: are people likely to use something so limited at an end-user level? |
Quote:
Quote:
While a query on the index can consume the processing power of 1,000 machines, according to early reports of the Google platform, this is misleading, as the machines are built such that you are really just dealing with component systems which may themselves be built from many "servers". There is a very good paper on using GPUs within a search architecture at http://koala.poly.edu/GPU.pdf |
useful thread...
|
A few years ago I used a really nice search engine script from Fluid Dynamics:
http://www.xav.com/scripts/search/ The script was great but very, very limited in the number of websites you could add, and it used up way too much CPU. It was good for what I was doing at the time, though; the problem was the guy stopped development on the script. I think what I would like to do is move in stages: start with a small project so I can learn, then move into something bigger later on. |
Quote:
|
Quote:
You'll find Sphider comparable, although PHP based rather than Perl. The problem with these types of scripts is that they are scripts, and a script won't become a real search engine. You just can't do the type of things you need to do to run a search engine from a script. You need several programs running with one or more databases; a basic search engine platform will consist of a crawler, an indexer and a query engine at a minimum.

It is possible to create a real search engine on one server, but on a limited scale. One of our test/development servers for PornoBug indexes a realm of web space, approximately 2.5 million sites, on one Xeon server with 6 TB of disk. The machine is running under constant load and only runs as a search engine, but it does have a crawler, indexer and query interface all on the one machine. It crawls 100,000 pages a day, and sites within the realm are typically visited every 2 to 3 days. |
Quote:
A crawler really needs to be able to work quickly; ours are all written in C and are very nimble, and not one line of code is included unless it is absolutely necessary for the crawler to work. |
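For a sense of what the core loop of a crawler agent does, here is a very small sketch. It's in Java rather than the C described above, and it leaves out everything a production crawler needs on top: robots.txt handling, URL deduplication, per-host rate limits, HTML parsing that is more robust than a regex, and so on.

Code:
// Tiny fetch-and-extract-links sketch of a crawler's inner loop.
// Illustration only; a real crawler is far stricter about politeness
// and parsing.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TinyCrawler {
    private static final Pattern LINK = Pattern.compile("href=[\"'](http[^\"']+)[\"']");
    private static final HttpClient CLIENT = HttpClient.newHttpClient();

    static List<String> fetchLinks(String url) throws Exception {
        HttpRequest req = HttpRequest.newBuilder(URI.create(url)).GET().build();
        String body = CLIENT.send(req, HttpResponse.BodyHandlers.ofString()).body();
        List<String> links = new ArrayList<>();
        Matcher m = LINK.matcher(body);
        while (m.find()) {
            links.add(m.group(1)); // newly discovered URL for the frontier
        }
        Thread.sleep(1000); // crude politeness delay between fetches
        return links;
    }
}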