Welcome to the GoFuckYourself.com - Adult Webmaster Forum forums.

Old 08-06-2011, 12:51 PM   #1
acctman
Confirmed User
 
Join Date: Oct 2003
Location: Atlanta
Posts: 2,840
Shell script help needed

Can someone who knows shell scripting spot my problem? Everything appears to be correct, but it's returning no results.

This is the HTML that contains the item_id value (ex: 55963573) that I need to collect:
Code:
<a href="http://www.domain.com/vendors/cat.html?item_id=55963573" 
onclick="itemPlayPlop.open(this.href); return false;">
shell script
Code:
while read prodName;
do
  wget -q -U Mozilla "http://www.domain.com/$prodName/" -O - \
  | tr '"' '\n' | grep "^?item_id=" | cut -d ' ' -f 4 >> itemIDs.txt
done < catNames.txt
thanks in advance
Old 08-06-2011, 01:33 PM   #2
critical
Confirmed User
 
Join Date: Aug 2009
Posts: 478
Check to make sure the domain you are querying is actually returning results to you. A smart admin blocks queries from wget to db/query servers to avoid certain DDoS attacks, while a smart coder sets the client settings in wget to match those of Mozilla or another popular web browser so it does not look automated. Set wget to look like a browser and see if you get better results. Code looks straight.

:-)
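
A rough sketch of one way to run that check: pipe a fetched page into a small helper and see whether anything came back and whether it contains the item_id links at all. The function name `inspect_page` and the `somecat` path in the usage comment are made up for illustration; nothing here is from the original script.

Code:
```shell
#!/bin/sh
# Hypothetical helper (not from the original script): reads one fetched
# page on stdin and reports roughly how many bytes came back and how
# many lines mention item_id=.
inspect_page() {
  page=$(cat)
  # Byte count (trailing newlines are dropped by the command substitution).
  bytes=$(printf '%s' "$page" | wc -c | tr -d ' ')
  # grep -c prints 0 when nothing matches.
  hits=$(printf '%s\n' "$page" | grep -c 'item_id=')
  echo "bytes=$bytes item_id_lines=$hits"
}

# Usage against the live site (URL and category name are placeholders):
#   wget -q -U Mozilla "http://www.domain.com/somecat/" -O - | inspect_page
```

If bytes is 0, the server is probably returning nothing to wget. If bytes looks fine but item_id_lines is 0, the page is arriving and the problem is further down the pipeline.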
Old 08-06-2011, 02:06 PM   #3
acctman
Confirmed User
 
Join Date: Oct 2003
Location: Atlanta
Posts: 2,840
Quote:
Originally Posted by critical
Check to make sure the domain you are querying is actually returning results to you. A smart admin blocks queries from wget to db/query servers to avoid certain DDoS attacks, while a smart coder sets the client settings in wget to match those of Mozilla or another popular web browser so it does not look automated. Set wget to look like a browser and see if you get better results. Code looks straight.

:-)
Weird, because I used similar code to get the product names:

Code:
for page in {1..50}
do
        wget -q -U Mozilla "http://www.domain.com/catalog_search/cat?p=$page" -O - \
         | tr '"' '\n' | grep "^Product photo for " | cut -d ' ' -f 4 >> catNames.txt
        sleep 15
done
Old 08-06-2011, 02:16 PM   #4
V_RocKs
Damn Right I Kiss Ass!
 
Join Date: Dec 2003
Location: Cowtown, USA
Posts: 32,409
No idea how to help you without the data example.
Old 08-06-2011, 02:49 PM   #5
Barry-xlovecam
It's 42
 
Join Date: Jun 2010
Location: Global
Posts: 18,083
From the manual:

Quote:
'-U agent-string'
'--user-agent=agent-string'
Identify as agent-string to the http server.

The http protocol allows the clients to identify themselves using a User-Agent header field. This enables distinguishing the www software, usually for statistical purposes or for tracing of protocol violations. Wget normally identifies as 'Wget/version', version being the current version number of Wget.

However, some sites have been known to impose the policy of tailoring the output according to the User-Agent-supplied information. While this is not such a bad idea in theory, it has been abused by servers denying information to clients other than (historically) Netscape or, more frequently, Microsoft Internet Explorer. This option allows you to change the User-Agent line issued by Wget. Use of this option is discouraged, unless you really know what you are doing.

Specifying empty user agent with '--user-agent=""' instructs Wget not to send the User-Agent header in http requests.
http://www.gnu.org/software/wget/man....html#Invoking
Old 08-06-2011, 02:53 PM   #6
acctman
Confirmed User
 
Join Date: Oct 2003
Location: Atlanta
Posts: 2,840
Quote:
Originally Posted by V_RocKs
No idea how to help you without the data example.
This is the HTML line I'm interested in. I need to extract 55963573:
Code:
<a href="http://www.domain.com/vendors/cat.html?item_id=55963573" 
onclick="itemPlayPlop.open(this.href); return false;">
Old 08-06-2011, 05:23 PM   #7
raymor
Confirmed User
 
Join Date: Oct 2002
Posts: 3,745
It appears one problem is that you've anchored the grep:

Code:
grep "^?item_id="
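In your example "?item_id" isn't the beginning of a line, so the ^ anchor means nothing matches. After the tr, the line starts with "http://", not with "?item_id=". Also, remember ? is a metacharacter in extended regular expressions (grep -E); plain grep treats it as a literal character, but it is safer to leave it out of the pattern. Note too that the matching line contains no spaces, so cut -d ' ' -f 4 will not isolate the number; split on "=" instead.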
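
A minimal sketch of an extraction that should work on the sample line from the first post, assuming the real pages look like that sample. It drops the ^ anchor, uses grep -o to print only the matching part, and splits on "=" rather than on spaces. The function name `extract_item_ids` is made up for illustration, and grep -o is available in GNU and BSD grep but is not required by POSIX.

Code:
```shell
#!/bin/sh
# Hypothetical helper: reads HTML on stdin and prints one item_id
# value per line.
extract_item_ids() {
  # -o prints only the matched text, e.g. item_id=55963573;
  # cut then keeps what follows the "=".
  grep -o 'item_id=[0-9][0-9]*' | cut -d '=' -f 2
}

# Sample line from the first post; prints 55963573.
printf '%s\n' '<a href="http://www.domain.com/vendors/cat.html?item_id=55963573" ' \
  | extract_item_ids

# In the original loop this would replace the tr | grep | cut stage:
#   while read -r prodName; do
#     wget -q -U Mozilla "http://www.domain.com/$prodName/" -O - \
#       | extract_item_ids >> itemIDs.txt
#   done < catNames.txt
```

Because grep -o pulls the match out of the middle of a line, the tr step that split the page on double quotes is no longer needed.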

You'll probably not get much more help without posting your actual code with the
real URL so somebody can see what is going on. When you obfuscate things you may
as well ask why this doesn't work:

Code:
some code
   some more code 
 also code
if code then
do some stuff
fi 
< input I'm not showing you