Shell script help needed - GoFuckYourself.com

acctman · 08-06-2011, 12:51 PM

Can someone who knows shell scripting spot my problem? Everything appears to be correct, but it's returning no results.

This is the HTML that contains the item_id value (e.g. 55963573) that I need to collect:

Code:

<a href="http://www.domain.com/vendors/cat.html?item_id=55963573" 
onclick="itemPlayPlop.open(this.href); return false;">

shell script

Code:

while read prodName;
do
  wget -q -U Mozilla "http://www.domain.com/$prodName/" -O - \
  | tr '"' '\n' | grep "^?item_id=" | cut -d ' ' -f 4 >> itemIDs.txt
done < catNames.txt

Thanks in advance.

critical · 08-06-2011, 01:33 PM

Check to make sure the domain you are querying is actually returning results to you. A smart admin blocks queries from wget to DB/query servers to avoid certain DDoS attacks, while a smart coder sets wget's user agent to match that of Mozilla or another popular web browser so it does not look automated. Set wget to look like a browser and see if you get better results. Code looks straight.

:-)
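
A rough way to run that first check is to count the bytes the server sends back before any parsing happens. This is only a sketch: check_body is a made-up helper name, and the URL in the usage comment is a placeholder, not a real endpoint.

```shell
# Hypothetical helper: read a response body on stdin and report its size.
check_body() {
  # wc -c may pad its output with spaces on some systems, so strip them.
  bytes=$(wc -c | tr -d '[:space:]')
  if [ "$bytes" -eq 0 ]; then
    echo "empty response"
  else
    echo "got $bytes bytes"
  fi
}

# Usage (placeholder URL and product name):
# wget -q -U Mozilla "http://www.domain.com/someProduct/" -O - | check_body
```

If this reports an empty response, the problem is the fetch rather than the grep/cut stage.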

acctman · 08-06-2011, 02:06 PM

Quote:

Originally Posted by critical

Check to make sure the domain you are querying is actually returning results to you. A smart admin blocks queries from wget to DB/query servers to avoid certain DDoS attacks, while a smart coder sets wget's user agent to match that of Mozilla or another popular web browser so it does not look automated. Set wget to look like a browser and see if you get better results. Code looks straight.

:-)

Weird, because I used similar code to get the product names:

Code:

for page in {1..50}
do
        wget -q -U Mozilla "http://www.domain.com/catalog_search/cat?p=$page" -O - \
         | tr '"' '\n' | grep "^Product photo for " | cut -d ' ' -f 4 >> catNames.txt
        sleep 15
done

V_RocKs · 08-06-2011, 02:16 PM

No idea how to help you without the data example.

Barry-xlovecam · 08-06-2011, 02:49 PM

From the manual:

Quote:

'-U agent-string'
'--user-agent=agent-string'
Identify as agent-string to the HTTP server.

The HTTP protocol allows the clients to identify themselves using a User-Agent header field. This enables distinguishing the WWW software, usually for statistical purposes or for tracing of protocol violations. Wget normally identifies as 'Wget/version', version being the current version number of Wget.

However, some sites have been known to impose the policy of tailoring the output according to the User-Agent-supplied information. While this is not such a bad idea in theory, it has been abused by servers denying information to clients other than (historically) Netscape or, more frequently, Microsoft Internet Explorer. This option allows you to change the User-Agent line issued by Wget. Use of this option is discouraged, unless you really know what you are doing.

Specifying empty user agent with '--user-agent=""' instructs Wget not to send the User-Agent header in HTTP requests.

http://www.gnu.org/software/wget/man....html#Invoking

acctman · 08-06-2011, 02:53 PM

Quote:

Originally Posted by V_RocKs

No idea how to help you without the data example.

This is the HTML line I'm interested in. I need to extract 55963573.

Code:

<a href="http://www.domain.com/vendors/cat.html?item_id=55963573" 
onclick="itemPlayPlop.open(this.href); return false;">

raymor · 08-06-2011, 05:23 PM

It appears one problem is that you've anchored the grep:

Code:

grep "^?item_id="

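In your example "?item_id" isn't the beginning of a line, so the ^ anchor means
nothing matches. Even after tr splits the input on double quotes, the line that
holds the ID starts with http://, not ?item_id=. Also, remember ? is a
metacharacter in extended regular expressions (grep -E); plain grep treats it
literally, so it is not the culprit here.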
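
A minimal sketch of one way around the anchor, assuming the markup looks like the sample posted earlier in the thread: let grep -o print only the item_id=NNN fragment, then cut on "=" to keep the number. extract_item_ids is a made-up helper name, the sample HTML is fed in through a here-document instead of wget, and grep -o is assumed to be available (GNU and BSD grep both have it).

```shell
# Hypothetical helper: read HTML on stdin, print one item_id value per line.
extract_item_ids() {
  # -o prints only the matched text, so no ^ anchor is needed;
  # cut then keeps the part after the "=".
  grep -o 'item_id=[0-9][0-9]*' | cut -d '=' -f 2
}

# Sample input copied from the first post.
extract_item_ids <<'EOF'
<a href="http://www.domain.com/vendors/cat.html?item_id=55963573"
onclick="itemPlayPlop.open(this.href); return false;">
EOF
```

In the original loop, the tr/grep/cut stages could be swapped for a single pipe into extract_item_ids before the append to itemIDs.txt.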

You'll probably not get much more help without posting your actual code with the
real URL so somebody can see what is going on. When you obfuscate things you may
as well ask why this doesn't work:

Code:

some code
   some more code 
 also code
if code then
do some stuff
fi 
< input I'm not showing you
