[Plugin] PawnScraper
#1

PawnScraper



A powerful scraper plugin that provides an interface for utilising HTML parsers and CSS selectors in Pawn.
Installing

Thanks to Southclaws, plugin installation is now much easier with sampctl:

PHP Code:
sampctl p install Sreyas-Sreelal/pawn-scraper 
OR
  • Download suitable binary files from releases for your operating system
  • Add it to your plugins folder
  • Add PawnScraper to the plugins line of server.cfg (PawnScraper.so on Linux)
  • Add pawnscraper.inc to your includes folder
Building
  • Clone the repo

    PHP Code:
    git clone https://github.com/Sreyas-Sreelal/pawn-scraper.git 
  • Compile the plugin using the nightly Rust toolchain
    • Windows
      PHP Code:
      cargo +nightly-i686-pc-windows-msvc build --release 
    • Linux
      PHP Code:
      cargo +nightly-i686-unknown-linux-gnu build --release 
API
  • ParseHtmlDocument(document[])
    • Params
      • document[] - string of html document
    • Returns
      • Html document instance id
      • if failed to parse document INVALID_HTML_DOC is returned
    • Example Usage

      PHP Code:
      new Html:doc = ParseHtmlDocument("\
          <!DOCTYPE html>\
          <meta charset=\"utf-8\">\
          <title>Hello, world!</title>\
          <h1 class=\"foo\">Hello, <i>world!</i></h1>\
          "
      );
      ASSERT(doc != INVALID_HTML_DOC);
      DeleteHtml(doc); 
  • ResponseParseHtml(Response:id)
    • Params
      • id - Http response id returned from HttpGet
    • Returns
      • Html document instance id
      • if failed to parse document INVALID_HTML_DOC is returned
    • Example Usage

      PHP Code:
      new Response:response = HttpGet("https://www.sa-mp.com");
      new Html:doc = ResponseParseHtml(response);
      ASSERT(doc != INVALID_HTML_DOC);
      DeleteHtml(doc);
  • HttpGet(url[],Header:headerid=INVALID_HEADER)
    • Params
      • url[] - Url of a website
      • headerid - id of a header object created using CreateHeader (optional)
    • Returns
      • Response id if successful
      • INVALID_HTTP_RESPONSE if the request failed
    • Example Usage

      PHP Code:
      new Response:response = HttpGet("https://www.sa-mp.com");
      ASSERT(response != INVALID_HTTP_RESPONSE);
      DeleteResponse(response); 
  • HttpGetThreaded(playerid,callback[],url[],Header:headerid=INVALID_HEADER)
    • Params
      • playerid - id of the player
      • callback[] - name of the callback function to handle the response.
      • url[] - Url of a website
      • headerid - id of a header object created using CreateHeader (optional)
    • Example Usage
      PHP Code:
      HttpGetThreaded(0,"MyHandler","https://sa-mp.com");
      //********
      forward MyHandler(playerid,Response:responseid);
      public MyHandler(playerid,Response:responseid){
          ASSERT(responseid != INVALID_HTTP_RESPONSE);
          DeleteResponse(responseid);
      }
  • ParseSelector(string[])
    • Params
      • string[] - CSS selector
    • Returns
      • Selector instance id if successful
      • INVALID_SELECTOR if the selector could not be parsed
    • Example Usage

      PHP Code:
      new Selector:selector = ParseSelector("h1 .foo");
      ASSERT(selector != INVALID_SELECTOR);
      DeleteSelector(selector); 
  • CreateHeader(…)
    • Params
      • key,value pairs of String type
    • Returns
      • Header instance id if successful
      • INVALID_HEADER if the header could not be created
    • Example Usage

      PHP Code:
      new Header:header = CreateHeader(
          "User-Agent","Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
      );
      ASSERT(header != INVALID_HEADER);
      new Response:response = HttpGet("https://sa-mp.com/",header);
      ASSERT(response != INVALID_HTTP_RESPONSE);
      ASSERT(DeleteHeader(header) == 1);
  • GetNthElementName(Html:docid,Selector:selectorid,idx,string[],size = sizeof(string))
    • Params
      • docid - Html instance id
      • selectorid - CSS selector instance id
      • idx - the n’th occurrence of the element in the document (starts from 0)
      • string[] - buffer in which the element name is stored
      • size - sizeof string
    • Returns
      • 1 if successful
      • 0 if failed
    • Example Usage

      PHP Code:
      new Html:doc = ParseHtmlDocument("\
          <!DOCTYPE html>\
          <meta charset=\"utf-8\">\
          <title>Hello, world!</title>\
          <h1 class=\"foo\">Hello, <i>world!</i></h1>\
      "
      );
      ASSERT(doc != INVALID_HTML_DOC);
      new Selector:selector = ParseSelector("i");
      ASSERT(selector != INVALID_SELECTOR);
      new i = -1,element_name[10];
      while(GetNthElementName(doc,selector,++i,element_name)!=0){
          ASSERT(strcmp(element_name,"i") == 0);
      }
      DeleteSelector(selector);
      DeleteHtml(doc); 
  • GetNthElementText(Html:docid,Selector:selectorid,idx,string[],size = sizeof(string))
    • Params
      • docid - Html instance id
      • selectorid - CSS selector instance id
      • idx - the n’th occurrence of the element in the document (starts from 0)
      • string[] - buffer in which the element text is stored
      • size - sizeof string
    • Returns
      • 1 if successful
      • 0 if failed
    • Example Usage

      PHP Code:
      new Html:doc = ParseHtmlDocument("\
          <!DOCTYPE html>\
          <meta charset=\"utf-8\">\
          <title>Hello, world!</title>\
          <h1 class=\"foo\">Hello, <i>world!</i></h1>\
      "
      );
      ASSERT(doc != INVALID_HTML_DOC);
      new Selector:selector = ParseSelector("h1.foo");
      ASSERT(selector != INVALID_SELECTOR);
      new element_text[20];
      ASSERT(GetNthElementText(doc,selector,0,element_text) == 1);
      new check = strcmp(element_text,("Hello, world!"));
      ASSERT(check == 0);
      DeleteSelector(selector);
      DeleteHtml(doc); 
  • GetNthElementAttrVal(Html:docid,Selector:selectorid,idx,attribute[],string[],size = sizeof(string))
    • Params
      • docid - Html instance id
      • selectorid - CSS selector instance id
      • idx - the n’th occurrence of the element in the document (starts from 0)
      • attribute[] - name of the attribute to read from the element
      • string[] - buffer in which the attribute value is stored
      • size - sizeof string
    • Returns
      • 1 if successful
      • 0 if failed
    • Example Usage

      PHP Code:
      new Html:doc = ParseHtmlDocument("\
          <!DOCTYPE html>\
          <meta charset=\"utf-8\">\
          <title>Hello, world!</title>\
          <h1 class=\"foo\">Hello, <i>world!</i></h1>\
      "
      );
      ASSERT(doc != INVALID_HTML_DOC);
      new Selector:selector = ParseSelector("h1");
      ASSERT(selector != INVALID_SELECTOR);
      new element_attribute[20];
      ASSERT(GetNthElementAttrVal(doc,selector,0,"class",element_attribute) == 1);
      new check = strcmp(element_attribute,("foo"));
      ASSERT(check == 0);
      DeleteSelector(selector);
      DeleteHtml(doc); 

  • DeleteHtml(Html:id)
    • Params
      • id - html instance to be deleted
    • Returns
      • 1 if successful
      • 0 if failed

  • DeleteSelector(Selector:id)
    • Params
      • id - selector instance to be deleted
    • Returns
      • 1 if successful
      • 0 if failed

  • DeleteResponse(Response:id)
    • Params
      • id - response instance to be deleted
    • Returns
      • 1 if successful
      • 0 if failed

  • DeleteHeader(Header:id)
    • Params
      • id - header instance to be deleted
    • Returns
      • 1 if successful
      • 0 if failed
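
The API examples above use both "h1 .foo" and "h1.foo" as selectors. Under standard CSS rules these mean different things, which the following minimal sketch tries to illustrate. It assumes the include is available as pawnscraper (from pawnscraper.inc) and that the plugin follows standard CSS matching; it only prints the return values rather than asserting a particular result.

```pawn
#include <a_samp>
#include <pawnscraper>

main()
{
    new Html:doc = ParseHtmlDocument("<h1 class=\"foo\">Hello, <i>world!</i></h1>");
    if(doc == INVALID_HTML_DOC){
        return;
    }

    // Compound selector: an <h1> element that itself has class "foo".
    new Selector:compound = ParseSelector("h1.foo");
    // Descendant selector: an element with class "foo" nested inside an <h1>.
    new Selector:descendant = ParseSelector("h1 .foo");

    new text[32];
    if(compound != INVALID_SELECTOR){
        // Prints 1 if a first match exists, 0 otherwise.
        printf("h1.foo matched: %d", GetNthElementText(doc,compound,0,text));
        DeleteSelector(compound);
    }
    if(descendant != INVALID_SELECTOR){
        printf("h1 .foo matched: %d", GetNthElementText(doc,descendant,0,text));
        DeleteSelector(descendant);
    }
    DeleteHtml(doc);
}
```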

Example Usage

A small example that fetches all links on wiki.sa-mp.com:

PHP Code:
new Response:response = HttpGet("https://wiki.sa-mp.com");
if(response == INVALID_HTTP_RESPONSE){
    printf("HTTP ERROR");
    return;
}
new Html:html = ResponseParseHtml(response);
if(html == INVALID_HTML_DOC){
    DeleteResponse(response);
    return;
}
new Selector:selector = ParseSelector("a");
if(selector == INVALID_SELECTOR){
    DeleteResponse(response);
    DeleteHtml(html);
    return;
}
new str[500],i;
while(GetNthElementAttrVal(html,selector,i,"href",str)){
    printf("%s",str);
    ++i;
}
//delete created objects after the usage..
DeleteHtml(html);
DeleteResponse(response);
DeleteSelector(selector);

The same example using a threaded HTTP call would be:

PHP Code:
HttpGetThreaded(0,"MyHandler","https://wiki.sa-mp.com");
//...
forward MyHandler(playerid,Response:responseid);
public MyHandler(playerid,Response:responseid){
    if(responseid == INVALID_HTTP_RESPONSE){
        printf("HTTP ERROR");
        return 0;
    }
    new Html:html = ResponseParseHtml(responseid);
    if(html == INVALID_HTML_DOC){
        DeleteResponse(responseid);
        return 0;
    }
    new Selector:selector = ParseSelector("a");
    if(selector == INVALID_SELECTOR){
        DeleteResponse(responseid);
        DeleteHtml(html);
        return 0;
    }
    new str[500],i;
    while(GetNthElementAttrVal(html,selector,i,"href",str)){
        printf("%s",str);
        ++i;
    }
    //delete created objects after the usage..
    DeleteHtml(html);
    DeleteResponse(responseid);
    DeleteSelector(selector);
    return 1;
}


More examples can be found in examples

Repository
https://github.com/Sreyas-Sreelal/pawn-scraper

Note

The plugin is at an early stage, and more tests and features need to be added. I’m open to any kind of contribution; just open a pull request if you have anything to improve or new features to add.

Special thanks
#2

cool
#3

hot.
#4

This is really good.
#5

Amazing! Finally a well-rounded solution to the HTTP() function
#6

New version released!

https://github.com/Sreyas-Sreelal/pa...ases/tag/0.1.0

Changes
  • Added HttpGetThreaded
  • Changed reqwest to minihttp
  • Smaller binary
It still might need more tests, but the basic functionality is working as expected. Big thanks to Eva, who patiently listened to my questions and doubts and gave me guidance in certain parts.

Usage of HttpGetThreaded
pawn Code:
HttpGetThreaded(0,"MyHandler","https://wiki.sa-mp.com");
//...
forward MyHandler(playerid,Response:responseid);
public MyHandler(playerid,Response:responseid){
    if(responseid == INVALID_HTTP_RESPONSE){
        printf("HTTP ERROR");
        return 0;
    }

    new Html:html = ResponseParseHtml(responseid);
    if(html == INVALID_HTML_DOC){
        DeleteResponse(responseid);
        return 0;
    }

    new Selector:selector = ParseSelector("a");
    if(selector == INVALID_SELECTOR){
        DeleteResponse(responseid);
        DeleteHtml(html);
        return 0;
    }

    new str[500],i;
    while(GetNthElementAttrVal(html,selector,i,"href",str)){
        printf("%s",str);
        ++i;
    }

    DeleteHtml(html);
    DeleteResponse(responseid);
    DeleteSelector(selector);
    return 1;
}
#7

no
no you didnt
:O
#8

SA-MP HTTP requests are known to fail for no apparent reason, so do the HTTP calls here always succeed without bugs?
#9

Quote:
Originally Posted by AmirSavand
SA-MP HTTP requests are known to fail for no apparent reason, so do the HTTP calls here always succeed without bugs?
HTTP requests are working fine as per the tests; if you encounter any bugs, open an issue on GitHub. But do note that the main scope of this plugin is not sending HTTP requests (the plugin can only send GET requests) but parsing HTML documents and using CSS selectors. Southclaws' requests plugin already gives a better solution for HTTP requests.
#10

Nice work!

However, is there any way to send a HTTP request towards the SAMP server instead of only external URLs?
#11

Quote:
Originally Posted by fiki574
Nice work!

However, is there any way to send a HTTP request towards the SAMP server instead of only external URLs?
How are you supposed to send an HTTP request to a SA-MP server? You may try HttpGet("http://localhost"); if you have something listening on HTTP there.

Anyway, how does this plugin handle cleanup of created objects (responses, selectors etc.)?
#12

Quote:
Originally Posted by IllidanS4
Anyway, how does this plugin handle cleanup of created objects (responses, selectors etc.)?
Cleanup is done through the "Delete" functions. They are called automatically, through destructors, when a created variable goes out of scope. However, destructors do not run for variables with global or static lifetime, so users have to call these functions manually in those cases.
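
A minimal sketch of the manual cleanup for a global, under some assumptions: the include is available as pawnscraper, INVALID_SELECTOR is a compile-time constant usable as a global initializer, and g_LinkSelector is a hypothetical variable name chosen for illustration.

```pawn
#include <a_samp>
#include <pawnscraper>

// Global lifetime: per the explanation above, no destructor cleans this up.
// g_LinkSelector is a hypothetical name used only for this sketch.
new Selector:g_LinkSelector = INVALID_SELECTOR;

main() {}

public OnGameModeInit()
{
    // Parse once and keep the selector for the lifetime of the gamemode.
    g_LinkSelector = ParseSelector("a");
    return 1;
}

public OnGameModeExit()
{
    // Manual cleanup, since the variable never goes out of scope.
    if(g_LinkSelector != INVALID_SELECTOR){
        DeleteSelector(g_LinkSelector);
        g_LinkSelector = INVALID_SELECTOR;
    }
    return 1;
}
```

Resetting the variable to INVALID_SELECTOR after deletion is a design choice that guards against deleting the same instance twice.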
#13

Quote:
Originally Posted by IllidanS4
How are you supposed to send an HTTP request to a SA-MP server? You may try HttpGet("http://localhost"); if you have something listening on HTTP there.
Maybe this plugin has an implementation for starting a HTTP listener with the SAMP server, so I could (for example) send GET requests from an external app towards that listener and parse some in-game stuff I want to the response.
#14

Quote:
Originally Posted by fiki574
Maybe this plugin has an implementation for starting a HTTP listener with the SAMP server, so I could (for example) send GET requests from an external app towards that listener and parse some in-game stuff I want to the response.
That's not what this plugin is about...
#15

Quote:
Originally Posted by SyS
Cleanup is done through the "Delete" functions. They are called automatically, through destructors, when a created variable goes out of scope. However, destructors do not run for variables with global or static lifetime, so users have to call these functions manually in those cases.
I am not sure using a destructor is safe in this case. First, you ignore the size parameter, so arrays of these objects will not be destroyed properly. Second, imagine this code:
pawn Code:
new Response:globalResp;

main()
{
    new Response:resp = HttpGet("https://wiki.sa-mp.com");
    if(...)
    {
        globalResp = resp;
    }
}
When resp goes out of scope, globalResp will become invalid as well (and could potentially refer to a completely different response after a while, depending on your implementation).
#16

Quote:
Originally Posted by IllidanS4
I am not sure using a destructor is safe in this case. First, you ignore the size parameter, so arrays of these objects will not be destroyed properly. Second, imagine this code:
pawn Code:
new Response:globalResp;

main()
{
    new Response:resp = HttpGet("https://wiki.sa-mp.com");
    if(...)
    {
        globalResp = resp;
    }
}
When resp goes out of scope, globalResp will become invalid as well (and could potentially refer to a completely different response after a while, depending on your implementation).
Yes, you are right, that will result in a fault. I think I should change my approach then. Something like a borrow check, or overloading the = operator to make a clone. I don't know whether either of these is possible, though.
#17

Quote:
Originally Posted by SyS
That's not what this plugin is about...
That's why I was asking this question

Thanks for the clarification
#18

Can I get data from a tag that 'has' a class, like:
Code:
<span class="some_class_here">data_i_want_to_get</span>
using this plugin?

I'm not into HTML parsing so I don't know yet how to work with this. Thank you.
#19

Quote:
Originally Posted by fordawinzz
Can I get data from a selector that 'has' a class, like:
Code:
<span class="some_class_here">data_i_want_to_get</span>
using this plugin?

I'm not into HTML parsing so I don't know yet how to work with this. Thank you.
Use
PHP Code:
new Selector:selectclass = ParseSelector(".some_class_here");
new data[20];
GetNthElementText(your_html_doc,selectclass,0,data);//now data will have the text
DeleteSelector(selectclass);
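
If the page contains several elements with that class, the same call can be looped over the index, since idx starts from 0 and the function returns 0 on failure. A minimal sketch, assuming the include is available as pawnscraper; PrintAllMatches is a hypothetical helper name and the sample document is made up for illustration:

```pawn
#include <a_samp>
#include <pawnscraper>

// Hypothetical helper (not part of the plugin): prints the text of every
// element matched by the selector and returns how many were found.
stock PrintAllMatches(Html:doc, Selector:selector)
{
    new text[64], idx;
    // GetNthElementText returns 0 once idx runs past the last match.
    while(GetNthElementText(doc, selector, idx, text)){
        printf("match %d: %s", idx, text);
        ++idx;
    }
    return idx;
}

main()
{
    new Html:doc = ParseHtmlDocument("<span class=\"some_class_here\">first</span><span class=\"some_class_here\">second</span>");
    if(doc == INVALID_HTML_DOC){
        return;
    }
    new Selector:selectclass = ParseSelector(".some_class_here");
    if(selectclass != INVALID_SELECTOR){
        PrintAllMatches(doc, selectclass);
        DeleteSelector(selectclass);
    }
    DeleteHtml(doc);
}
```

Note that the buffer size limits how much text is copied, so size it for the longest value you expect.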
#20

Thank you so much for this plugin! Also, big thanks for this example; other ******* 2 mp3 solutions are not working anymore, but this plugin does it neatly.

Rep+=3;

