[Plugin] PawnScraper

PawnScraper



A powerful scraper plugin that provides an interface for utilising HTML parsers and CSS selectors in Pawn.
Installing

Thanks to Southclaws, plugin installation is now much easier with sampctl:

PHP Code:
sampctl p install Sreyas-Sreelal/pawn-scraper 
OR
  • Download the binary files for your operating system from the releases page
  • Add the plugin binary to your plugins folder
  • Add PawnScraper (or PawnScraper.so on Linux) to the plugins line in server.cfg
  • Add pawnscraper.inc to your includes folder
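
As a rough illustration of the server.cfg step, the plugins line might look like the following. This assumes PawnScraper is the only plugin being loaded; if you already load others, it would be appended to the existing line instead.

```text
plugins PawnScraper
```

On Linux the same line would name PawnScraper.so.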
Building
  • Clone the repo

    PHP Code:
    git clone https://github.com/Sreyas-Sreelal/pawn-scraper.git 
  • Compile the plugin using the nightly Rust toolchain
    • Windows
      PHP Code:
      cargo +nightly-i686-pc-windows-msvc build --release 
    • Linux
      PHP Code:
      cargo +nightly-i686-unknown-linux-gnu build --release 
API
  • ParseHtmlDocument(document[])
    • Params
      • document[] - string of html document
    • Returns
      • Html document instance id
      • INVALID_HTML_DOC if the document could not be parsed
    • Example Usage

      PHP Code:
      new Html:doc = ParseHtmlDocument("\
          <!DOCTYPE html>\
          <meta charset=\"utf-8\">\
          <title>Hello, world!</title>\
          <h1 class=\"foo\">Hello, <i>world!</i></h1>\
          "
      );
      ASSERT(doc != INVALID_HTML_DOC);
      DeleteHtml(doc); 
  • ResponseParseHtml(Response:id)
    • Params
      • id - Http response id returned from HttpGet
    • Returns
      • Html document instance id
      • INVALID_HTML_DOC if the document could not be parsed
    • Example Usage

      PHP Code:
      new Response:response = HttpGet("https://www.sa-mp.com");
      new Html:doc = ResponseParseHtml(response);
      ASSERT(doc != INVALID_HTML_DOC);
      DeleteHtml(doc);
      DeleteResponse(response);
  • HttpGet(url[],Header:headerid=INVALID_HEADER)
    • Params
      • url[] - URL of a website
      • headerid - id of a header object created using CreateHeader
    • Returns
      • Response id if successful
      • INVALID_HTTP_RESPONSE if the request failed
    • Example Usage

      PHP Code:
      new Response:response = HttpGet("https://www.sa-mp.com");
      ASSERT(response != INVALID_HTTP_RESPONSE);
      DeleteResponse(response); 
  • HttpGetThreaded(playerid,callback[],url[],Header:headerid=INVALID_HEADER)
    • Params
      • playerid - id of the player
      • callback[] - name of the callback function to handle the response.
      • url[] - URL of a website
      • headerid - id of a header object created using CreateHeader
    • Example Usage
      PHP Code:
      HttpGetThreaded(0,"MyHandler","https://sa-mp.com");
      //********
      forward MyHandler(playerid,Response:responseid);
      public MyHandler(playerid,Response:responseid){
          ASSERT(responseid != INVALID_HTTP_RESPONSE);
          DeleteResponse(responseid);
      }

  • ParseSelector(string[])
    • Params
      • string[] - CSS selector
    • Returns
      • Selector instance id if successful
      • INVALID_SELECTOR if the selector could not be parsed
    • Example Usage

      PHP Code:
      new Selector:selector = ParseSelector("h1 .foo");
      ASSERT(selector != INVALID_SELECTOR);
      DeleteSelector(selector); 
  • CreateHeader(…)
    • Params
      • key, value pairs of string type
    • Returns
      • Header instance id if successful
      • INVALID_HEADER if the header could not be created
    • Example Usage

      PHP Code:
      new Header:header = CreateHeader(
          "User-Agent","Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
      );
      ASSERT(header != INVALID_HEADER);
      new Response:response = HttpGet("https://sa-mp.com/",header);
      ASSERT(response != INVALID_HTTP_RESPONSE);
      ASSERT(DeleteHeader(header) == 1);
      DeleteResponse(response);
  • GetNthElementName(Html:docid,Selector:selectorid,idx,string[],size = sizeof(string))
    • Params
      • docid - Html instance id
      • selectorid - CSS selector instance id
      • idx - the n'th occurrence of the element in the document (starts from 0)
      • string[] - buffer in which the element name is stored
      • size - sizeof string
    • Returns
      • 1 if successful
      • 0 if failed
    • Example Usage

      PHP Code:
      new Html:doc = ParseHtmlDocument("\
          <!DOCTYPE html>\
          <meta charset=\"utf-8\">\
          <title>Hello, world!</title>\
          <h1 class=\"foo\">Hello, <i>world!</i></h1>\
      "
      );
      ASSERT(doc != INVALID_HTML_DOC);
      new Selector:selector = ParseSelector("i");
      ASSERT(selector != INVALID_SELECTOR);
      new i = -1,element_name[10];
      while(GetNthElementName(doc,selector,++i,element_name) != 0){
          ASSERT(strcmp(element_name,"i") == 0);
      }
      DeleteSelector(selector);
      DeleteHtml(doc);
  • GetNthElementText(Html:docid,Selector:selectorid,idx,string[],size = sizeof(string))
    • Params
      • docid - Html instance id
      • selectorid - CSS selector instance id
      • idx - the n'th occurrence of the element in the document (starts from 0)
      • string[] - buffer in which the element text is stored
      • size - sizeof string
    • Returns
      • 1 if successful
      • 0 if failed
    • Example Usage

      PHP Code:
      new Html:doc = ParseHtmlDocument("\
          <!DOCTYPE html>\
          <meta charset=\"utf-8\">\
          <title>Hello, world!</title>\
          <h1 class=\"foo\">Hello, <i>world!</i></h1>\
      "
      );
      ASSERT(doc != INVALID_HTML_DOC);
      new Selector:selector = ParseSelector("h1.foo");
      ASSERT(selector != INVALID_SELECTOR);
      new element_text[20];
      ASSERT(GetNthElementText(doc,selector,0,element_text) == 1);
      new check = strcmp(element_text,"Hello, world!");
      ASSERT(check == 0);
      DeleteSelector(selector);
      DeleteHtml(doc);
  • GetNthElementAttrVal(Html:docid,Selector:selectorid,idx,attribute[],string[],size = sizeof(string))
    • Params
      • docid - Html instance id
      • selectorid - CSS selector instance id
      • idx - the n'th occurrence of the element in the document (starts from 0)
      • attribute[] - name of the attribute to read from the element
      • string[] - buffer in which the attribute value is stored
      • size - sizeof string
    • Returns
      • 1 if successful
      • 0 if failed
    • Example Usage

      PHP Code:
      new Html:doc = ParseHtmlDocument("\
          <!DOCTYPE html>\
          <meta charset=\"utf-8\">\
          <title>Hello, world!</title>\
          <h1 class=\"foo\">Hello, <i>world!</i></h1>\
      "
      );
      ASSERT(doc != INVALID_HTML_DOC);
      new Selector:selector = ParseSelector("h1");
      ASSERT(selector != INVALID_SELECTOR);
      new element_attribute[20];
      ASSERT(GetNthElementAttrVal(doc,selector,0,"class",element_attribute) == 1);
      new check = strcmp(element_attribute,"foo");
      ASSERT(check == 0);
      DeleteSelector(selector);
      DeleteHtml(doc);

  • DeleteHtml(Html:id)
    • Params
      • id - html instance to be deleted
    • Returns
      • 1 if successful
      • 0 if failed

  • DeleteSelector(Selector:id)
    • Params
      • id - selector instance to be deleted
    • Returns
      • 1 if successful
      • 0 if failed

  • DeleteResponse(Response:id)
    • Params
      • id - response instance to be deleted
    • Returns
      • 1 if successful
      • 0 if failed

  • DeleteHeader(Header:id)
    • Params
      • id - header instance to be deleted
    • Returns
      • 1 if successful
      • 0 if failed
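
Since every Delete function above reports 1 or 0, the cleanup at the end of a scrape can be gathered in one place. The following is only a sketch: ReleaseScrapeObjects is a hypothetical helper name and not part of the plugin, and it assumes the INVALID_* constants compare against handles the same way they do in the examples above.

```pawn
#include <a_samp>
#include <pawnscraper>

// Hypothetical helper (not part of the plugin): deletes whichever handles
// were actually created and reports whether every delete succeeded.
stock bool:ReleaseScrapeObjects(Html:doc, Selector:selector, Response:response)
{
    new bool:ok = true;

    // Skip handles that were never created; record any delete returning 0.
    if (doc != INVALID_HTML_DOC && !DeleteHtml(doc))
    {
        ok = false;
    }
    if (selector != INVALID_SELECTOR && !DeleteSelector(selector))
    {
        ok = false;
    }
    if (response != INVALID_HTTP_RESPONSE && !DeleteResponse(response))
    {
        ok = false;
    }
    return ok;
}
```

With a helper like this, each early-return path in a scrape can make a single call instead of repeating the individual Delete calls.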

Example Usage

A small example to fetch all links in wiki.sa-mp.com

PHP Code:
new Response:response = HttpGet("https://wiki.sa-mp.com");
if(response == INVALID_HTTP_RESPONSE){
    printf("HTTP ERROR");
    return;
}
new Html:html = ResponseParseHtml(response);
if(html == INVALID_HTML_DOC){
    DeleteResponse(response);
    return;
}
new Selector:selector = ParseSelector("a");
if(selector == INVALID_SELECTOR){
    DeleteResponse(response);
    DeleteHtml(html);
    return;
}
new str[500],i;
while(GetNthElementAttrVal(html,selector,i,"href",str)){
    printf("%s",str);
    ++i;
}
//delete created objects after the usage..
DeleteHtml(html);
DeleteResponse(response);
DeleteSelector(selector);

The same example with a threaded HTTP call would be:

PHP Code:
HttpGetThreaded(0,"MyHandler","https://wiki.sa-mp.com");
//...
forward MyHandler(playerid,Response:responseid);
public MyHandler(playerid,Response:responseid){
    if(responseid == INVALID_HTTP_RESPONSE){
        printf("HTTP ERROR");
        return 0;
    }
    new Html:html = ResponseParseHtml(responseid);
    if(html == INVALID_HTML_DOC){
        DeleteResponse(responseid);
        return 0;
    }
    new Selector:selector = ParseSelector("a");
    if(selector == INVALID_SELECTOR){
        DeleteResponse(responseid);
        DeleteHtml(html);
        return 0;
    }
    new str[500],i;
    while(GetNthElementAttrVal(html,selector,i,"href",str)){
        printf("%s",str);
        ++i;
    }
    //delete created objects after the usage..
    DeleteHtml(html);
    DeleteResponse(responseid);
    DeleteSelector(selector);
    return 1;
}
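
The link-printing loops in these examples stop as soon as GetNthElementAttrVal returns 0. If that also happens for a matched element that simply has no href attribute (an assumption, the API notes above only say "0 if failed"), later links would be skipped. A possible variation is to drive the loop with GetNthElementName and read the attribute and text inside it. PrintLinks below is a hypothetical helper name, not part of the plugin, and the buffer sizes are arbitrary.

```pawn
#include <a_samp>
#include <pawnscraper>

// Hypothetical helper (not part of the plugin): prints the text and href of
// every element matched by `selector` and returns how many were visited.
stock PrintLinks(Html:html, Selector:selector)
{
    new name[16], text[128], href[500], i;

    // The element name lookup is the loop condition, so iteration follows
    // the matched elements and not the presence of a particular attribute.
    while (GetNthElementName(html, selector, i, name))
    {
        if (GetNthElementAttrVal(html, selector, i, "href", href))
        {
            // If the text cannot be read, print an empty string instead.
            if (!GetNthElementText(html, selector, i, text))
            {
                text[0] = '\0';
            }
            printf("%s -> %s", text, href);
        }
        ++i;
    }
    return i;
}
```

It would be called with the html and selector handles from either example, before the Delete calls.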



More examples can be found in the examples folder of the repository.

Repository
https://github.com/Sreyas-Sreelal/pawn-scraper

Note

The plugin is at an early stage, and more tests and features still need to be added. I'm open to any kind of contribution; just open a pull request if you have anything to improve or a new feature to add.

Special thanks