jopen
Published 9 years ago

Grab: An Open-Source Python Web Scraping Framework

Grab is an open-source web scraping framework for Python. It provides many practical methods for crawling websites and processing the scraped content:

  • Automatic cookies (session) support
  • HTTP and SOCKS proxy with and without authorization
  • Keep-Alive support
  • IDN support
  • Tools to work with web forms
  • Easy multipart file uploading
  • Flexible customization of HTTP requests
  • Automatic charset detection
  • Powerful API for extracting information from HTML documents with XPath queries
  • Asynchronous API for making thousands of simultaneous requests. This part of the library is called Spider, and it is too large to list its features here.
  • Python 3 ready
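
To make one of the listed features concrete, here is a minimal sketch of what automatic charset detection can involve. It is not Grab's implementation, which the source does not show. The helper name detect_charset and the three-step fallback order (HTTP header, then meta tag, then a default) are assumptions chosen for illustration, using only the standard library.

```python
import re


def detect_charset(content_type, body, default="utf-8"):
    """Hypothetical helper (not part of Grab): pick an encoding for a response.

    Order of preference: Content-Type header, <meta> declaration, default.
    """
    # 1. An explicit charset parameter in the HTTP header wins.
    match = re.search(r"charset=([\w-]+)", content_type or "", re.IGNORECASE)
    if match:
        return match.group(1).lower()
    # 2. Otherwise look for a <meta ... charset=...> near the top of the body.
    head = body[:2048].decode("ascii", errors="ignore")
    match = re.search(r"<meta[^>]+charset=[\"']?([\w-]+)", head, re.IGNORECASE)
    if match:
        return match.group(1).lower()
    # 3. Nothing declared: fall back to the assumed default.
    return default


header_charset = detect_charset("text/html; charset=ISO-8859-1", b"<html></html>")
meta_charset = detect_charset(
    "text/html", b'<html><head><meta charset="gb2312"></head></html>'
)
fallback_charset = detect_charset(None, b"<html></html>")
```

With a framework like Grab this step happens behind the scenes, so the caller works with decoded text rather than raw bytes.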

Grab Example

    from grab import Grab
    import logging

    # Log Grab's activity at DEBUG level
    logging.basicConfig(level=logging.DEBUG)

    g = Grab()
    # Open the login page, fill in the form fields, and submit
    g.go('https://github.com/login')
    g.set_input('login', '***')
    g.set_input('password', '***')
    g.submit()
    # Save the response body for inspection
    g.doc.save('/tmp/x.html')

    # Fail if the sign-out icon is missing, i.e. the login did not succeed
    g.doc('//span[contains(@class, "octicon-sign-out")]').assert_exists()
    home_url = g.doc('//a[contains(@class, "header-nav-link name")]/@href').text()
    repo_url = home_url + '?tab=repositories'

    # List each repository name with its absolute URL
    g.go(repo_url)
    for elem in g.doc.select('//h3[@class="repo-list-name"]/a'):
        print('%s: %s' % (elem.text(),
                          g.make_url_absolute(elem.attr('href'))))
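
The last step of the example (select links with XPath, then turn each href into an absolute URL) can be tried offline. The sketch below is a standard-library stand-in, not Grab itself: the sample markup and BASE_URL are made up, xml.etree.ElementTree replaces g.doc.select, and urllib.parse.urljoin replaces g.make_url_absolute. ElementTree supports only a subset of XPath (no contains()), so the class attribute is matched exactly.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# Made-up, well-formed markup standing in for a fetched page.
SAMPLE = """
<div>
  <h3 class="repo-list-name"><a href="/user/alpha">alpha</a></h3>
  <h3 class="repo-list-name"><a href="/user/beta">beta</a></h3>
  <h3 class="other"><a href="/user/ignored">ignored</a></h3>
</div>
"""
# Assumed URL of the page the markup came from.
BASE_URL = "https://github.com/user?tab=repositories"

root = ET.fromstring(SAMPLE.strip())

repos = []
# Select only the links under headings whose class is exactly "repo-list-name".
for link in root.findall(".//h3[@class='repo-list-name']/a"):
    # Resolve the relative href against the page URL.
    repos.append((link.text, urljoin(BASE_URL, link.get("href"))))

for name, url in repos:
    print("%s: %s" % (name, url))
```

Real pages are rarely well-formed XML, which is one reason to use a framework with an HTML-tolerant parser rather than ElementTree for actual scraping.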

Project homepage: http://www.open-open.com/lib/view/home/1440858338263

Article URL: https://www.open-open.com/lib/view/open1440858338263.html