Node.js Web Scraper: Node Osmosis
Osmosis is a Node.js module for parsing HTML/XML and scraping web content.
Features
 - Fast: uses libxml C bindings
 - Lightweight: no dependencies like jQuery, cheerio, or jsdom
 - Clean: promise-based interface, no more nested callbacks
 - Flexible: supports both CSS and XPath selectors
 - Predictable: same input, same output, same order
 - Detailed logging for every step
 - Precise and natural IO flow, no setTimeout or process.nextTick
 - Easy debugging with built-in stack size and memory usage reporting
 - Free of memory leaks
 
Example: scrape all craigslist listings
    var osmosis = require('osmosis');

    osmosis
    .get('www.craigslist.org/about/sites')
    .find('h1 + div a')
    .set('location')
    .follow('@href')
    .find('header + div + div li > a')
    .set('category')
    .follow('@href')
    .find('p > a', '.totallink + a.button.next:first')
    .follow('@href')
    .set({
        'title':        'section > h2',
        'description':  '#postingbody',
        'subcategory':  'div.breadbox > span[4]',
        'date':         'time@datetime',
        'latitude':     '#map@data-latitude',
        'longitude':    '#map@data-longitude',
        'images[]':     'img@src'
    })
    .data(function(listing) {
        // do something with listing data
    })
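
The example leaves the body of the `.data` callback open. As a rough sketch of what "do something with listing data" could look like, the helper below tidies one listing object before it is stored. `normalizeListing` is a hypothetical name and not part of Osmosis. It assumes the keys used in the `.set({...})` call above, and that scraped values arrive as strings or are missing when a selector matched nothing. The key under which `'images[]'` is delivered is also an assumption, so both spellings are checked.

```javascript
// Hypothetical helper, not part of the Osmosis API.
// Assumption: values are strings (or undefined when nothing matched).
function normalizeListing(listing) {
    var lat = parseFloat(listing.latitude);
    var lon = parseFloat(listing.longitude);
    // Assumption: the 'images[]' entry may arrive under either key.
    var images = listing.images || listing['images[]'];

    return {
        location:    listing.location || null,
        category:    listing.category || null,
        subcategory: listing.subcategory || null,
        title:       (listing.title || '').trim(),
        description: (listing.description || '').trim(),
        date:        listing.date || null,
        // null marks a coordinate that was absent or not numeric.
        latitude:    isNaN(lat) ? null : lat,
        longitude:   isNaN(lon) ? null : lon,
        // Always expose an array so callers can iterate safely.
        images:      Array.isArray(images) ? images : []
    };
}
```

Inside the callback this might be used as `results.push(normalizeListing(listing))`, where `results` is an array declared before the chain starts.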