A new kind of pub crawl

Websites like Facebook, LinkedIn and other social-media networks contain massive amounts of valuable public information. Automated web tools called web crawlers sift through these sites, pulling out information on millions of people in order to tailor search results and create targeted ads or other marketable content.
But what happens when "the bad guys" employ web crawlers? For Engin Kirda, Sy and Laurie Sternberg Interdisciplinary Associate Professor for Information Assurance in the College of Computer and Information Science and the Department of Electrical and Computer Engineering, they then become tools for spamming, phishing or targeted Internet attacks.
"You want to proÂtect the inforÂmaÂtion," Kirda said. "You want people to be able to use it, but you don't want people to be able to autoÂmatÂiÂcally downÂload conÂtent and abuse it."
Kirda and his colleagues at the University of California–Santa Barbara have developed a new software tool called PubCrawl to solve this problem. PubCrawl both detects and contains malicious web crawlers without limiting normal browsing capabilities. The team joined forces with one of the major social-networking sites to test PubCrawl, which is now being used in the field to protect users' information.
Kirda and his collaborators presented a paper on their novel approach at the 21st USENIX Security Symposium in early August. The article will be published in the proceedings of the conference this fall.
In the cybersecurity arms race, Kirda explained, malicious web crawlers have become increasingly sophisticated in response to stronger protection strategies. In particular, they have become more coordinated: Instead of using a single computer or IP address to crawl the web for valuable information, efforts are distributed across thousands of machines.
"That becomes a tougher problem to solve because it looks simÂilar to benign user traffic," Kirda said. "It's not as straightforward."
Traditional protection mechanisms, like CAPTCHAs, which operate on an individual basis, are still useful, but their deployment comes at a cost: Users may be annoyed if too many CAPTCHAs are shown. As an alternative, nonintrusive approach, PubCrawl was specifically designed with distributed crawling in mind. By identifying IP addresses with similar behavior patterns, such as connecting at similar intervals and frequencies, PubCrawl detects what it expects to be distributed web-crawling activity.
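The paper's detection machinery isn't described in detail here, but the basic idea of grouping IP addresses by similar connection timing can be sketched roughly as follows. This is a minimal illustration, not PubCrawl's actual code: the choice of feature (mean gap between requests) and the similarity tolerance are assumptions made purely for the example.

```python
from statistics import mean

def mean_interval(timestamps):
    """Mean gap between consecutive requests from one IP, in seconds."""
    ts = sorted(timestamps)
    gaps = [b - a for a, b in zip(ts, ts[1:])]
    return mean(gaps) if gaps else float("inf")

def group_similar_ips(requests_by_ip, gap_tolerance=2.0):
    """Greedily group IPs whose mean request intervals lie within a tolerance.

    Many distinct IPs connecting at nearly the same cadence are a hint of
    coordinated (distributed) crawling rather than independent users.
    """
    intervals = {ip: mean_interval(ts) for ip, ts in requests_by_ip.items()}
    groups = []  # each entry: {"gap": reference interval, "ips": [...]}
    for ip, gap in sorted(intervals.items(), key=lambda kv: kv[1]):
        for group in groups:
            if abs(gap - group["gap"]) <= gap_tolerance:
                group["ips"].append(ip)
                break
        else:
            groups.append({"gap": gap, "ips": [ip]})
    # Only groups of more than one IP are interesting as crawler candidates.
    return [g["ips"] for g in groups if len(g["ips"]) > 1]

# Toy example: three IPs requesting every ~10 seconds look coordinated;
# the fourth, with irregular gaps, stays on its own.
logs = {
    "10.0.0.1": [0, 10, 20, 30],
    "10.0.0.2": [1, 11, 21, 31],
    "10.0.0.3": [2, 12, 22, 32],
    "10.0.0.9": [0, 3, 50, 51],
}
print(group_similar_ips(logs))  # [['10.0.0.1', '10.0.0.2', '10.0.0.3']]
```

In practice a system like this would look at many more behavioral signals than a single timing statistic, but the principle is the same: coordinated machines leave correlated traces that individual users do not.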
Once a crawler is detected, the question is whether it is malicious or benign. "You don't want to block it completely until you know for sure it is malicious," Kirda explained. "Instead, PubCrawl essentially keeps an eye on it."
Potentially malicious connections can be rate-limited, and a human operator can take a closer look. If the operators decide that the activity is malicious, the IPs can also be blocked.
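That escalation path, throttle first, block only after a human confirms, might look something like the sketch below. The request-per-minute threshold and the `flag_for_review` hook are hypothetical placeholders, not part of PubCrawl itself.

```python
import time

class SuspectThrottle:
    """Rate-limit a suspected crawler IP instead of blocking it outright."""

    def __init__(self, max_requests_per_minute=30):
        self.max_rpm = max_requests_per_minute
        self.history = {}      # ip -> timestamps of recent requests
        self.blocked = set()   # IPs a human operator has confirmed as malicious

    def allow(self, ip, now=None):
        """Return True if the request may proceed, False if throttled or blocked."""
        if ip in self.blocked:
            return False
        now = time.time() if now is None else now
        window = [t for t in self.history.get(ip, []) if now - t < 60]
        window.append(now)
        self.history[ip] = window
        if len(window) > self.max_rpm:
            self.flag_for_review(ip)   # hypothetical hook: notify an operator
            return False               # throttle, but do not permanently ban
        return True

    def flag_for_review(self, ip):
        print(f"[review] {ip} exceeded the rate limit; operator decision pending")

    def block(self, ip):
        """Called only after a human operator confirms malicious activity."""
        self.blocked.add(ip)
```

The point of the design is that a false positive costs a legitimate user only a slowdown, while a confirmed crawler can still be cut off entirely.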
To evaluate the approach, Kirda and his colleagues used it to scan logs from a large-scale social network, which then provided feedback on its success. The social network then deployed it in real time for a more robust evaluation, and it is currently using the tool as part of its production system. Going forward, the team expects to identify areas where the software could be evaded and make it even stronger.
Provided by Northeastern University