Schmid, A. (2010). Collecting online data anonymously : de-personalisation and security barriers on the Web [Diploma Thesis, Technische Universität Wien]. reposiTUm. http://hdl.handle.net/20.500.12708/161403
web intelligence; online market intelligence; web security; web data extraction; copyright issues; online anonymity; anonymization networks; cloud computing; wireless communication systems; blocking-resistance
de
Abstract:
Web Intelligence and effective Online Market Intelligence rely on 24/7 access to various online resources. Market monitoring processes and meta-search have to keep up with today's dynamic markets, tracking changing product offerings and price volatility. Companies whose core business is collecting and extracting data face one central question: How can I de-personalize and anonymize my online actions, in a legal way, to avoid blocking and backtracking and keep my business running? On the one hand, Denial of Service attacks and similar threats call for a higher level of IT security. The Web Security Problem focuses on online anonymity, potential attacks, and content and communication blocking technologies. On the other hand, political engagement and censorship, as well as copyright issues, play a crucial role regarding jurisdiction on the Web and the role of exclusive rights. Hence, my work also covers the question of whether a data provider is allowed to ban certain individuals from accessing information without any justification. My research primarily focuses on the technical fundamentals of Web Data Extraction and dynamic IP allocation in digital wireless communication systems, proxy systems, wired anonymization networks, and cloud computing environments. Based on an overall comparison of the most promising scientific methods, I designed conceptual architectures for implementation in a business environment. Taking elastic network address distribution (cloud computing) and mobile network address distribution (UMTS/HSPA) into account, I concentrated on the deployment of a blocking-resistant proxy solution and a blocking-resistant TOR-based relay solution. An implementation of a distributed Python/Java MultiProxy-Switcher application using GAE (Google App Engine), based on a company's business processes (Web Data Extraction in the travel and tourism domain), delivered statistical and empirical data and revealed future problems and open issues.
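
The general idea behind a proxy switcher can be illustrated with a minimal Python sketch. It is not the thesis's MultiProxy-Switcher implementation: the class `ProxySwitcher`, the helper `looks_blocked`, the placeholder proxy names, and the choice of HTTP status codes 403 and 429 as blocking signals are all assumptions made for illustration. The sketch only shows round-robin selection over a pool of proxy endpoints, skipping any endpoint that has been marked as blocked.

```python
from dataclasses import dataclass, field


@dataclass
class ProxySwitcher:
    """Hypothetical round-robin selector over proxy endpoints."""

    proxies: list
    blocked: set = field(default_factory=set)
    _index: int = 0

    def next_proxy(self):
        # Try each endpoint at most once per call, skipping blocked ones.
        for _ in range(len(self.proxies)):
            candidate = self.proxies[self._index % len(self.proxies)]
            self._index += 1
            if candidate not in self.blocked:
                return candidate
        raise RuntimeError("all proxies are marked as blocked")

    def mark_blocked(self, proxy):
        # Remember that the target site appears to reject this endpoint.
        self.blocked.add(proxy)


def looks_blocked(status_code):
    # Assumption: 403 and 429 indicate blocking; real sites vary widely.
    return status_code in (403, 429)


if __name__ == "__main__":
    # Placeholder endpoint names, not real proxies.
    switcher = ProxySwitcher(["proxy-a:8080", "proxy-b:8080", "proxy-c:8080"])
    first = switcher.next_proxy()
    if looks_blocked(429):  # pretend the first request was rate-limited
        switcher.mark_blocked(first)
    print(switcher.next_proxy())
```

In a real deployment the pool would be refreshed as new addresses become available, for example from cloud instances or mobile connections, and blocked entries would typically expire after some time rather than stay excluded permanently.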