The internet contains an enormous amount of valuable information, from product prices and financial data to news, research, and user-generated content.
But accessing this information at scale is difficult. Websites often block automated requests, limit access, or present data in formats that are hard to collect and structure.
For companies in e-commerce, finance, cybersecurity, and especially artificial intelligence, this creates a major barrier: they need vast, diverse, and constantly updated datasets, yet lack the infrastructure to gather them reliably and legally.
Oxylabs is a Lithuanian tech company that provides large-scale web data collection and proxy infrastructure. Its proxies, scraping technologies, and ready-made datasets give businesses and researchers a compliant, efficient way to tap into the public web.
In doing so, Oxylabs not only supplies the raw material that fuels innovation, from AI training to market analysis, but also sets ethical standards in an industry often criticised for misuse, so that data-driven progress can continue without compromising ethics or legality.
I spoke to Denas Grybauskas, Chief Governance and Strategy Officer at Oxylabs, to learn more.
According to Grybauskas, Oxylabs started in 2015 by renting out data centre IP addresses. He recounts:
“We quickly realised there was a real need for robust, scalable public web data aggregation infrastructure. So we kept developing. Today, we offer many products related to public web data acquisition and aggregation.”
The public web as the world’s largest dataset
Grybauskas contends that the public web is the most diverse and dynamic dataset we have.
“If we want AI systems that are fair, representative, and globally relevant, access to the public web must remain available to everyone.”
This year Oxylabs launched the world’s first ethical YouTube datasets, requiring creator consent for AI training. According to Grybauskas, “When it comes to datasets — especially YouTube datasets — we noticed that generative AI companies are very interested in video content.”
In December 2024, YouTube changed its policy to let content creators opt in to allowing third-party AI companies to train their models using YouTube videos. In response, Oxylabs decided to build a dataset by aggregating videos whose creators have opted in to AI training or that are licensed under Creative Commons.
All datasets offered by Oxylabs include videos, transcripts, and rich metadata. While such data has many potential use cases, Oxylabs refined and prepared it specifically for AI training, which is the use that the content creators have knowingly agreed to.
“Selling picks and shovels in the data gold rush”
Grybauskas contends that there’s a misconception that the internet is only about personal data:
“In reality, there are petabytes of non-personal information — like e-commerce data — that are just as important. Datasets are a tiny part of our business. Primarily, we’re an infrastructure provider. We joke internally that we’re selling picks and shovels during the gold rush.”
The company has also invested heavily in innovation, holding over 100 patents, mostly in the US. “In fact, if you look at Lithuanian companies filing US patents over the last five years, Oxylabs accounts for about 30 per cent of them. We’re very proud of our intellectual property team and our engineers who continue to innovate,” says Grybauskas.
Building an ethical industry standard
The release of ethically sourced YouTube datasets continues Oxylabs’ longtime mission to establish and promote ethical industry practices. Oxylabs also stands out for its work in creating a more ethical web and making data more accessible to not-for-profits and investigative journalists.
It’s one of the founders of the Ethical Web Data Collection Initiative, a global, industry-led group advancing responsible data aggregation. It defines best practices, promotes transparency, and helps organisations navigate the digital ecosystem ethically.
According to Grybauskas, “When we launched the initiative with the first group of companies, we wanted to show that not all scraping is bad, and that scraping companies don’t have to be associated with botnets or shady practices.”
“We published a set of principles that define what’s acceptable and what isn’t.
Over time, more companies have expressed interest in joining, but we only accept a select few. As insiders, we know which players didn’t meet the standards. That selectivity helped us become a sort of guiding light for ethical practices in the industry.”
Web data for public good
The company is also behind pro bono Project 4β, which provides access to public web data gathering infrastructure, expertise, and legal/technical advisory to researchers, journalists, NGOs, academic institutions, and organisations engaged in social-impact missions.
It lowers the barrier to high-scale web data access for people and organisations who might not have the resources to build it all themselves. Through it, Oxylabs offers free masterclasses, training, and guidance on the legal, ethical, and technical aspects of public web data gathering, and funds or advises academic and public-interest projects that tackle challenging questions requiring web data.
For example, Oxylabs collaborated with Lithuania’s Environmental Protection Department (EPD) to detect and tackle illegal environmental advertisements on Lithuanian online marketplaces. They used web crawling and scraping infrastructure to monitor listings that might violate environmental laws, such as those offering banned chemicals or protected species. It’s a powerful example of how public institutions can adopt web intelligence to enforce regulation.
In Germany, Project 4β partnered with CeMAS (Centre for Monitoring, Analysis, and Strategy), which used the Web Scraper API to monitor news articles and content relevant to extremist mobilisation (especially around Pride events and counter-protests). The scraped data helps CeMAS track the behaviour and communication of far-right groups.
Ethical scraping starts with how you source proxies
Another initiative of Oxylabs is Honeygain, a passive income app that enables users to earn money by sharing their unused internet bandwidth.
Once installed on a computer or phone, the app connects the device to Oxylabs’ proxy network, where the pooled bandwidth is used by businesses for legitimate purposes like price comparison, SEO monitoring, ad verification, and market research. Instead of relying on shady or malware-based networks, Honeygain provides a transparent, opt-in model where users are compensated for their contribution.
Grybauskas explained:
“Our infrastructure relies on proxy networks: millions of IP addresses, both data centre and residential. Some companies acquire these through malware, which is unethical. We chose a different path. We launched Honeygain, the largest passive income app of its kind.”
According to Grybauskas, “in some countries, it’s just beer money; in others, it’s a meaningful addition to income.” Users can also choose ad-free app experiences in exchange for sharing bandwidth. “Consent and compensation are central to our model.” However, in terms of residential proxies, Grybauskas admits that the company worries about competitors who don’t care about compliance.
“For example, after Russia’s full-scale invasion of Ukraine, we immediately cut ties with all Russian customers. Some of our competitors didn’t. For us, that was a moral decision. Ethical scraping involves the whole chain: how you obtain proxies, who you sell to, and how they use the data.”















