Every data science project starts with data. We need a large amount of data to train our machine learning models, and there are various ways to collect it. One of the most common is to browse websites and download the structured datasets they publish. But there are times when this data is not enough, because datasets for certain problem statements are not easily available on the web. To deal with this situation, we need to create our own datasets.
In this article, we will discuss the method of creating a custom image dataset and labeling it using Python. First, let us talk about acquiring images through web scraping.
Web scraping refers to the process of extracting data from websites: a program fetches pages from the World Wide Web and stores the extracted data on the local system. Beautiful Soup is one of the most popular Python libraries for parsing the HTML of a page, which makes it a common choice for image scraping. The requests library fetches the webpage we want to scrape.
When we inspect a picture on the webpage with the browser's developer tools, we see an image URL starting with images.pexels.com/photos, followed by a number that is unique to every photo. We can find all image URLs of this form in the page using a regex (regular expression).
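The matching step can be sketched as follows. This is a minimal illustration, not the only way to do it: the function name `extract_photo_urls` and the exact pattern are assumptions, and the pattern assumes image URLs of the form `https://images.pexels.com/photos/<number>/<file>.jpg` (or `.jpeg`/`.png`). The HTML string would typically come from `requests.get(page_url).text`.

```python
import re

# Assumed URL shape: https://images.pexels.com/photos/<number>/<file>.<ext>
# The query string (everything after "?") is deliberately left out of the match.
PHOTO_URL_PATTERN = re.compile(
    r"https://images\.pexels\.com/photos/\d+/[^\s\"'?]+\.(?:jpe?g|png)"
)


def extract_photo_urls(html):
    """Return the unique photo URLs found in an HTML string, in page order."""
    matches = PHOTO_URL_PATTERN.findall(html)
    # dict.fromkeys removes duplicates while keeping the first-seen order.
    return list(dict.fromkeys(matches))
```

The same photo often appears several times on a page at different sizes, which is why the sketch drops the query string and removes duplicates before returning the links.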
Using this method, we collect the links to the images we want to scrape. We can print the links if we want to inspect them, and create a directory to hold the images. After this, we download the images. Once the process is complete, you can find the scraped images in the directory you specified.
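A download step along these lines might look like the sketch below. The names `fetch_bytes` and `download_images`, the `image_0000.jpg` naming scheme, and the `.jpg` extension are all assumptions made for illustration; the requests library is imported inside `fetch_bytes` so the rest of the code does not depend on it.

```python
import os


def fetch_bytes(url, timeout=10):
    """Download one URL and return its raw bytes (uses the requests library)."""
    import requests  # third-party; imported here so only real downloads need it

    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # stop on HTTP errors instead of saving an error page
    return response.content


def download_images(urls, out_dir, fetch=fetch_bytes):
    """Save every URL in `urls` into `out_dir` and return the saved file paths."""
    os.makedirs(out_dir, exist_ok=True)  # create the directory if it is missing
    saved_paths = []
    for index, url in enumerate(urls):
        # Hypothetical naming scheme; assumes the images are JPEG files.
        path = os.path.join(out_dir, f"image_{index:04d}.jpg")
        with open(path, "wb") as image_file:
            image_file.write(fetch(url))
        saved_paths.append(path)
    return saved_paths
```

Passing the download function in as `fetch` keeps the file-handling logic easy to try out without touching the network, for example by substituting a function that returns fixed bytes.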
After scraping and storing the images, we need to annotate them with labels. LabelImg is used for this purpose. It is a pip-installable annotation tool that supports two annotation formats, YOLO and PASCAL VOC.
You can open LabelImg from the command prompt using the command: (base) C:\Users\Jayita>labelImg
The tool's options are listed on the left-hand side of the screen. On the right-hand side, you will see the image file information. Select ‘Open Dir’ to load all the images in a folder. Press ‘a’ to view the previous image and ‘d’ to view the next image.
To create an annotation, press ‘w’ and draw a rectangular box around the object. A window will pop up asking for the class name of the box. Once you are done drawing boxes and labeling the image, it’s time to save it. The annotations are generated when you save, in either PASCAL VOC or YOLO format, whichever you have selected.
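The two formats describe the same box differently: PASCAL VOC stores pixel corner coordinates (xmin, ymin, xmax, ymax) in an XML file, while YOLO stores one text line per box with a class index followed by the box center, width, and height, each divided by the image size. The sketch below converts one into the other. It assumes the usual VOC XML layout (`size`, `object`, `name`, `bndbox`), and the function name `voc_to_yolo` and the `class_names` list are hypothetical.

```python
import xml.etree.ElementTree as ET


def voc_to_yolo(xml_text, class_names):
    """Convert one PASCAL VOC annotation (XML text) into YOLO-format lines."""
    root = ET.fromstring(xml_text)
    image_width = float(root.find("size/width").text)
    image_height = float(root.find("size/height").text)

    lines = []
    for obj in root.findall("object"):
        # The class index is the position of the class name in `class_names`.
        class_id = class_names.index(obj.find("name").text)
        box = obj.find("bndbox")
        xmin = float(box.find("xmin").text)
        ymin = float(box.find("ymin").text)
        xmax = float(box.find("xmax").text)
        ymax = float(box.find("ymax").text)

        # YOLO uses the normalized box center and the normalized box size.
        x_center = (xmin + xmax) / 2 / image_width
        y_center = (ymin + ymax) / 2 / image_height
        box_width = (xmax - xmin) / image_width
        box_height = (ymax - ymin) / image_height
        lines.append(
            f"{class_id} {x_center:.6f} {y_center:.6f} "
            f"{box_width:.6f} {box_height:.6f}"
        )
    return lines
```

Because YOLO coordinates are relative to the image size, they remain valid if the image is later resized, whereas VOC pixel coordinates would need to be rescaled.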
One can learn about this in detail in a data science course. Web scraping and labeling are not hard once you understand the basics. You need to be careful while scraping a website and obey its rules so that you do not harm the website you are scraping. Take time to consider your requirements and research accordingly to find a suitable website for this process. For example, if you plan to develop a model for fashion, then online shopping websites should be on your scraping list.
Learning web scraping and labeling is important if you want to build a data science career. It will give you a deeper understanding of image datasets. You can use these techniques to add data in situations where the data available for a project is limited. You can also apply this process to multiple classes, as long as their images share the same folder, and get the desired results.