By Suzanne Ross, Writer, Microsoft Research
Sometimes the whole is not greater than the sum of its parts. Sometimes the whole doesn’t even represent its parts. Take a Web page for instance. Is all the text on a Web page a variation on the whole? Probably not. There might be weather reports mixed with tips on the newest hairdos, opinion pieces mixed with ads for whiter teeth, articles about national security mixed with links to vacations in Brazil.
What does this mean to you? Poor search results.
Researchers at Microsoft Research Asia have been working diligently on algorithms to fix this. Because a Web page usually contains multiple topics, ranking the search relevance on the entire page isn’t always useful. Wei-Ying Ma, the research manager for the Web Search and Mining group, said that they don’t treat a Web page as a single unit.
A single Web page contains multiple topics, and different parts of the page differ in importance. In addition, the hyperlinks often point to pages on different topics. Every Web page is made up of blocks of information. Some might match your Web search, some might not.
Search engines generally look at each Web page as a unit when assigning a page rank. If the page is viewed as a whole, the rankings might not distinguish advertising content on the page from a feature story, or a feature story from a link. Page rankings can obscure the fact that most of the page might not have relevant content while certain blocks of the text are highly relevant. That means a search engine might rank a page low in your search results even if one paragraph on that page has exactly the info you need. You’ll never find it because it’s on page ten of the search results.
“It is necessary to segment a Web page into semantically independent units or blocks so that noisy information, such as ads, can be filtered out, and multiple topics can be distinguished,” said Ma.
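To illustrate why scoring blocks separately can help, here is a minimal Python sketch. It is not the researchers' method: the `relevance` function, the sample blocks, and the query are all hypothetical, and the word-overlap measure is deliberately crude. It only shows how a relevant block can be diluted when the page is scored as one unit.

```python
def relevance(query: str, text: str) -> float:
    """Fraction of words in `text` that are query terms (a crude, assumed measure)."""
    terms = set(query.lower().split())
    words = text.lower().split()
    if not words:
        return 0.0
    return sum(1 for word in words if word in terms) / len(words)


# Hypothetical page made of three unrelated blocks.
blocks = [
    "whiter teeth in two weeks order now",
    "cheap flights to brazil book your vacation today",
    "national security briefing outlines new policy",
]
query = "national security policy"

# Score the page as a single unit, then score each block on its own.
page_score = relevance(query, " ".join(blocks))
best_block_score = max(relevance(query, block) for block in blocks)

print(f"whole page: {page_score:.2f}, best block: {best_block_score:.2f}")
```

In this toy case the third block matches the query far better than the page as a whole, because the ad and the travel link add words that have nothing to do with the query.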
The researchers found that breaking the page up using visual cues takes advantage of the characteristics of a Web page. Web pages contain a lot of visual information in HTML tags and properties. Typical visual hints are lines, blank areas, colors, pictures, and fonts. These visual cues make it easy to detect semantic regions or blocks.
They developed an algorithm called Vision-based Page Segmentation (VIPS), which takes various visual cues into account to find the content structure of a Web page. However, they found that VIPS alone didn’t completely solve the problem because it didn’t account for blocks of varying length. So they used a combined algorithm that considered both visual cues and length normalization.
Once the Web page is segmented into blocks, the researchers can assign a value to each block to determine how closely it might match your search query. They look at the position of the block on the page, because blocks closer to the center of a page are usually more important. They also look at the size of the block, since larger blocks of content will usually dominate the overall meaning of the page.
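The position and size heuristics described above could be sketched roughly as follows. This is an assumption-laden illustration, not the researchers' formula: the `Block` fields, the `importance` function, and the equal weighting of the two cues are all hypothetical, and blocks are assumed to lie entirely within the page.

```python
import math
from dataclasses import dataclass


@dataclass
class Block:
    """A rectangular region of a rendered page (hypothetical representation)."""
    x: float       # left edge, in pixels
    y: float       # top edge, in pixels
    width: float
    height: float


def importance(block: Block, page_width: float, page_height: float) -> float:
    """Toy score in [0, 1]: larger and more central blocks score higher."""
    # Size cue: fraction of the page area the block covers.
    size_score = (block.width * block.height) / (page_width * page_height)

    # Position cue: distance from the block's center to the page's center,
    # normalized by the largest possible distance (center to corner).
    block_cx = block.x + block.width / 2
    block_cy = block.y + block.height / 2
    dist = math.hypot(block_cx - page_width / 2, block_cy - page_height / 2)
    max_dist = math.hypot(page_width / 2, page_height / 2)
    center_score = 1.0 - dist / max_dist

    # Equal weighting of the two cues is an arbitrary assumption.
    return 0.5 * size_score + 0.5 * center_score


# Hypothetical 1000x1000 page with a central article and a left sidebar.
main_article = Block(x=200, y=100, width=600, height=800)
sidebar = Block(x=0, y=0, width=150, height=1000)

main_score = importance(main_article, 1000, 1000)
sidebar_score = importance(sidebar, 1000, 1000)
print(main_score > sidebar_score)
```

With these made-up dimensions the central article outscores the sidebar on both cues. A real system would presumably tune or learn the weights rather than fix them by hand.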
They also analyze the links on a page to determine block importance. If a link is a navigational link, or a link to an advertisement, the system assigns lower importance to the block that contains it. This helps remove ‘noisy’ information such as ads, menus, and decoration from the page ranking.
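One simple way to picture this link-based adjustment is to scale a block's score by the share of its links that are not noisy. The sketch below is only an assumption about how such a penalty might look: the link labels, `NOISY_LINK_KINDS`, and `adjust_for_links` are hypothetical, and it takes for granted that each link has already been classified by some other step.

```python
# Link categories treated as noise (hypothetical labels).
NOISY_LINK_KINDS = {"navigation", "advertisement"}


def adjust_for_links(base_score: float, link_kinds: list[str]) -> float:
    """Scale a block's score down by the fraction of its links that are noisy."""
    if not link_kinds:
        # A block without links is left unchanged.
        return base_score
    noisy = sum(1 for kind in link_kinds if kind in NOISY_LINK_KINDS)
    return base_score * (1.0 - noisy / len(link_kinds))


# A menu block made up entirely of navigation links drops to zero,
# while a story block with one ad link among two links is halved.
menu_score = adjust_for_links(0.8, ["navigation", "navigation", "navigation"])
story_score = adjust_for_links(0.8, ["content", "advertisement"])
print(menu_score, story_score)
```

A proportional penalty is just one option. The article does not say how much weight the prototype gives to link analysis relative to layout.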
Though this is still a prototype, they have gotten good results from their initial research. By analyzing the page-to-block relationship, or page layout, and the block-to-page relationship, which is link analysis, they can significantly improve the results you get back on a search query.