Profile of a Campus














Campus Concept Network


The Campus Concept Network is an experimental project that mines the University's Web presence to explore how this traditionally untapped data source can be used as a profiling mechanism and can help suggest shared research areas as possible avenues for cross-campus collaboration. The initial stage of the project involved crawling the 100 most popular pages from each of the 307 units with Web sites listed by the Division of Management Information; one or more pages were successfully downloaded from 272 of those sites. These pages were processed using text mining techniques to extract a list of the names and concepts appearing on each page. The resulting list of 109,253 unique names and concepts was used to construct a single "connectivity network" with 637,971 links relating units on campus through shared names and concepts.
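The idea of relating units through shared names and concepts can be illustrated with a minimal Python sketch. It is not the project's actual code: the function name `build_connectivity_network`, the input format, and the toy data are all assumptions made for illustration, and the project may have counted its links differently.

```python
from collections import defaultdict
from itertools import combinations


def build_connectivity_network(unit_concepts):
    """Link every pair of units that share at least one concept or name.

    unit_concepts: dict mapping a unit name to the set of concepts and
    names extracted from its pages (hypothetical input format).
    Returns a dict mapping (unit_a, unit_b) -> set of shared concepts.
    """
    # Invert the mapping: concept -> units that mention it.
    concept_units = defaultdict(set)
    for unit, concepts in unit_concepts.items():
        for concept in concepts:
            concept_units[concept].add(unit)

    # A concept mentioned by two or more units links every pair of them.
    links = defaultdict(set)
    for concept, units in concept_units.items():
        for unit_a, unit_b in combinations(sorted(units), 2):
            links[(unit_a, unit_b)].add(concept)
    return dict(links)


# Toy data, invented for illustration only.
sample = {
    "Physics": {"simulation", "optics"},
    "Computer Science": {"simulation", "text mining"},
    "Library": {"text mining"},
}
network = build_connectivity_network(sample)
for pair, shared in sorted(network.items()):
    print(pair, sorted(shared))
```

In this sketch a concept that appears on only one unit's pages, such as "optics" above, contributes no link.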


The results on this page are based on an analysis of the 100 most popular pages on each site, as measured by Google. The Google PageRank algorithm assigns each page a rank based on how many pages link to it and on the "authority" of those linking pages, which is in turn based on the number of pages that link to them. This recursive definition eventually settles on a ranking for each page that measures its overall "popularity" on the Web. The list of the 100 most popular pages from each unit's Web site was gathered manually from Google, and that set of URLs was used for the analysis.
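The recursive nature of this ranking can be seen in a small Python sketch of the textbook PageRank iteration. This is a simplified illustration, not Google's production ranking and not code from this project: the function name, the toy link graph, the damping factor of 0.85, and the fixed iteration count are all assumptions.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Textbook PageRank by repeated redistribution of rank.

    links: dict mapping a page to the list of pages it links to
    (hypothetical input format).
    Returns a dict mapping each page to its rank; the ranks sum to 1.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page starts each round with a small baseline share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank on equally to the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank over all pages.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Toy link graph, invented for illustration only.
toy_links = {
    "home": ["about", "research"],
    "about": ["home"],
    "research": ["home"],
}
ranks = pagerank(toy_links)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In the toy graph, "home" ends up with the highest rank because both other pages link to it, while "about" and "research" tie.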


The decision to include just the top 100 pages from each site, rather than the entirety of each unit's Web presence, was made to bias the results toward public opinion of each unit. Examining only the top 100 pages is intended to produce an approximation of the external impression of the unit and its core competencies. Future work on this focus section will include repeating the process using all pages on each unit's site, which will generate a more comprehensive network that takes into account all of a unit's online material. In both the current and future versions, however, the results of this section will still rely entirely on the material that a unit puts on its Web site. If the resulting concept list for a unit does not match its actual activities, that may indicate that the material on its Web site does not accurately reflect those activities.


For those interested in the technical specifics of the network creation, the HTML text of each page was fed through a stripped-down version of the VIAS Page Extraction Module (PEM), which processes a page and attempts to identify its core body text, stripping away navigation bars, unrelated insets, common headers and footers, and other content that is not part of the page's core content. For the initial stage of this section of the project, certain non-core page elements needed to be retained for later analysis. Several options were therefore disabled in the VIAS PEM, and the resulting page extractions are not as aggressive as they would usually be. The resulting text files were then processed by a part-of-speech tagger, and the tagged text was processed by a noun phrase extraction module. Again, to facilitate certain later analysis, only limited preprocessing of the tagged text was performed, and the NP extractor was run with relaxed matching constraints.
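The final step of this pipeline, pulling noun phrases out of part-of-speech-tagged text, can be sketched in a few lines of Python. This is not the project's NP extractor, whose rules are not described here. It assumes Penn Treebank style tags (JJ for adjectives, NN for nouns) and a deliberately simple pattern, and the hand-tagged example sentence is invented.

```python
def extract_noun_phrases(tagged_tokens):
    """Return simple noun phrases from a list of (word, tag) pairs.

    A phrase here is a maximal run of adjectives (JJ*) and nouns (NN*),
    trimmed so that it ends in a noun. This is an assumed, simplified
    pattern, not the rule set used by the project.
    """
    phrases = []
    run = []

    def flush():
        # Drop trailing adjectives so that the phrase ends in a noun.
        while run and not run[-1][1].startswith("NN"):
            run.pop()
        if run:
            phrases.append(" ".join(word for word, _ in run))
        run.clear()

    for word, tag in tagged_tokens:
        if tag.startswith("JJ") or tag.startswith("NN"):
            run.append((word, tag))
        else:
            # Any other tag ends the current candidate phrase.
            flush()
    flush()
    return phrases


# Hand-tagged example sentence, invented for illustration only.
tagged = [
    ("The", "DT"), ("campus", "NN"), ("network", "NN"),
    ("links", "VBZ"), ("shared", "JJ"), ("research", "NN"),
    ("areas", "NNS"), (".", "."),
]
print(extract_noun_phrases(tagged))
# prints ['campus network', 'shared research areas']
```

A relaxed extractor of the kind mentioned above would presumably accept more patterns than this one, at the cost of admitting more noise.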


Campus-Wide Network

The extremely large size of the campus-wide conceptual network made it necessary to provide only a limited portion of it on the Web. A selection of fifty concepts and names was chosen at random by a computer program, and the units connected by each concept or name are listed beside it.
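A random selection of this kind can be sketched in Python with the standard `random` module. The function name `sample_concepts`, the concept-to-units mapping, and the optional seed are assumptions for illustration; the program actually used is not described here.

```python
import random


def sample_concepts(concept_units, k=50, seed=None):
    """Pick up to k concepts at random and pair each with its units.

    concept_units: dict mapping a concept or name to the set of units
    whose pages mention it (hypothetical input format).
    Returns a list of (concept, sorted list of units) pairs.
    """
    rng = random.Random(seed)
    # Sort first so that a given seed always yields the same selection.
    concepts = sorted(concept_units)
    chosen = rng.sample(concepts, min(k, len(concepts)))
    return [(concept, sorted(concept_units[concept])) for concept in chosen]


# Toy data, invented for illustration only.
toy = {
    "simulation": {"Physics", "Computer Science"},
    "text mining": {"Computer Science", "Library"},
    "archives": {"Library", "History"},
}
for concept, units in sample_concepts(toy, k=2, seed=1):
    print(concept + ":", ", ".join(units))
```

Passing a seed makes the selection repeatable, which helps if the published list needs to be regenerated later.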

Individual Unit Concept and Name Lists