Machine Learning and Natural Language Processing

NORC uses natural language processing and machine learning methods to analyze and visualize text and data and to develop advanced tools for research.

Representative Projects

Assessing PFCE through Parent Report: Analysis of Parent Narratives and Creation of a Parent Report PFCE Measurement Tool. In collaboration with the National Head Start Association (NHSA), NORC will conduct a content analysis of original narratives describing parents’ Parent, Family, and Community Engagement (PFCE) experiences, within the Framework outlined by the Office of Head Start (OHS).

Conversion of Criminal History Records into Research Databases (CCHRRD) and Criminal History Record Assessment and Research Program (CHRARP). For years, the Bureau of Justice Statistics (BJS) has used information stored in the nation’s automated criminal history records to assess the officially recognized, law-violating behavior of various samples of individuals. To conduct recidivism studies, BJS provided state criminal history repositories with identifying information on study subjects and asked each participating repository to extract selected information on each subject’s criminal justice activities. Because the structure and content of the extracted data vary from state to state, transforming each state’s data into a commonly formatted, researchable database required a significant amount of manual review and coding.

Identifying early Twitter marketing of Electronic Nicotine Delivery Systems (ENDS). With support from the National Cancer Institute (Grant No. U01 CA154254), the Health Media Collaboratory (HMC) continues to explore tobacco-related messages in new media such as Twitter and YouTube. Use of ENDS (more commonly referred to as e-cigarettes) is frequently discussed on Twitter as these devices continue to be widely used and the policy environment remains in flux. HMC has applied machine learning techniques designed to efficiently and accurately identify tweets that are relevant to ENDS and to flag promotional tweets that market ENDS products. To address concerns about possible loopholes being exploited through aggressive product claims or marketing of nicotine-based products to minors on social media, HMC has continued this investigation of ENDS-related social media messages using supervised and unsupervised machine learning techniques.
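As a rough illustration of what a supervised relevance classifier does, the sketch below trains a small multinomial naive Bayes model on a handful of labeled messages and then labels new text as relevant or irrelevant. It is not HMC's method: the choice of naive Bayes, the function names, the labels, and the example messages are all assumptions made up for illustration, and a real classifier would be trained on a much larger set of human-coded tweets.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


def train(examples):
    """Count words per label from (text, label) pairs."""
    word_counts = {}          # label -> Counter of token frequencies
    label_counts = Counter()  # label -> number of training messages
    vocab = set()
    for text, label in examples:
        label_counts[label] += 1
        counts = word_counts.setdefault(label, Counter())
        for tok in tokenize(text):
            counts[tok] += 1
            vocab.add(tok)
    return word_counts, label_counts, vocab


def predict(model, text):
    """Return the label with the highest log-probability for the text."""
    word_counts, label_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in label_counts.items():
        score = math.log(n / total)  # class prior
        # Add-one (Laplace) smoothing over the shared vocabulary.
        denom = sum(word_counts[label].values()) + len(vocab)
        for tok in tokenize(text):
            if tok in vocab:  # tokens never seen in training are skipped
                score += math.log((word_counts[label][tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Invented example messages; not real tweets or real study data.
TRAIN = [
    ("new vape flavors on sale buy now", "relevant"),
    ("trying to quit smoking with an ecig", "relevant"),
    ("ecig starter kit discount code", "relevant"),
    ("great game last night", "irrelevant"),
    ("traffic is terrible this morning", "irrelevant"),
    ("watching a movie with friends tonight", "irrelevant"),
]

model = train(TRAIN)
print(predict(model, "ecig sale discount"))
print(predict(model, "movie night with friends"))
```

The same pattern could be repeated with a second set of labels (for example, promotional versus non-promotional) to separate marketing messages from other ENDS-related tweets.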

More Details on HMC’s Development of Relevance Classifiers. Little cigars and cigarillos (LCCs) are an understudied domain in tobacco control and are particularly interesting because of the strategic and targeted marketing used to promote these products to youth and communities of color. Characterizing the role of new media platforms like Twitter in tobacco product marketing and countermarketing is critically important because these platforms largely remain under the radar of tobacco control policymakers and are not currently covered by the advertising restrictions that apply to outdoor and television advertising.

Sex Trafficking Operations Portal (STOP). NORC designed the STOP application to gather adult escort ads from various websites, parse and analyze the information within those sites, and display the results to end users (law enforcement officers or their designees).
