R, Data, and More

Welcome to Guanglai Li's Portfolio


R package development: totalcensus package

The R totalcensus package makes it easy to find out the age of people in your neighborhood, the house prices in your town, and the ethnic composition of your city. It extracts data directly from the original summary files of the Decennial Censuses and American Community Surveys, so you won't miss any data the Census Bureau offers in those files. The summary files contain thousands of social, economic, and demographic variables for dozens of geographic entities, all of which can be quickly explored with the totalcensus package.

Available census geographic entities

R package development: ggtiger package

This package draws TIGER census boundaries on ggmap maps with a single function, geom_boundary(), and plots census data on the map with geom_census(). As extensions to ggplot2, geom_boundary() and geom_census() work similarly to native ggplot2 geom_xxxx() functions. The package currently draws boundaries of states, counties, county subdivisions, tracts, block groups, zip code tabulation areas, and congressional districts. More geographies are being added.

Gerrymandering of congressional districts

Shiny visualization: China census 2010

Want to learn about China's history and society from reliable data? China has conducted a national census covering its entire population every ten years since 1990, collecting data such as age, gender, education, and housing. The most recent one was carried out on November 1, 2010. The official summary of the census is published on the website of the National Bureau of Statistics of China. This app conveys the information essential to understanding China through interactive visualization.

Why are there big dips?

Predict medical specialties of clinical notes

A full machine learning project, from data collection to model deployment. Clinical notes were scraped from mtsamples.com. Medical named entities were extracted with Amazon Comprehend Medical and medaCy. Features were extracted with the term frequency-inverse document frequency (TF-IDF) technique. The clinical notes were analyzed with principal component analysis, hierarchical clustering, and K-means clustering. Predictive models were built with support vector machines, XGBoost, and neural networks. The best-performing support vector machine model was deployed to predict the specialties of new clinical notes.

Confusion matrix of the deployed model
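
To make the TF-IDF feature extraction step mentioned above concrete, here is a minimal sketch in plain Python. It is not the project's code: the function name tfidf, the three toy notes, and the exact weighting (term count divided by note length, times the unsmoothed log of the inverse document frequency) are assumptions made for illustration. Library implementations often apply smoothing and normalization, so their numbers would differ.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document (hypothetical helper)."""
    # Naive whitespace tokenization; a real pipeline would clean the text more.
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: the number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        # Weight = relative term frequency * log(N / document frequency).
        vectors.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


# Toy notes invented for this example, not taken from mtsamples.com.
notes = [
    "patient reports chest pain",
    "patient reports knee pain",
    "chest x-ray shows clear lungs",
]
weights = tfidf(notes)
```

In this sketch a term that appears in only one note, such as "knee", receives a higher weight than a term shared across notes, such as "pain". That is the property that makes TF-IDF vectors useful as input to clustering and classification models.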

Predict next words

This is the capstone project of the online Data Science Specialization offered by Johns Hopkins University. In this project we build a web application that predicts the next words after each keystroke. Applications of this kind are widely used for text input on mobile devices such as cell phones and tablets.
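
The blurb above does not say which model the application uses. As one simple way to predict the next word, the sketch below counts which words follow each word in a corpus (a bigram model) and suggests the most frequent followers. The function names train_bigrams and predict_next, the parameter k, and the three-sentence corpus are all hypothetical and serve only as illustration.

```python
from collections import Counter, defaultdict


def train_bigrams(corpus):
    """Count, for each word, the words that immediately follow it."""
    followers = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        for current, following in zip(words, words[1:]):
            followers[current][following] += 1
    return followers


def predict_next(followers, text, k=3):
    """Return up to k of the most frequent followers of the last word typed."""
    words = text.lower().split()
    if not words or words[-1] not in followers:
        return []  # nothing typed yet, or the last word was never seen
    return [word for word, _ in followers[words[-1]].most_common(k)]


# Toy corpus invented for this example.
corpus = [
    "thank you very much",
    "thank you for the help",
    "see you very soon",
]
model = train_bigrams(corpus)
suggestions = predict_next(model, "Thank you")
```

Here suggestions lists "very" before "for", because "very" follows "you" twice in the toy corpus and "for" only once. A practical predictor would typically use longer n-grams with a fallback to shorter ones when a context has not been seen.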

Study Notes

ggplot2 minimal examples

These minimal examples help you understand the key concepts of ggplot2 and provide sample code for generating specific plot features. The collection contains over 50 minimal examples, covering a wide range of features discussed in Hadley Wickham's book ggplot2: Elegant Graphics for Data Analysis (Springer, 2nd Edition).

Shiny: click on one figure to get another

A common task in interactive plotting is to click on figures to extract more information. Shiny is a handy tool for this kind of work. This interactive document demonstrates and explains how to click on a figure to get another using the data from the clicked location.

Shiny: click on a map

In this note, we continue the discussion of Shiny click with one more example: clicking on a map to get something else. The purpose is to further understand the click event and its applications. Shiny click provides a flexible tool for interactive plotting with maps.