沈梦圆的博客 怕什么真理无穷, 进一寸有一寸的欢喜。

【R】《R for Data Science》学习笔记-数据探索篇


本篇的学习目的是快速掌握数据探索的工具。观察数据、提出假设、快速检验,重复、重复、重复。提出越多假设,探索数据就更为深入。

data-science-explore

Data visualisation

R有好几个绘图系统,我们学习用ggplot2进行数据的可视化,因为它最为优雅且全能的。如果想要学它的绘图理论可以看 ”The Layered Grammar of Graphics“。

Question

  • 汽车发动机越大越耗油不?
  • 发动机的大小和油耗是啥关系?正相关?负相关?线性的?非线性的?
# install.packages("tidyverse")
library(tidyverse)
# -- Attaching packages -------------------------------------------------------------------------- tidyverse 1.2.1 --
# √ ggplot2 2.2.1     √ purrr   0.2.4
# √ tibble  1.3.4     √ dplyr   0.7.4
# √ tidyr   0.7.2     √ stringr 1.2.0
# √ readr   1.1.1     √ forcats 0.2.0
# -- Conflicts ----------------------------------------------------------------------------- tidyverse_conflicts() --
#   x dplyr::filter() masks stats::filter()
# x dplyr::lag()    masks stats::lag()

tidyverse能够帮助我们安装和加载一系列数据处理的包,方便快捷。

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy))

发动机大小和油耗直接的散点图,可以知道发动机越大,油耗相对越小。

ggplot(data = <DATA>) + 
  <GEOM_FUNCTION>(mapping = aes(<MAPPINGS>))

以上是绘图模板。

Aesthetic

  • 连续型变量只能aesthetic是连续型的属性 color,size,但 shape就不可以(它是非连续型变量);
  • stroke aesthetic 能够改变点的大小(geom_point);
  • aes(colour = displ < 5)则会根据displ判断是否小于5产生F/T逻辑值来分配颜色;

Facets

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) + 
  facet_wrap(~ class, nrow = 2)
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) + 
  facet_grid(drv ~ cyl)

This is generally a better use of screen space than facet_grid because most displays are roughly rectangular.

facet_grid(class ~ . )class在左边表示按行分面,如果在右边就是按列分面;

Geometric objects

  • ggplot2扩展Y叔的ggtree也在里面;
  • show.legend = FALSE可以把右边图标给去除;

Stat

  • In our proportion bar chart, we need to set group = 1. Why? In other words what is the problem with these two graphs?
    ggplot(data = diamonds) + 
    geom_bar(mapping = aes(x = cut, y = ..prop.., group = 1))
    ggplot(data = diamonds) + 
    geom_bar(mapping = aes(x = cut, y = ..prop..))
    ggplot(data = diamonds) + 
    geom_bar(mapping = aes(x = cut, fill = color, y = ..prop..))
    
  • geom_colgeom_bar的区别

Position

  • position = "identity"
  • position = "fill"
  • position = "dodge"
  • geom_point(position = "jitter"): geom_jitter()
  • ?position_dodge, ?position_fill, ?position_identity, ?position_jitter, and ?position_stack

Coordinate systems

  • coord_flip() switches the x and y axes
  • coord_quickmap() sets the aspect ratio correctly for maps
  • coord_polar() uses polar coordinates

Summary

ggplot(data = <DATA>) + 
  <GEOM_FUNCTION>(
     mapping = aes(<MAPPINGS>),
     stat = <STAT>, 
     position = <POSITION>
  ) +
  <COORDINATE_FUNCTION> +
  <FACET_FUNCTION>

Workflow: basics

  • Cmd/Ctrl + ↑查找代码记录
  • Alt + Shift + K保存代码

Data transformation

library(nycflights13)
library(tidyverse)
  • int stands for integers.
  • dbl stands for doubles, or real numbers.
  • chr stands for character vectors, or strings.
  • dttm stands for date-times (a date + a time).
  • lgl stands for logical, vectors that contain only TRUE or FALSE.
  • fctr stands for factors, which R uses to represent categorical variables with fixed possible values.
  • date stands for dates.

dplyr basics

  • Pick observations by their values (filter()).

  • Reorder the rows (arrange()).

  • Pick variables by their names (select()).

  • Create new variables with functions of existing variables (mutate()).

  • Collapse many values down to a single summary (summarise()).

  • group_by() changes the scope of each function from operating on the entire dataset to operating on it group-by-group.

  • Every number you see is an approximation. Instead of relying on ==, use near().

  • Logical operators:

    logicalx is the left-hand circle, y is the right-hand circle, and the shaded region show which parts each operator selects

filter(flights, month == 11 | month == 12)
nov_dec <- filter(flights, month %in% c(11, 12))
filter(flights,year %in% c(2013,2014)
filter(flights,between(year,2013,2014)
filter(flights,year == 2013 | year == 2014)
  • order:arrange() ,desc()
  • select()choose variables:starts_with("abc"),ends_with("xyz"),contains("ijk"),matches("(.)\\1"),num_range("x", 1:3) matches x1, x2 and x3

  • rename(): rename variables

  • move time_hour/air_timeto the start of the data frame:select(flights, time_hour, air_time, everything())

  • comtains()不区分大小写(不知道怎么解决)

  • Add new variables with mutate()

  • only keep the new variables, use transmute()

  • log(), log2(), log10():对数是处理跨越多个数量级的数据的非常有用的转换。他们还将乘法关系转换为可加性,这是我们将在建模中回归的一个特征

  • Offsets: lead() and lag()(暂时没有体会)

  • Cumulative and rolling aggregates: R provides functions for running sums, products, mins and maxes: cumsum(), cumprod(), cummin(), cummax(); and dplyr provides cummean();RcppRoll package

  • Ranking:min_rank()(排序的序号?),row_number(), dense_rank(), percent_rank(), cume_dist(), ntile()

  • Grouped summaries with summarise(),group_by()

  • Combining multiple operations with the pipe:
  by_dest <- group_by(flights, dest)
  delay <- summarise(by_dest,
    count = n(),
    dist = mean(distance, na.rm = TRUE),
    delay = mean(arr_delay, na.rm = TRUE)
  )
  delay <- filter(delay, count > 20, dest != "HNL")

以上等同于,下面的pipe:

delays <- flights %>% 
  group_by(dest) %>% 
  summarise(
    count = n(),
    dist = mean(distance, na.rm = TRUE),
    delay = mean(arr_delay, na.rm = TRUE)
  ) %>% 
  filter(count > 20, dest != "HNL")

  # It looks like delays increase with distance up to ~750 miles 
  # and then decrease. Maybe as flights get longer there's more 
  # ability to make up delays in the air?
  ggplot(data = delay, mapping = aes(x = dist, y = delay)) +
    geom_point(aes(size = count), alpha = 1/3) +
    geom_smooth(se = FALSE)
  #> `geom_smooth()` using method = 'loess'

Similar Posts

Comments